Search | arXiv e-print repository

Universal Gradient Descent Ascent Method for Nonconvex-Nonconcave Minimax Optimization

Authors: Taoli Zheng, Linglingzhi Zhu, Anthony Man-Cho So, Jose Blanchet, Jiajin Li

Abstract: Nonconvex-nonconcave minimax optimization has received intense attention over the last decade due to its broad applications in machine learning. Most existing algorithms rely on one-sided information, such as the convexity (resp. concavity) of the primal (resp. dual) functions, or other specific structures, such as the Polyak-Łojasiewicz (PŁ) and Kurdyka-Łojasiewicz (KŁ) conditions. However, verifying these regularity conditions is challenging in practice. To meet this challenge, we propose a novel universally applicable single-loop algorithm, the doubly smoothed gradient descent ascent method (DS-GDA), which naturally balances the primal and dual updates. That is, DS-GDA with the same hyperparameters is able to uniformly solve nonconvex-concave, convex-nonconcave, and nonconvex-nonconcave problems with one-sided KŁ properties, achieving convergence with $\mathcal{O}(\epsilon^{-4})$ complexity. Sharper (even optimal) iteration complexity can be obtained when the KŁ exponent is known. Specifically, under the one-sided KŁ condition with exponent $\theta\in(0,1)$, DS-GDA converges with an iteration complexity of $\mathcal{O}(\epsilon^{-2\max\{2\theta,1\}})$. These rates all match the corresponding best results in the literature. Moreover, we show that DS-GDA is practically applicable to general nonconvex-nonconcave problems even without any regularity conditions, such as the PŁ condition, KŁ condition, or weak Minty variational inequalities condition.
For various challenging nonconvex-nonconcave examples in the literature, including "Forsaken", "Bilinearly-coupled minimax", "Sixth-order polynomial", and "PolarGame", the proposed DS-GDA gets rid of limit cycles in every case. To the best of our knowledge, this is the first first-order algorithm to achieve convergence on all of these formidable problems.

Submitted 30 October, 2023; v1 submitted 25 December, 2022; originally announced December 2022.

arXiv:2211.10314 [pdf, other]

Prediction scoring of data-driven discoveries for reproducible research

Authors: Anna L. Smith, Tian Zheng, Andrew Gelman

Abstract: Predictive modeling uncovers knowledge and insights regarding a hypothesized data generating mechanism (DGM). Results from different studies on a complex DGM, derived from different data sets, and using complicated models and algorithms, are hard to quantitatively compare due to random noise and statistical uncertainty in model results. This has been one of the main contributors to the replication crisis in the behavioral sciences. The contribution of this paper is to apply prediction scoring to the problem of comparing two studies, such as can arise when evaluating replications or competing evidence. We examine the role of predictive models in quantitatively assessing agreement between two datasets that are assumed to come from two distinct DGMs. We formalize a distance between the DGMs that is estimated using cross validation. We argue that the resulting prediction scores depend on the predictive models created by cross validation. In this sense, the prediction scores measure the distance between DGMs, along the dimension of the particular predictive model. Using human behavior data from experimental economics, we demonstrate that prediction scores can be used to evaluate preregistered hypotheses and provide insights comparing data from different populations and settings. We examine the asymptotic behavior of the prediction scores using simulated experimental data and demonstrate that leveraging competing predictive models can reveal important differences between underlying DGMs.
Our proposed cross-validated prediction scores are capable of quantifying differences between unobserved data generating mechanisms and allow for the validation and assessment of results from complex models.

Submitted 18 November, 2022; originally announced November 2022.

arXiv:2209.04991 [pdf, other]

Wasserstein Distributional Learning

Authors: Chengliang Tang, Nathan Lenssen, Ying Wei, Tian Zheng

Abstract: Learning conditional densities and identifying factors that influence the entire distribution are vital tasks in data-driven applications. Conventional approaches work mostly with summary statistics, and are hence inadequate for a comprehensive investigation. Recently, there have been developments on functional regression methods to model density curves as functional outcomes. A major challenge for developing such models lies in the inherent constraint of non-negativity and unit integral for the functional space of density outcomes. To overcome this fundamental issue, we propose Wasserstein Distributional Learning (WDL), a flexible density-on-scalar regression modeling framework that starts with the Wasserstein distance $W_2$ as a proper metric for the space of density outcomes. We then introduce a heterogeneous and flexible class of Semi-parametric Conditional Gaussian Mixture Models (SCGMM) as the model class $\mathfrak{F} \otimes \mathcal{T}$. The resulting metric space $(\mathfrak{F} \otimes \mathcal{T}, W_2)$ satisfies the required constraints and offers a dense and closed functional subspace. For fitting the proposed model, we further develop an efficient algorithm based on Majorization-Minimization optimization with boosted trees. Compared with methods in the previous literature, WDL better characterizes and uncovers the nonlinear dependence of the conditional densities, and their derived summary statistics. We demonstrate the effectiveness of the WDL framework through simulations and real-world applications.

Submitted 11 September, 2022; originally announced September 2022.

arXiv:2205.07384 [pdf, other]

Incorporating Prior Knowledge into Neural Networks through an Implicit Composite Kernel

Authors: Ziyang Jiang, Tongshu Zheng, Yiling Liu, David Carlson

Abstract: It is challenging to guide neural network (NN) learning with prior knowledge. In contrast, many known properties, such as spatial smoothness or seasonality, are straightforward to model by choosing an appropriate kernel in a Gaussian process (GP). Many deep learning applications could be enhanced by modeling such known properties. For example, convolutional neural networks (CNNs) are frequently used in remote sensing, which is subject to strong seasonal effects. We propose to blend the strengths of deep learning and the clear modeling capabilities of GPs by using a composite kernel that combines a kernel implicitly defined by a neural network with a second kernel function chosen to model known properties (e.g., seasonality). We implement this idea by combining a deep network and an efficient mapping based on the Nyström approximation, which we call Implicit Composite Kernel (ICK). We then adopt a sample-then-optimize approach to approximate the full GP posterior distribution. We demonstrate that ICK has superior performance and flexibility on both synthetic and real-world data sets. We believe that the ICK framework can be used to incorporate prior information into neural networks in many applications.

Submitted 28 February, 2024; v1 submitted 15 May, 2022; originally announced May 2022.

Comments: 27 pages, 13 figures, 5 tables, 3 algorithms, published in Transactions on Machine Learning Research (TMLR)

ACM Class: I.5.1

arXiv:2112.03270 [pdf, other]

Toward a Taxonomy of Trust for Probabilistic Machine Learning

Authors: Tamara Broderick, Andrew Gelman, Rachael Meager, Anna L. Smith, Tian Zheng

Abstract: Probabilistic machine learning increasingly informs critical decisions in medicine, economics, politics, and beyond. We need evidence to support that the resulting decisions are well-founded. To aid development of trust in these decisions, we develop a taxonomy delineating where trust in an analysis can break down: (1) in the translation of real-world goals to goals on a particular set of available training data, (2) in the translation of abstract goals on the training data to a concrete mathematical problem, (3) in the use of an algorithm to solve the stated mathematical problem, and (4) in the use of a particular code implementation of the chosen algorithm. We detail how trust can fail at each step and illustrate our taxonomy with two case studies: an analysis of the efficacy of microcredit and The Economist's predictions of the 2020 US presidential election. Finally, we describe a wide variety of methods that can be used to increase trust at each step of our taxonomy. The use of our taxonomy highlights steps where existing research work on trust tends to concentrate and also steps where establishing trust is particularly challenging.

Submitted 5 December, 2021; originally announced December 2021.

Comments: 18 pages, 2 figures

arXiv:2106.01485 [pdf, other]

Weakly Supervised Learning Creates a Fusion of Modeling Cultures

Authors: Chengliang Tang, Gan Yuan, Tian Zheng

Abstract: The past two decades have witnessed the great success of the algorithmic modeling framework advocated by Breiman et al. (2001). Nevertheless, the excellent prediction performance of these black-box models relies heavily on the availability of strong supervision, i.e., a large set of accurate and exact ground-truth labels. In practice, strong supervision can be unavailable or expensive, which calls for modeling techniques under weak supervision. In this comment, we summarize the key concepts in weakly supervised learning and discuss some recent developments in the field. Using algorithmic modeling alone under weak supervision might lead to unstable and misleading results. A promising direction would be integrating the data modeling culture into such a framework.

Submitted 2 June, 2021; originally announced June 2021.

arXiv:2105.05532 [pdf, other]

Generalized Autoregressive Moving Average Models with GARCH Errors

Authors: Tingguo Zheng, Han Xiao, Rong Chen

Abstract: One of the important and widely used classes of models for non-Gaussian time series is the generalized autoregressive moving average (GARMA) models, which specify an ARMA structure for the conditional mean process of the underlying time series. However, in many applications one often encounters conditional heteroskedasticity. In this paper we propose a new class of models, referred to as GARMA-GARCH models, that jointly specify both the conditional mean and conditional variance processes of a general non-Gaussian time series. Under the general modeling framework, we propose three specific models, as examples, for proportional time series, nonnegative time series, and skewed and heavy-tailed financial time series. Maximum likelihood estimator (MLE) and quasi Gaussian MLE (GMLE) are used to estimate the parameters. Simulation studies and three applications are used to demonstrate the properties of the models and the estimation procedures.

Submitted 12 May, 2021; originally announced May 2021.

arXiv:2012.09598 [pdf, other]

Network Hawkes Process Models for Exploring Latent Hierarchy in Social Animal Interactions

Authors: Owen G. Ward, Jing Wu, Tian Zheng, Anna L. Smith, James P. Curley

Abstract: Group-based social dominance hierarchies are of essential interest in animal behavior research. Studies often record aggressive interactions observed over time, and models that can capture such dynamic hierarchy are therefore crucial. Traditional ranking methods summarize interactions across time, using only aggregate counts. Instead, we take advantage of the interaction timestamps, proposing a series of network point process models with latent ranks. We carefully design these models to incorporate important characteristics of animal interaction data, including the winner effect, bursting and pair-flip phenomena. Through iteratively constructing and evaluating these models we arrive at the final cohort Markov-Modulated Hawkes process (C-MMHP), which best characterizes all aforementioned patterns observed in interaction data. We compare all models using simulated and real data. Using statistically developed diagnostic perspectives, we demonstrate that the C-MMHP model outperforms other methods, capturing relevant latent ranking structures that lead to meaningful predictions for real data.

Submitted 16 July, 2022; v1 submitted 17 December, 2020; originally announced December 2020.

Comments: To appear in Journal of the Royal Statistical Society, Series C

arXiv:2009.01742 [pdf, other]

Online Estimation and Community Detection of Network Point Processes for Event Streams

Authors: Guanhua Fang, Owen G. Ward, Tian Zheng

Abstract: A common goal in network modeling is to uncover the latent community structure present among nodes. For many real-world networks, the true connections consist of events arriving as streams, which are then aggregated to form edges, ignoring the dynamic temporal component. A natural way to take account of these temporal dynamics of interactions is to use point processes as the foundation of network models for community detection. Computational complexity hampers the scalability of such approaches to large sparse networks. To circumvent this challenge, we propose a fast online variational inference algorithm for estimating the latent structure underlying dynamic event arrivals on a network, using continuous-time point process latent network models. We describe this procedure for network models capturing community structure. This structure can be learned as new events are observed on the network, updating the inferred community assignments. We investigate the theoretical properties of such an inference scheme, and provide regret bounds on the loss function of this procedure. The proposed inference procedure is then thoroughly compared, using both simulation studies and real data, to non-online variants. We demonstrate that online inference can obtain comparable performance, in terms of community recovery, to non-online variants, while realising computational gains. Our proposed inference framework can also be readily modified to incorporate other popular network structures.

Submitted 26 October, 2023; v1 submitted 3 September, 2020; originally announced September 2020.

Comments: 45 pages

arXiv:2007.05385 [pdf, ps, other]

doi 10.1002/sam.11486

Next Waves in Veridical Network Embedding

Authors: Owen G. Ward, Zhen Huang, Andrew Davison, Tian Zheng

Abstract: Embedding nodes of a large network into a metric (e.g., Euclidean) space has become an area of active research in statistical machine learning, which has found applications in natural and social sciences. Generally, a representation of a network object is learned in a Euclidean geometry and is then used for subsequent tasks regarding the nodes and/or edges of the network, such as community detection, node classification and link prediction. Network embedding algorithms have been proposed in multiple disciplines, often with domain-specific notations and details. In addition, different measures and tools have been adopted to evaluate and compare the methods proposed under different settings, often dependent on the downstream tasks. As a result, it is challenging to study these algorithms in the literature systematically. Motivated by the recently proposed Veridical Data Science (VDS) framework, we propose a framework for network embedding algorithms and discuss how the principles of predictability, computability and stability apply in this context. The utilization of this framework in network embedding holds the potential to motivate and point to new directions for future research.

Submitted 12 August, 2021; v1 submitted 10 July, 2020; originally announced July 2020.

arXiv:2005.07347 [pdf, other]

Towards Assessment of Randomized Smoothing Mechanisms for Certifying Adversarial Robustness

Authors: Tianhang Zheng, Di Wang, Baochun Li, Jinhui Xu

Abstract: As a certified defensive technique, randomized smoothing has received considerable attention due to its scalability to large datasets and neural networks. However, several important questions remain unanswered, such as (i) whether the Gaussian mechanism is an appropriate option for certifying $\ell_2$-norm robustness, and (ii) whether there is an appropriate randomized (smoothing) mechanism to certify $\ell_\infty$-norm robustness. To shed light on these questions, we argue that the main difficulty is how to assess the appropriateness of each randomized mechanism. In this paper, we propose a generic framework that connects the existing frameworks in \cite{lecuyer2018certified, li2019certified}, to assess randomized mechanisms. Under our framework, for a randomized mechanism that can certify a certain extent of robustness, we define the magnitude of its required additive noise as the metric for assessing its appropriateness. We also prove lower bounds on this metric for the $\ell_2$-norm and $\ell_\infty$-norm cases as the criteria for assessment. Based on our framework, we assess the Gaussian and Exponential mechanisms by comparing the magnitude of additive noise required by these mechanisms and the lower bounds (criteria). We first conclude that the Gaussian mechanism is indeed an appropriate option to certify $\ell_2$-norm robustness. Surprisingly, we show that the Gaussian mechanism is also an appropriate option for certifying $\ell_\infty$-norm robustness, instead of the Exponential mechanism.
Finally, we generalize our framework to $\ell_p$-norm for any $p\geq2$. Our theoretical findings are verified by evaluations on CIFAR10 and ImageNet.

Submitted 7 June, 2020; v1 submitted 14 May, 2020; originally announced May 2020.

Comments: Correct some details of the theorems and proofs

arXiv:2001.09359 [pdf, other]

Diagnostics and Visualization of Point Process Models for Event Times on a Social Network

Authors: Jing Wu, Anna L. Smith, Tian Zheng

Abstract: Point process models have been used to analyze interaction event times on a social network, in the hope of providing valuable insights for social science research. However, the diagnostics and visualization of the modeling results from such an analysis have received limited discussion in the literature. In this paper, we develop a systematic set of diagnostic tools and visualizations for point process models fitted to data from a network setting. We analyze the residual process and Pearson residual on the network by inspecting their structure and clustering structure. Equipped with these tools, we can validate whether a model adequately captures the temporal and/or network structures in the observed data. The utility of our approach is demonstrated using simulation studies and point process models applied to a study of animal social interactions.

Submitted 25 January, 2020; originally announced January 2020.

arXiv:1904.12052 [pdf, ps, other]

Data Poisoning Attack against Knowledge Graph Embedding

Authors: Hengtong Zhang, Tianhang Zheng, Jing Gao, Chenglin Miao, Lu Su, Yaliang Li, Kui Ren

Abstract: Knowledge graph embedding (KGE) is a technique for learning continuous embeddings for entities and relations in the knowledge graph. Due to its benefit to a variety of downstream tasks such as knowledge graph completion, question answering and recommendation, KGE has gained significant attention recently. Despite its effectiveness in a benign environment, KGE's robustness to adversarial attacks is not well-studied. Existing attack methods on graph data cannot be directly applied to attack the embeddings of knowledge graph due to its heterogeneity. To fill this gap, we propose a collection of data poisoning attack strategies, which can effectively manipulate the plausibility of arbitrary targeted facts in a knowledge graph by adding or deleting facts on the graph. The effectiveness and efficiency of the proposed attack strategies are verified by extensive evaluations on two widely-used benchmarks.

Submitted 24 June, 2019; v1 submitted 26 April, 2019; originally announced April 2019.

Comments: Fix typos and version conflicts

arXiv:1903.03223 [pdf, other]

Markov-Modulated Hawkes Processes for Sporadic and Bursty Event Occurrences

Authors: Jing Wu, Owen G. Ward, James Curley, Tian Zheng

Abstract: Modeling event dynamics is central to many disciplines. Patterns in observed event arrival times are commonly modeled using point processes. Such event arrival data often exhibit self-exciting, heterogeneous and sporadic trends, which are challenging for conventional models. It is reasonable to assume that there exists a hidden state process that drives different event dynamics at different states. In this paper, we propose a Markov Modulated Hawkes Process (MMHP) model for learning such a mixture of event dynamics and develop corresponding inference algorithms. Numerical experiments using synthetic data demonstrate that MMHP with the proposed estimation algorithms consistently recovers the true hidden state process in simulations, while email data from a large university and data from an animal behavior study show that the procedure captures distinct event dynamics that reveal interesting social structures in the real data.

Submitted 12 August, 2021; v1 submitted 7 March, 2019; originally announced March 2019.

arXiv:1810.05665

Is PGD-Adversarial Training Necessary? Alternative Training via a Soft-Quantization Network with Noisy-Natural Samples Only

Authors: Tianhang Zheng, Changyou Chen, Kui Ren

Abstract: Recent work on adversarial attack and defense suggests that PGD is a universal $l_\infty$ first-order attack, and PGD adversarial training can significantly improve network robustness against a wide range of first-order $l_\infty$-bounded attacks, making it the state-of-the-art defense method. However, an obvious weakness of PGD adversarial training is its high computational cost in generating adversarial samples, making it computationally infeasible for large and high-resolution real datasets such as the ImageNet dataset. In addition, recent work has also suggested a simple "closed-form" solution to a robust model on MNIST. Therefore, a natural question is whether PGD adversarial training is really necessary for robust defense. In this paper, we give a negative answer by proposing a training paradigm that is comparable to PGD adversarial training on several standard datasets, while only using noisy-natural samples. Specifically, we reformulate the min-max objective in PGD adversarial training as a problem of minimizing the original network loss plus $l_1$ norms of its gradients w.r.t. the inputs. For the $l_1$-norm loss, we propose a computationally-feasible solution by embedding a differentiable soft-quantization layer after the network input layer. We show formally that the soft-quantization layer trained with noisy-natural samples is an alternative approach to minimizing the $l_1$-gradient norms as in PGD adversarial training.
Extensive empirical evaluations on standard datasets show that our proposed models are comparable to PGD-adversarially-trained models under PGD and BPDA attacks. Remarkably, our method achieves a 24X speed-up on MNIST while maintaining a comparable defensive ability, and for the first time fine-tunes a robust ImageNet model within only two days. Code is provided at https://github.com/tianzheng4/Noisy-Training-Soft-Quantization

Submitted 19 October, 2018; v1 submitted 9 October, 2018; originally announced October 2018.

Comments: Further improvement

arXiv:1808.05537 [pdf, other]

Distributionally Adversarial Attack

Authors: Tianhang Zheng, Changyou Chen, Kui Ren

Abstract: Recent work on adversarial attack has shown that Projected Gradient Descent (PGD) Adversary is a universal first-order adversary, and the classifier adversarially trained by PGD is robust against a wide range of first-order attacks. It is worth noting that the original objective of an attack/defense model relies on a data distribution $p(\mathbf{x})$, typically in the form of risk maximization/minimization, e.g., $\max/\min\mathbb{E}_{p(\mathbf{x})}\mathcal{L}(\mathbf{x})$ with $p(\mathbf{x})$ some unknown data distribution and $\mathcal{L}(\cdot)$ a loss function. However, since PGD generates attack samples independently for each data sample based on $\mathcal{L}(\cdot)$, the procedure does not necessarily lead to good generalization in terms of risk optimization. In this paper, we achieve the goal by proposing distributionally adversarial attack (DAA), a framework to solve for an optimal adversarial-data distribution, a perturbed distribution that satisfies the $L_\infty$ constraint but deviates from the original data distribution to increase the generalization risk maximally. Algorithmically, DAA performs optimization on the space of potential data distributions, which introduces direct dependency between all data points when generating adversarial samples. DAA is evaluated by attacking state-of-the-art defense models, including the adversarially-trained models provided by MIT MadryLab.
Notably, DAA ranks first on MadryLab's white-box leaderboards, reducing the accuracy of their secret MNIST model to $88.79\%$ (with $l_\infty$ perturbations of $\epsilon = 0.3$) and the accuracy of their secret CIFAR model to $44.71\%$ (with $l_\infty$ perturbations of $\epsilon = 8.0$). Code for the experiments is released at https://github.com/tianzheng4/Distributionally-Adversarial-Attack.

Submitted 5 December, 2018; v1 submitted 16 August, 2018; originally announced August 2018.

Comments: accepted to AAAI-19

arXiv:1801.04587 [pdf]

A Bayesian Evidence Synthesis Approach to Estimate Disease Prevalence in Hard-To-Reach Populations: Hepatitis C in New York City

Authors: Sarah Tan, Susanna Makela, Daliah Heller, Kevin Konty, Sharon Balter, Tian Zheng, James H. Stark

Abstract: Existing methods to estimate the prevalence of chronic hepatitis C (HCV) in New York City (NYC) are limited in scope and fail to assess hard-to-reach subpopulations with highest risk such as injecting drug users (IDUs). To address these limitations, we employ a Bayesian multi-parameter evidence synthesis model to systematically combine multiple sources of data, account for bias in certain data sources, and provide unbiased HCV prevalence estimates with associated uncertainty. Our approach improves on previous estimates by explicitly accounting for injecting drug use and including data from high-risk subpopulations such as the incarcerated, and is more inclusive, utilizing ten NYC data sources. In addition, we derive two new equations to allow age at first injecting drug use data for former and current IDUs to be incorporated into the Bayesian evidence synthesis, a first for this type of model. Our estimated overall HCV prevalence as of 2012 among NYC adults aged 20-59 years is 2.78% (95% CI 2.61-2.94%), which represents between 124,900 and 140,000 chronic HCV cases. These estimates suggest that HCV prevalence in NYC is higher than previously indicated from household surveys (2.2%) and the surveillance system (2.37%), and that HCV transmission is increasing among young injecting adults in NYC. An ancillary benefit from our results is an estimate of current IDUs aged 20-59 in NYC: 0.58% or 27,600 individuals.

Submitted 14 January, 2018; originally announced January 2018.

arXiv:1709.02899 [pdf, other]

Estimating the theoretical error rate for prediction

Authors: Herman Chernoff, Shaw-Hwa Lo, Tian Zheng, Adeline Lo

Abstract: Prediction for very large data sets is typically carried out in two stages, variable selection and pattern recognition. Ordinarily variable selection involves seeing how well individual explanatory variables are correlated with the dependent variable. This practice neglects the possible interactions among the variables. Simulations have shown that a statistic I that we used for variable selection is much better correlated with predictivity than significance levels. We explain this by defining theoretical predictivity and show how I is related to predictivity. We calculate the biases of the overoptimistic training estimate of predictivity and of the pessimistic out of sample estimate. Corrections for the bias lead to improved estimates of the potential predictivity using small groups of possibly interacting variables. These results support the use of I in the variable selection phase of prediction for data sets such as in GWAS (Genome wide association studies) where there are very many explanatory variables and modest sample sizes. Reference is made to another publication using I, which led to a reduction in the error rate of prediction from 30% to 8%, for a data set with 4,918 variables and 97 subjects. This data set had been previously studied by scientists for over 10 years.

Submitted 8 September, 2017; originally announced September 2017.

arXiv:1604.06498 [pdf, other]

Stabilized Sparse Online Learning for Sparse Data

Authors: Yuting Ma, Tian Zheng

Abstract: Stochastic gradient descent (SGD) is commonly used for optimization in large-scale machine learning problems. Langford et al. (2009) introduce a sparse online learning method to induce sparsity via truncated gradient. With high-dimensional sparse data, however, the method suffers from slow convergence and high variance due to the heterogeneity in feature sparsity. To mitigate this issue, we introduce a stabilized truncated stochastic gradient descent algorithm. We employ a soft-thresholding scheme on the weight vector where the imposed shrinkage is adaptive to the amount of information available in each feature. The variability in the resulting sparse weight vector is further controlled by stability selection integrated with the informative truncation. To facilitate better convergence, we adopt an annealing strategy on the truncation rate, which leads to a balanced trade-off between exploration and exploitation in learning a sparse weight vector. Numerical experiments show that our algorithm compares favorably with the original algorithm in terms of prediction accuracy, achieved sparsity and stability.

Submitted 8 May, 2017; v1 submitted 21 April, 2016; originally announced April 2016.

Comments: 45 pages, 4 figures

arXiv:1604.04899 [pdf, other]

Phase-Aligned Spectral Filtering for Decomposing Spatiotemporal Dynamics

Authors: Lu Meng, Tian Zheng

Abstract: Spatiotemporal dynamics is central to a wide range of applications, from climatology and computer vision to neural sciences. From temporal observations taken on a high-dimensional vector of spatial locations, we seek to derive knowledge about such dynamics via data assimilation and modeling. It is assumed that the observed spatiotemporal data represent superimposed lower-rank smooth oscillations and movements from a generative dynamic system, mixed with higher-rank random noises. Separating the signals from noises is essential for us to visualize, model and understand these lower-rank dynamic systems. It is also often the case that such a lower-rank dynamic system has multiple independent components, corresponding to different trends or functionalities of the system under study. In this paper, we present a novel filtering framework for identifying lower-rank dynamics and its components embedded in a high dimensional spatiotemporal system. It is based on an approach of structural decomposition and phase-aligned construction in the frequency domain. In both our simulated examples and real data applications, we illustrate that the proposed method is able to separate and identify meaningful lower-rank movements, while existing methods fail.

Submitted 17 April, 2016; originally announced April 2016.

Comments: 29 pages, 10 figures

MSC Class: 37M10

ACM Class: G.3; I.5.4

arXiv:1512.03396 [pdf, other]

Boosted Sparse Non-linear Distance Metric Learning

Authors: Yuting Ma, Tian Zheng

Abstract: This paper proposes a boosting-based solution addressing metric learning problems for high-dimensional data. Distance measures have been used as natural measures of (dis)similarity and served as the foundation of various learning methods. The efficiency of distance-based learning methods heavily depends on the chosen distance metric. With increasing dimensionality and complexity of data, however, traditional metric learning methods suffer from poor scalability and the limitation due to linearity as the true signals are usually embedded within a low-dimensional nonlinear subspace. In this paper, we propose a nonlinear sparse metric learning algorithm via boosting. We restructure a global optimization problem into a forward stage-wise learning of weak learners based on a rank-one decomposition of the weight matrix in the Mahalanobis distance metric. A gradient boosting algorithm is devised to obtain a sparse rank-one update of the weight matrix at each step. Nonlinear features are learned by a hierarchical expansion of interactions incorporated within the boosting algorithm. Meanwhile, an early stopping rule is imposed to control the overall complexity of the learned metric. As a result, our approach guarantees three desirable properties of the final metric: positive semi-definiteness, low rank and element-wise sparsity. Numerical experiments show that our learning model compares favorably with the state-of-the-art methods in the current literature of metric learning.

Submitted 10 December, 2015; originally announced December 2015.

arXiv:1502.07190 [pdf, other]

doi 10.1214/15-AOAS887

Topic-adjusted visibility metric for scientific articles

Authors: Linda S. L. Tan, Aik Hui Chan, Tian Zheng

Abstract: Measuring the impact of scientific articles is important for evaluating the research output of individual scientists, academic institutions and journals. While citations are raw data for constructing impact measures, there exist biases and potential issues if factors affecting citation patterns are not properly accounted for. In this work, we address the problem of field variation and introduce an article level metric useful for evaluating individual articles' visibility. This measure derives from joint probabilistic modeling of the content in the articles and the citations amongst them using latent Dirichlet allocation (LDA) and the mixed membership stochastic blockmodel (MMSB). Our proposed model provides a visibility metric for individual articles adjusted for field variation in citation rates, a structural understanding of citation behavior in different fields, and article recommendations which take into account article visibility and citation patterns. We develop an efficient algorithm for model fitting using variational methods. To scale up to large networks, we develop an online variant using stochastic gradient methods and case-control likelihood approximation. We apply our methods to the benchmark KDD Cup 2003 dataset with approximately 30,000 high energy physics papers.

Submitted 16 October, 2015; v1 submitted 25 February, 2015; originally announced February 2015.

Journal ref: Annals of Applied Statistics, Volume 10, Number 1 (2016), 1-31

arXiv:1412.2183 [pdf, other]

Reduced-Rank Covariance Estimation in Vector Autoregressive Modeling

Authors: Richard A. Davis, Pengfei Zang, Tian Zheng

Abstract: We consider reduced-rank modeling of the white noise covariance matrix in a large dimensional vector autoregressive (VAR) model. We first propose the reduced-rank covariance estimator under the setting where independent observations are available. We derive the reduced-rank estimator based on a latent variable model for the vector observation and give the analytical form of its maximum likelihood estimate. Simulation results show that the reduced-rank covariance estimator outperforms two competing covariance estimators for estimating large dimensional covariance matrices from independent observations. Then we describe how to integrate the proposed reduced-rank estimator into the fitting of large dimensional VAR models, where we consider two scenarios that require different model fitting procedures. In the VAR modeling context, our reduced-rank covariance estimator not only provides interpretable descriptions of the dependence structure of VAR processes but also leads to improvement in model-fitting and forecasting over unrestricted covariance estimators. Two real data examples are presented to illustrate these fitting procedures.

Submitted 5 December, 2014; originally announced December 2014.

Comments: 36 pages, 5 figures

arXiv:1304.4851 [pdf, ps, other]

Integrative Analysis of Prognosis Data on Multiple Cancer Subtypes using Penalization

Authors: ** Liu, Jian Huang, Yawei Zhang, Qing Lan, Nathaniel Rothman, Tongzhang Zheng, Shuangge Ma

Abstract: In cancer research, profiling studies have been extensively conducted, searching for genes/SNPs associated with prognosis. Cancer is a heterogeneous disease. Examining similarity and difference in the genetic basis of multiple subtypes of the same cancer can lead to better understanding of their connections and distinctions. Classic meta-analysis approaches analyze each subtype separately and then compare analysis results across subtypes. Integrative analysis approaches, in contrast, analyze the raw data on multiple subtypes simultaneously and can outperform meta-analysis. In this study, prognosis data on multiple subtypes of the same cancer are analyzed. An AFT (accelerated failure time) model is adopted to describe survival. The genetic basis of multiple subtypes is described using the heterogeneity model, which allows a gene/SNP to be associated with the prognosis of some subtypes but not the others. A compound penalization approach is developed to conduct gene-level analysis and identify genes that contain important SNPs associated with prognosis. The proposed approach has an intuitive formulation and can be realized using an iterative algorithm. Asymptotic properties are rigorously established. Simulation shows that the proposed approach has satisfactory performance and outperforms meta-analysis using penalization. An NHL (non-Hodgkin lymphoma) prognosis study with SNP measurements is analyzed. Genes associated with the three major subtypes, namely DLBCL, FL, and CLL/SLL, are identified. The proposed approach identifies genes different from alternative analysis and has reasonable prediction performance.

Submitted 17 April, 2013; originally announced April 2013.

Comments: 23 pages (main text) 17 pages (appendix), 12 figures

arXiv:1302.2142 [pdf, other]

Simulation-efficient shortest probability intervals

Authors: Ying Liu, Andrew Gelman, Tian Zheng

Abstract: Bayesian highest posterior density (HPD) intervals can be estimated directly from simulations via empirical shortest intervals. Unfortunately, these can be noisy (that is, have a high Monte Carlo error). We derive an optimal weighting strategy using bootstrap and quadratic programming to obtain a more computationally stable HPD, or in general, shortest probability interval (Spin). We prove the consistency of our method. Simulation studies on a range of theoretical and real-data examples, some with symmetric and some with asymmetric posterior densities, show that intervals constructed using Spin have better coverage (relative to the posterior distribution) and lower Monte Carlo error than empirical shortest intervals. We implement the new method in an R package (SPIn) so it can be routinely used in post-processing of Bayesian simulations.

Submitted 8 February, 2013; originally announced February 2013.

Comments: 22 pages, 13 figures

arXiv:1301.2473 [pdf, ps, other]

doi 10.1214/12-AOAS569

Latent demographic profile estimation in hard-to-reach groups

Authors: Tyler H. McCormick, Tian Zheng

Abstract: The sampling frame in most social science surveys excludes members of certain groups, known as hard-to-reach groups. These groups, or subpopulations, may be difficult to access (the homeless, e.g.), camouflaged by stigma (individuals with HIV/AIDS), or both (commercial sex workers). Even basic demographic information about these groups is typically unknown, especially in many developing nations. We present statistical models which leverage social network structure to estimate demographic characteristics of these subpopulations using aggregated relational data (ARD), or questions of the form "How many X's do you know?" Unlike other network-based techniques for reaching these groups, ARD require no special sampling strategy and are easily incorporated into standard surveys. ARD also do not require respondents to reveal their own group membership. We propose a Bayesian hierarchical model for estimating the demographic characteristics of hard-to-reach groups, or latent demographic profiles, using ARD. We propose two estimation techniques. First, we propose a Markov-chain Monte Carlo algorithm for existing data or cases where the full posterior distribution is of interest. For cases when new data can be collected, we propose guidelines and, based on these guidelines, propose a simple estimate motivated by a missing data approach. Using data from McCarty et al. [Human Organization 60 (2001) 28-39], we estimate the age and gender profiles of six hard-to-reach groups, such as individuals who have HIV, women who were raped, and homeless persons. We also evaluate our simple estimates using simulation studies.

Submitted 11 January, 2013; originally announced January 2013.

Comments: Published at http://dx.doi.org/10.1214/12-AOAS569 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS569

Journal ref: Annals of Applied Statistics 2012, Vol. 6, No. 4, 1795-1813

arXiv:1207.0520 [pdf, other]

Sparse Vector Autoregressive Modeling

Authors: Richard A. Davis, Pengfei Zang, Tian Zheng

Abstract: The vector autoregressive (VAR) model has been widely used for modeling temporal dependence in a multivariate time series. For large (and even moderate) dimensions, the number of AR coefficients can be prohibitively large, resulting in noisy estimates, unstable predictions and difficult-to-interpret temporal dependence. To overcome such drawbacks, we propose a 2-stage approach for fitting sparse VAR (sVAR) models in which many of the AR coefficients are zero. The first stage selects non-zero AR coefficients based on an estimate of the partial spectral coherence (PSC) together with the use of BIC. The PSC is useful for quantifying the conditional relationship between marginal series in a multivariate process. A refinement second stage is then applied to further reduce the number of parameters. The performance of this 2-stage approach is illustrated with simulation results. The 2-stage approach is also applied to two real data examples: the first is the Google Flu Trends data and the second is a time series of concentration levels of air pollutants.

Submitted 2 July, 2012; originally announced July 2012.

Comments: 39 pages, 7 figures

arXiv:1102.2993 [pdf, ps, other]

doi 10.1214/08-STS244A

Comment: Quantifying the Fraction of Missing Information for Hypothesis Testing in Statistical and Genetic Studies

Authors: Tian Zheng, Shaw-Hwa Lo

Abstract: Comment on "Quantifying the Fraction of Missing Information for Hypothesis Testing in Statistical and Genetic Studies" [arXiv:1102.2774]

Submitted 15 February, 2011; originally announced February 2011.

Comments: Published at http://dx.doi.org/10.1214/08-STS244A in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-STS-STS244A

Journal ref: Statistical Science 2008, Vol. 23, No. 3, 321-324

arXiv:1009.5744 [pdf, ps, other]

doi 10.1214/09-AOAS265

Discovering influential variables: A method of partitions

Authors: Herman Chernoff, Shaw-Hwa Lo, Tian Zheng

Abstract: A trend in all scientific disciplines, based on advances in technology, is the increasing availability of high dimensional data in which important information is buried. A current urgent challenge to statisticians is to develop effective methods of finding the useful information from the vast amounts of messy and noisy data available, most of which are noninformative. This paper presents a general computer intensive approach, based on a method pioneered by Lo and Zheng, for detecting which, of many potential explanatory variables, have an influence on a dependent variable $Y$. This approach is suited to detect influential variables, where causal effects depend on the confluence of values of several variables. It has the advantage of avoiding a difficult direct analysis, involving possibly thousands of variables, by dealing with many randomly selected small subsets from which smaller subsets are selected, guided by a measure of influence $I$. The main objective is to discover the influential variables, rather than to measure their effects. Once they are detected, the problem of dealing with a much smaller group of influential variables should be vulnerable to appropriate analysis. In a sense, we are confining our attention to locating a few needles in a haystack.

Submitted 28 September, 2010; originally announced September 2010.

Comments: Published at http://dx.doi.org/10.1214/09-AOAS265 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS265

Journal ref: Annals of Applied Statistics 2009, Vol. 3, No. 4, 1335-1369

Showing 1–29 of 29 results for author: Zheng, T