Bayesian modeling of co-occurrence microbial interaction networks

Authors: Tejasv Bedi, Bencong Zhu, Michael L. Neugent, Kevin C. Lutz, Nicole J. De Nisco, Qiwei Li

Abstract: The human body consists of microbiomes associated with the development and prevention of several diseases. These microbial organisms form several complex interactions that are informative to the scientific community for explaining disease progression and prevention. Contrary to the traditional view of the microbiome as a singular, assortative network, we introduce a novel statistical approach using a weighted stochastic infinite block model to analyze the complex community structures within co-occurrence microbial interaction networks. Our model defines connections between microbial taxa using a novel semi-parametric rank-based correlation method on their transformed relative abundances within a fully connected network framework. Employing a Bayesian nonparametric approach, the proposed model effectively clusters taxa into distinct communities while estimating the number of communities. The posterior summary of the taxa community membership is obtained based on the posterior probability matrix, which could naturally solve the label switching problem. Through simulation studies and real-world application to microbiome data from postmenopausal patients with recurrent urinary tract infections, we demonstrate that our method has superior clustering accuracy over alternative approaches. This advancement provides a more nuanced understanding of microbiome organization, with significant implications for disease research.

Submitted 14 April, 2024; originally announced April 2024.

Comments: 25 pages

arXiv:2403.05803 [pdf, other]

Semiparametric Inference for Regression-Discontinuity Designs

Authors: Rong J. B. Zhu, Weiwei Jiang

Abstract: Treatment effects in regression discontinuity designs (RDDs) are often estimated using local regression methods. However, global approximation methods are generally deemed inefficient. In this paper, we propose a semiparametric framework tailored for estimating treatment effects in RDDs. Our global approach conceptualizes the identification of treatment effects within RDDs as a partially linear modeling problem, with the linear component capturing the treatment effect. Furthermore, we utilize the P-spline method to approximate the nonparametric function and develop procedures for inferring treatment effects within this framework. We demonstrate through Monte Carlo simulations that the proposed method performs well across various scenarios. Furthermore, we illustrate using real-world datasets that our global approach may result in more reliable inference.

Submitted 9 March, 2024; originally announced March 2024.

arXiv:2401.16335 [pdf, other]

Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF

Authors: Banghua Zhu, Michael I. Jordan, Jiantao Jiao

Abstract: Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique that aligns language models closely with human-centric values. The initial phase of RLHF involves learning human values using a reward model from ranking data. It is observed that the performance of the reward model degrades after one epoch of training, and optimizing too much against the learned reward model eventually hinders the true objective. This paper delves into these issues, leveraging the theoretical insights to design an improved reward learning algorithm termed 'Iterative Data Smoothing' (IDS). The core idea is that during each training epoch, we not only update the model with the data, but also update the data using the model, replacing hard labels with soft labels. Our empirical findings highlight the superior performance of this approach over the traditional methods.

Submitted 29 January, 2024; originally announced January 2024.
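
The core idea stated in the abstract above (update the model with the data, then update the data with the model) can be sketched on a toy pairwise-comparison problem. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the Bradley-Terry style logistic model, the function name `iterative_data_smoothing`, and the parameters `lr` and `beta` are all choices made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def iterative_data_smoothing(pairs, hard_labels, n_items,
                             epochs=200, lr=0.1, beta=0.05):
    """Toy sketch: one scalar reward per item, pairwise labels in {0, 1}.

    pairs[k] = (i, j) and hard_labels[k] = 1 means item i was preferred to j.
    lr and beta are hypothetical step sizes for the model and the labels.
    """
    rewards = np.zeros(n_items)
    soft = hard_labels.astype(float).copy()   # start from the hard labels
    i, j = pairs[:, 0], pairs[:, 1]
    for _ in range(epochs):
        # Model step: one gradient step on cross-entropy against soft labels.
        p = sigmoid(rewards[i] - rewards[j])
        grad = np.zeros(n_items)
        np.add.at(grad, i, p - soft)          # d(loss)/d(reward_i)
        np.add.at(grad, j, soft - p)          # d(loss)/d(reward_j)
        rewards -= lr * grad / len(pairs)
        # Data step: move each label toward the model's current prediction.
        p = sigmoid(rewards[i] - rewards[j])
        soft = (1.0 - beta) * soft + beta * p
    return rewards, soft

# Usage: item 0 is preferred to item 1 in every observed comparison.
pairs = np.array([[0, 1]] * 10)
labels = np.ones(10)
rewards, soft = iterative_data_smoothing(pairs, labels, n_items=2)
```

In this toy run the labels end up strictly between 0.5 and 1 instead of staying at the hard value 1, which is the label-softening effect the abstract describes.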

arXiv:2312.08369 [pdf, other]

The Effective Horizon Explains Deep RL Performance in Stochastic Environments

Authors: Cassidy Laidlaw, Banghua Zhu, Stuart Russell, Anca Dragan

Abstract: Reinforcement learning (RL) theory has largely focused on proving minimax sample complexity bounds. These require strategic exploration algorithms that use relatively limited function classes for representing the policy or value function. Our goal is to explain why deep RL algorithms often perform well in practice, despite using random exploration and much more expressive function classes like neural networks. Our work arrives at an explanation by showing that many stochastic MDPs can be solved by performing only a few steps of value iteration on the random policy's Q function and then acting greedily. When this is true, we find that it is possible to separate the exploration and learning components of RL, making it much easier to analyze. We introduce a new RL algorithm, SQIRL, that iteratively learns a near-optimal policy by exploring randomly to collect rollouts and then performing a limited number of steps of fitted-Q iteration over those rollouts. Any regression algorithm that satisfies basic in-distribution generalization properties can be used in SQIRL to efficiently solve common MDPs. This can explain why deep RL works, since it is empirically established that neural networks generalize well in-distribution. Furthermore, SQIRL explains why random exploration works well in practice. We leverage SQIRL to derive instance-dependent sample complexity bounds for RL that are exponential only in an "effective horizon" of lookahead and on the complexity of the class used for function approximation. Empirically, we also find that SQIRL performance strongly correlates with PPO and DQN performance in a variety of stochastic environments, supporting that our theoretical analysis is predictive of practical performance. Our code and data are available at https://github.com/cassidylaidlaw/effective-horizon.

Submitted 12 April, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

Journal ref: ICLR 2024 (Spotlight)

arXiv:2312.08324 [pdf, other]

Bayesian Nonparametric Clustering with Feature Selection for Spatially Resolved Transcriptomics Data

Authors: Bencong Zhu, Guanyu Hu, Yang Xie, Lin Xu, Xiaodan Fan, Qiwei Li

Abstract: The advent of next-generation sequencing-based spatially resolved transcriptomics (SRT) techniques has reshaped genomic studies by enabling high-throughput gene expression profiling while preserving spatial and morphological context. Nevertheless, there are inherent challenges associated with these new high-dimensional spatial data, such as zero-inflation, over-dispersion, and heterogeneity. These challenges pose obstacles to effective clustering, which is a fundamental problem in SRT data analysis. Current computational approaches often rely on heuristic data preprocessing and arbitrary cluster number prespecification, leading to considerable information loss and consequently, suboptimal downstream analysis. In response to these challenges, we introduce BNPSpace, a novel Bayesian nonparametric spatial clustering framework that directly models SRT count data. BNPSpace facilitates the partitioning of the whole spatial domain, which is characterized by substantial heterogeneity, into homogeneous spatial domains with similar molecular characteristics while identifying a parsimonious set of discriminating genes among different spatial domains. Moreover, BNPSpace incorporates spatial information through a Markov random field prior model, encouraging a smooth and biologically meaningful partition pattern.

Submitted 13 December, 2023; originally announced December 2023.

arXiv:2312.07930 [pdf, other]

Towards Optimal Statistical Watermarking

Authors: Baihe Huang, Hanlin Zhu, Banghua Zhu, Kannan Ramchandran, Michael I. Jordan, Jason D. Lee, Jiantao Jiao

Abstract: We study statistical watermarking by formulating it as a hypothesis testing problem, a general framework which subsumes all previous statistical watermarking methods. Key to our formulation is a coupling of the output tokens and the rejection region, realized by pseudo-random generators in practice, that allows non-trivial trade-offs between the Type I error and Type II error. We characterize the Uniformly Most Powerful (UMP) watermark in the general hypothesis testing setting and the minimax Type II error in the model-agnostic setting. In the common scenario where the output is a sequence of $n$ tokens, we establish nearly matching upper and lower bounds on the number of i.i.d. tokens required to guarantee small Type I and Type II errors. Our rate of $\Theta(h^{-1} \log (1/h))$ with respect to the average entropy per token $h$ highlights potentials for improvement from the rate of $h^{-2}$ in the previous works. Moreover, we formulate the robust watermarking problem where the user is allowed to perform a class of perturbations on the generated texts, and characterize the optimal Type II error of robust UMP tests via a linear programming problem. To the best of our knowledge, this is the first systematic statistical treatment on the watermarking problem with near-optimal rates in the i.i.d. setting, which might be of interest for future works.

Submitted 6 February, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

arXiv:2310.07838 [pdf, other]

Towards the Fundamental Limits of Knowledge Transfer over Finite Domains

Authors: Qingyue Zhao, Banghua Zhu

Abstract: We characterize the statistical efficiency of knowledge transfer through $n$ samples from a teacher to a probabilistic student classifier with input space $\mathcal S$ over labels $\mathcal A$. We show that privileged information at three progressive levels accelerates the transfer. At the first level, only samples with hard labels are known, via which the maximum likelihood estimator attains the minimax rate $\sqrt{{|{\mathcal S}||{\mathcal A}|}/{n}}$. The second level has the teacher probabilities of sampled labels available in addition, which turns out to boost the convergence rate lower bound to ${{|{\mathcal S}||{\mathcal A}|}/{n}}$. However, under this second data acquisition protocol, minimizing a naive adaptation of the cross-entropy loss results in an asymptotically biased student. We overcome this limitation and achieve the fundamental limit by using a novel empirical variant of the squared error logit loss. The third level further equips the student with the soft labels (complete logits) on ${\mathcal A}$ given every sampled input, thereby provably enables the student to enjoy a rate ${|{\mathcal S}|}/{n}$ free of $|{\mathcal A}|$. We find any Kullback-Leibler divergence minimizer to be optimal in the last case. Numerical simulations distinguish the four learners and corroborate our theory.

Submitted 14 November, 2023; v1 submitted 11 October, 2023; originally announced October 2023.

Comments: 41 pages, 2 figures; Appendix polished

arXiv:2308.12016 [pdf, ps, other]

MKL-$L_{0/1}$-SVM

Authors: Bin Zhu, Yijie Shi

Abstract: This paper presents a Multiple Kernel Learning (abbreviated as MKL) framework for the Support Vector Machine (SVM) with the $(0, 1)$ loss function. Some KKT-like first-order optimality conditions are provided and then exploited to develop a fast ADMM algorithm to solve the nonsmooth nonconvex optimization problem. Numerical experiments on real data sets show that the performance of our MKL-$L_{0/1}$-SVM is comparable with that of SimpleMKL, one of the leading approaches, developed by Rakotomamonjy, Bach, Canu, and Grandvalet [Journal of Machine Learning Research, vol. 9, pp. 2491-2521, 2008].

Submitted 3 September, 2023; v1 submitted 23 August, 2023; originally announced August 2023.

Comments: 26 pages in the JMLR template, 3 figures, and 2 tables, submitted to the Journal of Machine Learning Research, with minor text overlap with arXiv: 2303.04445 (conference version). arXiv admin note: text overlap with arXiv:2303.04445

arXiv:2306.02584 [pdf, other]

Synthetic Regressing Control Method

Authors: Rong J. B. Zhu

Abstract: Estimating weights in the synthetic control method, typically resulting in sparse weights where only a few control units have non-zero weights, involves an optimization procedure that simultaneously selects and aligns control units to closely match the treated unit. However, this simultaneous selection and alignment of control units may lead to a loss of efficiency. Another concern arising from the aforementioned procedure is its susceptibility to under-fitting due to imperfect pre-treatment fit. It is not uncommon for the linear combination, using nonnegative weights, of pre-treatment period outcomes for the control units to inadequately approximate the pre-treatment outcomes for the treated unit. To address both of these issues, this paper proposes a simple and effective method called Synthetic Regressing Control (SRC). The SRC method begins by performing the univariate linear regression to appropriately align the pre-treatment periods of the control units with the treated unit. Subsequently, an SRC estimator is obtained by synthesizing (taking a weighted average of) the fitted controls. To determine the weights in the synthesis procedure, we propose an approach that utilizes a criterion of unbiased risk estimator. Theoretically, we show that the synthesis procedure is asymptotically optimal in the sense of achieving the lowest possible squared error. Extensive numerical experiments highlight the advantages of the SRC method.

Submitted 23 October, 2023; v1 submitted 5 June, 2023; originally announced June 2023.

arXiv:2306.02003 [pdf, other]

On Optimal Caching and Model Multiplexing for Large Model Inference

Authors: Banghua Zhu, Ying Sheng, Lianmin Zheng, Clark Barrett, Michael I. Jordan, Jiantao Jiao

Abstract: Large Language Models (LLMs) and other large foundation models have achieved noteworthy success, but their size exacerbates existing resource consumption and latency challenges. In particular, the large-scale deployment of these models is hindered by the significant resource requirements during inference. In this paper, we study two approaches for mitigating these challenges: employing a cache to store previous queries and learning a model multiplexer to choose from an ensemble of models for query processing. Theoretically, we provide an optimal algorithm for jointly optimizing both approaches to reduce the inference cost in both offline and online tabular settings. By combining a caching algorithm, namely Greedy Dual Size with Frequency (GDSF) or Least Expected Cost (LEC), with a model multiplexer, we achieve optimal rates in both offline and online settings. Empirically, simulations show that the combination of our caching and model multiplexing algorithms greatly improves over the baselines, with up to $50\times$ improvement over the baseline when the ratio between the maximum cost and minimum cost is $100$. Experiments on real datasets show a $4.3\times$ improvement in FLOPs over the baseline when the ratio for FLOPs is $10$, and a $1.8\times$ improvement in latency when the ratio for average latency is $1.85$.

Submitted 28 August, 2023; v1 submitted 3 June, 2023; originally announced June 2023.

arXiv:2306.00265 [pdf, other]

Doubly Robust Self-Training

Authors: Banghua Zhu, Mingyu Ding, Philip Jacobson, Ming Wu, Wei Zhan, Michael Jordan, Jiantao Jiao

Abstract: Self-training is an important technique for solving semi-supervised learning problems. It leverages unlabeled data by generating pseudo-labels and combining them with a limited labeled dataset for training. The effectiveness of self-training heavily relies on the accuracy of these pseudo-labels. In this paper, we introduce doubly robust self-training, a novel semi-supervised algorithm that provably balances between two extremes. When the pseudo-labels are entirely incorrect, our method reduces to a training process solely using labeled data. Conversely, when the pseudo-labels are completely accurate, our method transforms into a training process utilizing all pseudo-labeled data and labeled data, thus increasing the effective sample size. Through empirical evaluations on both the ImageNet dataset for image classification and the nuScenes autonomous driving dataset for 3D object detection, we demonstrate the superiority of the doubly robust loss over the standard self-training baseline.

Submitted 2 November, 2023; v1 submitted 31 May, 2023; originally announced June 2023.

arXiv:2303.04445 [pdf, ps, other]

An ADMM Solver for the MKL-$L_{0/1}$-SVM

Authors: Yijie Shi, Bin Zhu

Abstract: We formulate the Multiple Kernel Learning (abbreviated as MKL) problem for the support vector machine with the infamous $(0,1)$-loss function. Some first-order optimality conditions are given and then exploited to develop a fast ADMM solver for the nonconvex and nonsmooth optimization problem. A simple numerical experiment on synthetic planar data shows that our MKL-$L_{0/1}$-SVM framework could be promising.

Submitted 30 March, 2023; v1 submitted 8 March, 2023; originally announced March 2023.

Comments: 8 pages, 3 figures, 2 tables. Submitted to the 62nd IEEE Conference on Decision and Control as a Regular paper, with a shortened version (arXiv version 1) submitted to the 3rd Chinese Conference on Predictive Control and Intelligent Decision (CPCID) as an Extended Abstract

arXiv:2301.11270 [pdf, other]

Principled Reinforcement Learning with Human Feedback from Pairwise or $K$-wise Comparisons

Authors: Banghua Zhu, Jiantao Jiao, Michael I. Jordan

Abstract: We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). Our analysis shows that when the true reward function is linear, the widely used maximum likelihood estimator (MLE) converges under both the Bradley-Terry-Luce (BTL) model and the Plackett-Luce (PL) model. However, we show that when training a policy based on the learned reward model, MLE fails while a pessimistic MLE provides policies with improved performance under certain coverage assumptions. Additionally, we demonstrate that under the PL model, the true MLE and an alternative MLE that splits the $K$-wise comparison into pairwise comparisons both converge. Moreover, the true MLE is asymptotically more efficient. Our results validate the empirical success of existing RLHF algorithms in InstructGPT and provide new insights for algorithm design. Furthermore, our results unify the problem of RLHF and max-entropy Inverse Reinforcement Learning (IRL), and provide the first sample complexity bound for max-entropy IRL.

Submitted 7 February, 2024; v1 submitted 26 January, 2023; originally announced January 2023.

arXiv:2211.03710 [pdf, other]

Graph Contrastive Learning with Implicit Augmentations

Authors: Huidong Liang, Xingjian Du, Bilei Zhu, Zejun Ma, Ke Chen, Junbin Gao

Abstract: Existing graph contrastive learning methods rely on augmentation techniques based on random perturbations (e.g., randomly adding or dropping edges and nodes). Nevertheless, altering certain edges or nodes can unexpectedly change the graph characteristics, and choosing the optimal perturbing ratio for each dataset requires onerous manual tuning. In this paper, we introduce Implicit Graph Contrastive Learning (iGCL), which utilizes augmentations in the latent space learned from a Variational Graph Auto-Encoder by reconstructing graph topological structure. Importantly, instead of explicitly sampling augmentations from latent distributions, we further propose an upper bound for the expected contrastive loss to improve the efficiency of our learning algorithm. Thus, graph semantics can be preserved within the augmentations in an intelligent way without arbitrary manual design or prior human knowledge. Experimental results on both graph-level and node-level tasks show that the proposed method achieves state-of-the-art performance compared to other benchmarks, where ablation studies in the end demonstrate the effectiveness of modules in iGCL.

Submitted 7 November, 2022; originally announced November 2022.

arXiv:2210.15801 [pdf, ps, other]

Clustering High-dimensional Data via Feature Selection

Authors: Tianqi Liu, Yu Lu, Biqing Zhu, Hongyu Zhao

Abstract: High-dimensional clustering analysis is a challenging problem in statistics and machine learning, with broad applications such as the analysis of microarray data and RNA-seq data. In this paper, we propose a new clustering procedure called Spectral Clustering with Feature Selection (SC-FS), where we first obtain an initial estimate of labels via spectral clustering, then select a small fraction of features with the largest R-squared with these labels, i.e., the proportion of variation explained by group labels, and conduct clustering again using selected features. Under mild conditions, we prove that the proposed method identifies all informative features with high probability and achieves minimax optimal clustering error rate for the sparse Gaussian mixture model. Applications of SC-FS to four real world data sets demonstrate its usefulness in clustering high-dimensional data.

Submitted 27 October, 2022; originally announced October 2022.

Comments: Accepted at Biometrics Journal (https://onlinelibrary.wiley.com/doi/epdf/10.1111/biom.13665)
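
The three steps named in the SC-FS abstract above (initial spectral labels, R-squared feature screening, re-clustering on the selected features) can be sketched as follows. This is a minimal sketch and not the authors' implementation: it assumes exactly two clusters, uses the sign of the leading left singular vector as a simple stand-in for spectral clustering, and the names `spectral_labels`, `r_squared_per_feature`, `sc_fs`, and `n_keep` are hypothetical.

```python
import numpy as np

def spectral_labels(X):
    """Two-cluster labels from the sign of the leading left singular vector."""
    Xc = X - X.mean(axis=0)
    u, _, _ = np.linalg.svd(Xc, full_matrices=False)
    return (u[:, 0] > 0).astype(int)

def r_squared_per_feature(X, labels):
    """Proportion of each feature's variation explained by the group labels.

    Assumes no feature is constant (total sum of squares is nonzero).
    """
    total = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    within = np.zeros(X.shape[1])
    for g in np.unique(labels):
        Xg = X[labels == g]
        within += ((Xg - Xg.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - within / total

def sc_fs(X, n_keep):
    """Cluster, keep the n_keep features with largest R-squared, cluster again."""
    initial = spectral_labels(X)
    r2 = r_squared_per_feature(X, initial)
    selected = np.argsort(r2)[-n_keep:]
    final = spectral_labels(X[:, selected])
    return final, np.sort(selected)

# Usage on synthetic data: 10 informative features out of 500.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 500))
X[:100, :10] += 2.0
X[100:, :10] -= 2.0
labels, selected = sc_fs(X, n_keep=10)
```

Cluster labels are only defined up to a swap of 0 and 1, so accuracy against a known truth should be computed for both labelings.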

arXiv:2208.10059 [pdf, ps, other]

Sampling Gaussian Stationary Random Fields: A Stochastic Realization Approach

Authors: Bin Zhu, Jiahao Liu, Zhengshou Lai, Tao Qian

Abstract: Generating large-scale samples of stationary random fields is of great importance in fields such as geomaterial modeling and uncertainty quantification. Traditional methodologies based on covariance matrix decomposition have the difficulty of being computationally expensive, which is even more serious when the dimension of the random field is large. This paper proposes an efficient stochastic realization approach for sampling Gaussian stationary random fields from a systems and control point of view. Specifically, we take the exponential and Gaussian covariance functions as examples and make a decoupling assumption when there are multiple dimensions. Then a rational spectral density is constructed in each dimension using techniques from covariance extension, and the corresponding autoregressive moving-average (ARMA) model is obtained via spectral factorization. As a result, samples of the random field with a specific covariance function can be generated very efficiently in the space domain by implementing the ARMA recursion using a white noise input. Such a procedure is computationally cheap due to the fact that the constructed ARMA model has a low order. Furthermore, the same method is integrated to multiscale simulations where interpolations of the generated samples are achieved when one zooms into finer scales. Both theoretical analysis and simulation results show that our approach performs favorably compared with covariance matrix decomposition methods.

Submitted 22 August, 2022; originally announced August 2022.

Comments: 17 pages, 9 figures
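
As a rough illustration of the recursion idea in the abstract above: in one dimension, an exponential covariance $\sigma^2 e^{-|\tau|/\ell}$ sampled on a uniform grid is reproduced exactly by a first-order autoregressive model driven by white noise, so samples can be drawn in linear time without factoring a covariance matrix. The sketch below covers only this simplest case and is not the authors' code; the function name `sample_exponential_field` and the parameters `length_scale`, `spacing`, and `sigma` are assumptions made for the example.

```python
import numpy as np

def sample_exponential_field(n, length_scale, spacing=1.0, sigma=1.0, rng=None):
    """Draw n grid samples of a zero-mean Gaussian process whose covariance
    is sigma^2 * exp(-|tau| / length_scale), using an AR(1) recursion."""
    rng = np.random.default_rng() if rng is None else rng
    a = np.exp(-spacing / length_scale)            # lag-one correlation
    innovation_std = sigma * np.sqrt(1.0 - a * a)  # keeps the variance stationary
    x = np.empty(n)
    x[0] = sigma * rng.standard_normal()           # start in the stationary law
    noise = rng.standard_normal(n - 1)             # white noise input
    for t in range(1, n):
        x[t] = a * x[t - 1] + innovation_std * noise[t - 1]
    return x

# Usage: 200000 samples with correlation length 5 grid cells.
x = sample_exponential_field(200_000, length_scale=5.0,
                             rng=np.random.default_rng(0))
lag1 = np.corrcoef(x[:-1], x[1:])[0, 1]   # should be close to exp(-1/5)
```

The Gaussian covariance mentioned in the abstract has no exact low-order recursion of this kind, which is where the rational spectral density approximation described by the authors comes in.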

arXiv:2205.11765 [pdf, ps, other]

Byzantine-Robust Federated Learning with Optimal Statistical Rates and Privacy Guarantees

Authors: Banghua Zhu, Lun Wang, Qi Pang, Shuai Wang, Jiantao Jiao, Dawn Song, Michael I. Jordan

Abstract: We propose Byzantine-robust federated learning protocols with nearly optimal statistical rates. In contrast to prior work, our proposed protocols improve the dimension dependence and achieve a tight statistical rate in terms of all the parameters for strongly convex losses. We benchmark against competing protocols and show the empirical superiority of the proposed protocols. Finally, we remark that our protocols with bucketing can be naturally combined with privacy-guaranteeing procedures to introduce security against a semi-honest server. The code for evaluation is provided in https://github.com/wanglun1996/secure-robust-federated-learning.

Submitted 18 March, 2023; v1 submitted 24 May, 2022; originally announced May 2022.

arXiv:2202.01269 [pdf, ps, other]

Robust Estimation for Nonparametric Families via Generative Adversarial Networks

Authors: Banghua Zhu, Jiantao Jiao, Michael I. Jordan

Abstract: We provide a general framework for designing Generative Adversarial Networks (GANs) to solve high dimensional robust statistics problems, which aim at estimating unknown parameters of the true distribution given adversarially corrupted samples. Prior work focuses on the problem of robust mean and covariance estimation when the true distribution lies in the family of Gaussian distributions or elliptical distributions, and analyzes depth or scoring rule based GAN losses for the problem. Our work extends these to robust mean estimation, second moment estimation, and robust linear regression when the true distribution only has bounded Orlicz norms, which includes the broad family of sub-Gaussian, sub-Exponential and bounded moment distributions. We also provide a different set of sufficient conditions for the GAN loss to work: we only require its induced distance function to be a cumulative density function of some light-tailed distribution, which is easily satisfied by neural networks with sigmoid activation. In terms of techniques, our proposed GAN losses can be viewed as a smoothed and generalized Kolmogorov-Smirnov distance, which overcomes the computational intractability of the original Kolmogorov-Smirnov distance used in the prior work.

Submitted 2 February, 2022; originally announced February 2022.

arXiv:2103.12021 [pdf, other]

Bridging Offline Reinforcement Learning and Imitation Learning: A Tale of Pessimism

Authors: Paria Rashidinejad, Banghua Zhu, Cong Ma, Jiantao Jiao, Stuart Russell

Abstract: Offline (or batch) reinforcement learning (RL) algorithms seek to learn an optimal policy from a fixed dataset without active data collection. Based on the composition of the offline dataset, two main categories of methods are used: imitation learning which is suitable for expert datasets and vanilla offline RL which often requires uniform coverage datasets. From a practical standpoint, datasets often deviate from these two extremes and the exact data composition is usually unknown a priori. To bridge this gap, we present a new offline RL framework that smoothly interpolates between the two extremes of data composition, hence unifying imitation learning and vanilla offline RL. The new framework is centered around a weak version of the concentrability coefficient that measures the deviation from the behavior policy to the expert policy alone. Under this new framework, we further investigate the question on algorithm design: can one develop an algorithm that achieves a minimax optimal rate and also adapts to unknown data composition? To address this question, we consider a lower confidence bound (LCB) algorithm developed based on pessimism in the face of uncertainty in offline RL. We study finite-sample properties of LCB as well as information-theoretic limits in multi-armed bandits, contextual bandits, and Markov decision processes (MDPs). Our analysis reveals surprising facts about optimality rates. In particular, in all three settings, LCB achieves a faster rate of $1/N$ for nearly-expert datasets compared to the usual rate of $1/\sqrt{N}$ in offline RL, where $N$ is the number of samples in the batch dataset. In the case of contextual bandits with at least two contexts, we prove that LCB is adaptively optimal for the entire data composition range, achieving a smooth transition from imitation learning to offline RL. We further show that LCB is almost adaptively optimal in MDPs.

Submitted 3 July, 2023; v1 submitted 22 March, 2021; originally announced March 2021.

Journal ref: Published at NeurIPS 2021 and IEEE Transactions on Information Theory
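
The pessimism principle behind the LCB algorithm mentioned in the abstract above can be illustrated in the simplest setting, an offline multi-armed bandit. The following is a minimal sketch, not the authors' implementation: the function name `lcb_select`, the Hoeffding-style penalty, and the `delta` parameter are assumptions made for illustration, and rewards are assumed to lie in [0, 1].

```python
import math

def lcb_select(dataset, n_arms, delta=0.1):
    """Return (arm, bound) for the arm with the highest lower confidence bound.

    dataset: list of (arm, reward) pairs collected offline, rewards in [0, 1].
    The Hoeffding-style penalty below is an illustrative assumption.
    """
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for arm, reward in dataset:
        counts[arm] += 1
        sums[arm] += reward

    best_arm, best_lcb = None, -math.inf
    for a in range(n_arms):
        if counts[a] == 0:
            continue  # never observed: treated as maximally pessimistic
        mean = sums[a] / counts[a]
        # Penalty shrinks as the arm is observed more often.
        penalty = math.sqrt(math.log(2 * n_arms / delta) / (2 * counts[a]))
        lcb = mean - penalty
        if lcb > best_lcb:
            best_arm, best_lcb = a, lcb
    return best_arm, best_lcb

# Arm 1 has the higher empirical mean but only a single sample,
# so the pessimistic rule prefers the well-covered arm 0.
data = [(0, 0.8)] * 50 + [(1, 1.0)]
arm, bound = lcb_select(data, n_arms=2)
```

A purely greedy rule on empirical means would pick arm 1 here; subtracting the uncertainty penalty is what makes the choice robust to poorly covered arms.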

arXiv:2102.03240 [pdf]

De-carbonization of global energy use during the COVID-19 pandemic

Authors: Zhu Liu, Biqing Zhu, Philippe Ciais, Steven J. Davis, Chenxi Lu, Haiwang Zhong, Piyu Ke, Yanan Cui, Zhu Deng, Duo Cui, Taochun Sun, Xinyu Dou, Jianguang Tan, Rui Guo, Bo Zheng, Katsumasa Tanaka, Wenli Zhao, Pierre Gentine

Abstract: The COVID-19 pandemic has disrupted human activities, leading to unprecedented decreases in both global energy demand and GHG emissions. Yet it is little known that there was also a low-carbon shift of the global energy system in 2020. Here, using the near-real-time data on energy-related GHG emissions from 30 countries (about 70% of global power generation), we show that the pandemic caused an unprecedented de-carbonization of the global power system, represented by a dramatic decrease in the carbon intensity of the power sector that reached a historical low of 414.9 tCO2eq/GWh in 2020. Moreover, the share of energy derived from renewable and low-carbon sources (nuclear, hydro-energy, wind, solar, geothermal, and biomass) exceeded that from coal and oil for the first time in history in May of 2020. The decrease in global net energy demand (-1.3% in the first half of 2020 relative to the average of the period in 2016-2019) masks a large down-regulation of fossil-fuel-burning power plants supply (-6.1%) coincident with a surge of low-carbon sources (+6.2%). Concomitant changes in the diurnal cycle of electricity demand also favored low-carbon generators, including a flattening of the morning ramp, a lower midday peak, and delays in both the morning and midday load peaks in most countries. However, emission intensities in the power sector have since rebounded in many countries, and a key question for climate mitigation is thus to what extent countries can achieve and maintain lower, pandemic-level carbon intensities of electricity as part of a green recovery.

Submitted 5 February, 2021; originally announced February 2021.

arXiv:2101.07781 [pdf, other]

Minimax Off-Policy Evaluation for Multi-Armed Bandits

Authors: Cong Ma, Banghua Zhu, Jiantao Jiao, Martin J. Wainwright

Abstract: We study the problem of off-policy evaluation in the multi-armed bandit model with bounded rewards, and develop minimax rate-optimal procedures under three settings. First, when the behavior policy is known, we show that the Switch estimator, a method that alternates between the plug-in and importance sampling estimators, is minimax rate-optimal for all sample sizes. Second, when the behavior policy is unknown, we analyze performance in terms of the competitive ratio, thereby revealing a fundamental gap between the settings of known and unknown behavior policies. When the behavior policy is unknown, any estimator must have mean-squared error larger -- relative to the oracle estimator equipped with the knowledge of the behavior policy -- by a multiplicative factor proportional to the support size of the target policy. Moreover, we demonstrate that the plug-in approach achieves this worst-case competitive ratio up to a logarithmic factor. Third, we initiate the study of the partial knowledge setting in which it is assumed that the minimum probability taken by the behavior policy is known. We show that the plug-in estimator is optimal for relatively large values of the minimum probability, but is sub-optimal when the minimum probability is low. In order to remedy this gap, we propose a new estimator based on approximation by Chebyshev polynomials that provably achieves the optimal estimation error. Numerical experiments on both simulated and real data corroborate our theoretical findings.

Submitted 19 January, 2021; originally announced January 2021.
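
The two building blocks named in the abstract above, the plug-in and importance sampling estimators, can be sketched for a multi-armed bandit as follows. This is an illustrative sketch only: the function names and the dictionary representation of policies are assumptions, and the Switch estimator itself (the rule for choosing between the two per action) is not reproduced here because the listing does not specify it.

```python
def plug_in_estimate(data, target_policy):
    """Plug-in value estimate: weight each arm's empirical mean reward by the target policy.

    data: list of (arm, reward) pairs logged under the behavior policy.
    Arms never observed contribute 0 (an assumption made for this sketch).
    """
    counts, sums = {}, {}
    for arm, reward in data:
        counts[arm] = counts.get(arm, 0) + 1
        sums[arm] = sums.get(arm, 0.0) + reward
    return sum(prob * sums[arm] / counts[arm]
               for arm, prob in target_policy.items() if arm in counts)

def importance_sampling_estimate(data, target_policy, behavior_policy):
    """Importance sampling estimate: reweight each logged reward by pi(a) / mu(a).

    Requires the behavior policy mu to be known and positive on logged arms.
    """
    total = sum(target_policy[arm] / behavior_policy[arm] * reward
                for arm, reward in data)
    return total / len(data)

# Toy log from a uniform behavior policy over two arms.
log = [(0, 1.0), (1, 1.0), (1, 1.0), (1, 0.0)]
target = {0: 0.25, 1: 0.75}
behavior = {0: 0.5, 1: 0.5}

v_plug_in = plug_in_estimate(log, target)
v_is = importance_sampling_estimate(log, target, behavior)
```

Only the importance sampling estimator needs the behavior policy, which is why the known and unknown behavior-policy settings in the abstract call for different analyses.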

arXiv:2101.04750 [pdf, other]

Linear Representation Meta-Reinforcement Learning for Instant Adaptation

Authors: Matt Peng, Banghua Zhu, Jiantao Jiao

Abstract: This paper introduces Fast Linearized Adaptive Policy (FLAP), a new meta-reinforcement learning (meta-RL) method that is able to extrapolate well to out-of-distribution tasks without the need to reuse data from training, and adapt almost instantaneously with the need of only a few samples during testing. FLAP builds upon the idea of learning a shared linear representation of the policy so that when adapting to a new task, it suffices to predict a set of linear weights. A separate adapter network is trained simultaneously with the policy such that during adaptation, we can directly use the adapter network to predict these linear weights instead of updating a meta-policy via gradient descent, such as in prior meta-RL methods like MAML, to obtain the new policy. The application of the separate feed-forward network not only speeds up the adaptation run-time significantly, but also generalizes extremely well to very different tasks that prior Meta-RL methods fail to generalize to. Experiments on standard continuous-control meta-RL benchmarks show FLAP presenting significantly stronger performance on out-of-distribution tasks with up to double the average return and up to 8X faster adaptation run-time speeds when compared to prior methods.

Submitted 12 January, 2021; originally announced January 2021.

arXiv:2010.12636 [pdf, ps, other]

Nonseparable Symplectic Neural Networks

Authors: Shiying Xiong, Yunjin Tong, Xingzhe He, Shuqi Yang, Cheng Yang, Bo Zhu

Abstract: Predicting the behaviors of Hamiltonian systems has been drawing increasing attention in scientific machine learning. However, the vast majority of the literature was focused on predicting separable Hamiltonian systems with their kinematic and potential energy terms being explicitly decoupled while building data-driven paradigms to predict nonseparable Hamiltonian systems that are ubiquitous in fluid dynamics and quantum mechanics were rarely explored. The main computational challenge lies in the effective embedding of symplectic priors to describe the inherently coupled evolution of position and momentum, which typically exhibits intricate dynamics. To solve the problem, we propose a novel neural network architecture, Nonseparable Symplectic Neural Networks (NSSNNs), to uncover and embed the symplectic structure of a nonseparable Hamiltonian system from limited observation data. The enabling mechanics of our approach is an augmented symplectic time integrator to decouple the position and momentum energy terms and facilitate their evolution. We demonstrated the efficacy and versatility of our method by predicting a wide range of Hamiltonian systems, both separable and nonseparable, including chaotic vortical flows. We showed the unique computational merits of our approach to yield long-term, accurate, and robust predictions for large-scale Hamiltonian systems by rigorously enforcing symplectomorphism.

Submitted 19 February, 2022; v1 submitted 23 October, 2020; originally announced October 2020.

Comments: ICLR2021

arXiv:2007.08165 [pdf, other]

doi 10.1109/TASLP.2020.3008832

Audio Tagging by Cross Filtering Noisy Labels

Authors: Boqing Zhu, Kele Xu, Qiuqiang Kong, Huaimin Wang, Yuxing Peng

Abstract: High quality labeled datasets have allowed deep learning to achieve impressive results on many sound analysis tasks. Yet, it is labor-intensive to accurately annotate large amounts of audio data, and the dataset may contain noisy labels in practical settings. Meanwhile, deep neural networks are susceptible to incorrectly labeled data because of their outstanding memorization ability. In this paper, we present a novel framework, named CrossFilter, to combat the noisy labels problem for audio tagging. Multiple representations (such as Logmel and MFCC) are used as the input of our framework for providing more complementary information of the audio. Then, through the cooperation and interaction of two neural networks, we divide the dataset into curated and noisy subsets by incrementally picking out the possibly correctly labeled data from the noisy data. Moreover, our approach leverages multi-task learning on curated and noisy subsets with different loss functions to fully utilize the entire dataset. The noisy-robust loss function is employed to alleviate the adverse effects of incorrect labels. On both the audio tagging datasets FSDKaggle2018 and FSDKaggle2019, empirical results demonstrate the performance improvement compared with other competing approaches. On the FSDKaggle2018 dataset, our method achieves state-of-the-art performance and even surpasses the ensemble models.

Submitted 16 July, 2020; originally announced July 2020.

Comments: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing

arXiv:2006.12972 [pdf, ps, other]

Sparse Symplectically Integrated Neural Networks

Authors: Daniel M. DiPietro, Shiying Xiong, Bo Zhu

Abstract: We introduce Sparse Symplectically Integrated Neural Networks (SSINNs), a novel model for learning Hamiltonian dynamical systems from data. SSINNs combine fourth-order symplectic integration with a learned parameterization of the Hamiltonian obtained using sparse regression through a mathematically elegant function space. This allows for interpretable models that incorporate symplectic inductive biases and have low memory requirements. We evaluate SSINNs on four classical Hamiltonian dynamical problems: the Hénon-Heiles system, nonlinearly coupled oscillators, a multi-particle mass-spring system, and a pendulum system. Our results demonstrate promise in both system prediction and conservation of energy, often outperforming the current state-of-the-art black-box prediction techniques by an order of magnitude. Further, SSINNs successfully converge to true governing equations from highly limited and noisy data, demonstrating potential applicability in the discovery of new physical governing equations.

Submitted 28 October, 2020; v1 submitted 9 June, 2020; originally announced June 2020.

Comments: Accepted as a conference paper to NeurIPS 2020. Main paper has 9 pages and 4 figures
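
To make the role of symplectic integration in the abstract above concrete, the sketch below integrates a pendulum with a known Hamiltonian, H(q, p) = p^2/2 - cos(q), and compares energy drift against explicit Euler. It is a stand-in under stated assumptions: it uses the second-order leapfrog (Störmer-Verlet) scheme rather than the fourth-order integrator the paper describes, the Hamiltonian is fixed rather than learned, and all function names are hypothetical.

```python
import math

def grad_potential(q):
    """dV/dq for the pendulum potential V(q) = -cos(q)."""
    return math.sin(q)

def leapfrog_step(q, p, dt):
    """One step of the second-order symplectic leapfrog scheme."""
    p_half = p - 0.5 * dt * grad_potential(q)
    q_new = q + dt * p_half
    p_new = p_half - 0.5 * dt * grad_potential(q_new)
    return q_new, p_new

def euler_step(q, p, dt):
    """One step of explicit Euler, which is not symplectic."""
    return q + dt * p, p - dt * grad_potential(q)

def energy(q, p):
    return 0.5 * p * p - math.cos(q)

def max_energy_drift(step, q0=1.0, p0=0.0, dt=0.1, n_steps=1000):
    """Largest absolute deviation from the initial energy along the trajectory."""
    q, p = q0, p0
    e0 = energy(q, p)
    drift = 0.0
    for _ in range(n_steps):
        q, p = step(q, p, dt)
        drift = max(drift, abs(energy(q, p) - e0))
    return drift

leapfrog_drift = max_energy_drift(leapfrog_step)
euler_drift = max_energy_drift(euler_step)
```

With these settings the leapfrog energy error stays bounded and small, while the explicit Euler error grows over the run; this bounded-drift behavior is the inductive bias that symplectic schemes contribute.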

arXiv:2006.07900 [pdf, other]

ResOT: Resource-Efficient Oblique Trees for Neural Signal Classification

Authors: Bingzhao Zhu, Masoud Farivar, Mahsa Shoaran

Abstract: Classifiers that can be implemented on chip with minimal computational and memory resources are essential for edge computing in emerging applications such as medical and IoT devices. This paper introduces a machine learning model based on oblique decision trees to enable resource-efficient classification on a neural implant. By integrating model compression with probabilistic routing and implementing cost-aware learning, our proposed model could significantly reduce the memory and hardware cost compared to state-of-the-art models, while maintaining the classification accuracy. We trained the resource-efficient oblique tree with power-efficient regularization (ResOT-PE) on three neural classification tasks to evaluate the performance, memory, and hardware requirements. On seizure detection task, we were able to reduce the model size by 3.4X and the feature extraction cost by 14.6X compared to the ensemble of boosted trees, using the intracranial EEG from 10 epilepsy patients. In a second experiment, we tested the ResOT-PE model on tremor detection for Parkinson's disease, using the local field potentials from 12 patients implanted with a deep-brain stimulation (DBS) device. We achieved a comparable classification performance as the state-of-the-art boosted tree ensemble, while reducing the model size and feature extraction cost by 10.6X and 6.8X, respectively. We also tested on a 6-class finger movement detection task using ECoG recordings from 9 subjects, reducing the model size by 17.6X and feature computation cost by 5.1X. The proposed model can enable a low-power and memory-efficient implementation of classifiers for real-time neurological disease detection and motor decoding.

Submitted 14 June, 2020; originally announced June 2020.

arXiv:2006.05044 [pdf, other]

Neural Physicist: Learning Physical Dynamics from Image Sequences

Authors: Baocheng Zhu, Shijun Wang, James Zhang

Abstract: We present a novel architecture named Neural Physicist (NeurPhy) to learn physical dynamics directly from image sequences using deep neural networks. For any physical system, given the global system parameters, the time evolution of states is governed by the underlying physical laws. How to learn meaningful system representations in an end-to-end way and estimate accurate state transition dynamics facilitating long-term prediction have been long-standing challenges. In this paper, by leveraging recent progress in representation learning and state space models (SSMs), we propose NeurPhy, which uses a variational auto-encoder (VAE) to extract the underlying Markovian dynamic state at each time step, a neural process (NP) to extract the global system parameters, and a non-linear non-recurrent stochastic state space model to learn the physical dynamic transition. We apply NeurPhy to two physical experimental environments, i.e., damped pendulum and planetary orbits motion, and achieve promising results. Our model can not only extract the physically meaningful state representations, but also learn the state transition dynamics enabling long-term predictions for unseen image sequences. Furthermore, from the manifold dimension of the latent state space, we can easily identify the degree of freedom (DoF) of the underlying physical systems.

Submitted 9 June, 2020; originally announced June 2020.

Comments: 19 pages, 20 figures

arXiv:2005.14073 [pdf, other]

Robust estimation via generalized quasi-gradients

Authors: Banghua Zhu, Jiantao Jiao, Jacob Steinhardt

Abstract: We explore why many recently proposed robust estimation problems are efficiently solvable, even though the underlying optimization problems are non-convex. We study the loss landscape of these robust estimation problems, and identify the existence of "generalized quasi-gradients". Whenever these quasi-gradients exist, a large family of low-regret algorithms are guaranteed to approximate the global minimum; this includes the commonly-used filtering algorithm. For robust mean estimation of distributions under bounded covariance, we show that any first-order stationary point of the associated optimization problem is an approximate global minimum if and only if the corruption level $ε < 1/3$. Consequently, any optimization algorithm that approaches a stationary point yields an efficient robust estimator with breakdown point $1/3$. With careful initialization and step size, we improve this to $1/2$, which is optimal. For other tasks, including linear regression and joint mean and covariance estimation, the loss landscape is more rugged: there are stationary points arbitrarily far from the global minimum. Nevertheless, we show that generalized quasi-gradients exist and construct efficient algorithms. These algorithms are simpler than previous ones in the literature, and for linear regression we improve the estimation error from $O(\sqrt{ε})$ to the optimal rate of $O(ε)$ for small $ε$ assuming certified hypercontractivity. For mean estimation with near-identity covariance, we show that a simple gradient descent algorithm achieves breakdown point $1/3$ and iteration complexity $\tilde{O}(d/ε^2)$.

Submitted 28 May, 2020; originally announced May 2020.

arXiv:2005.09195 [pdf, other]

Riemannian Proximal Policy Optimization

Authors: Shijun Wang, Baocheng Zhu, Chen Li, Mingzhe Wu, James Zhang, Wei Chu, Yuan Qi

Abstract: In this paper, we propose a general Riemannian proximal optimization algorithm with guaranteed convergence to solve Markov decision process (MDP) problems. To model policy functions in MDP, we employ a Gaussian mixture model (GMM) and formulate it as a nonconvex optimization problem in the Riemannian space of positive semidefinite matrices. For two given policy functions, we also provide a lower bound on policy improvement by using bounds derived from the Wasserstein distance of GMMs. Preliminary experiments show the efficacy of our proposed Riemannian proximal policy optimization algorithm.

Submitted 18 May, 2020; originally announced May 2020.

Comments: 12 pages, 1 figure

arXiv:2005.09194 [pdf, other]

doi 10.1109/IJCNN.2019.8852367

A Riemannian Primal-dual Algorithm Based on Proximal Operator and its Application in Metric Learning

Authors: Shijun Wang, Baocheng Zhu, Lintao Ma, Yuan Qi

Abstract: In this paper, we consider optimizing a smooth, convex, lower semicontinuous function in Riemannian space with constraints. To solve the problem, we first convert it to a dual problem and then propose a general primal-dual algorithm to optimize the primal and dual variables iteratively. In each optimization iteration, we employ a proximal operator to search for the optimal solution in the primal space. We prove convergence of the proposed algorithm and show its non-asymptotic convergence rate. By utilizing the proposed primal-dual optimization technique, we propose a novel metric learning algorithm which learns an optimal feature transformation matrix in the Riemannian space of positive definite matrices. Preliminary experimental results on an optimal fund selection problem in fund of funds (FOF) management for quantitative investment showed its efficacy.

Submitted 18 May, 2020; originally announced May 2020.

Comments: 8 pages, 2 figures, published as a conference paper in 2019 International Joint Conference on Neural Networks (IJCNN)

arXiv:2005.06546 [pdf]

Triaging moderate COVID-19 and other viral pneumonias from routine blood tests

Authors: Forrest Sheng Bao, Youbiao He, Jie Liu, Yuanfang Chen, Qian Li, Christina R. Zhang, Lei Han, Baoli Zhu, Yaorong Ge, Shi Chen, Ming Xu, Liu Ouyang

Abstract: COVID-19 is sweeping the world with deadly consequences. Its contagious nature and clinical similarity to other pneumonias make separating subjects who have contracted COVID-19 from those with non-COVID-19 viral pneumonia a priority and a challenge. However, COVID-19 testing has been greatly limited by the availability and cost of existing methods, even in developed countries like the US. Intrigued by the wide availability of routine blood tests, we propose to leverage them for COVID-19 testing using the power of machine learning. Two proven-robust machine learning model families, random forests (RFs) and support vector machines (SVMs), are employed to tackle the challenge. Trained on blood data from 208 moderate COVID-19 subjects and 86 subjects with non-COVID-19 moderate viral pneumonia, the best result is obtained in an SVM-based classifier with an accuracy of 84%, a sensitivity of 88%, a specificity of 80%, and a precision of 92%. The results are found explainable from both machine learning and medical perspectives. A privacy-protected web portal is set up to help medical personnel in their practice and the trained models are released for developers to further build other applications. We hope our results can help the world fight this pandemic and welcome clinical verification of our approach on larger populations.

Submitted 13 May, 2020; originally announced May 2020.

ACM Class: I.5.4

arXiv:2001.07805 [pdf, other]

When does the Tukey median work?

Authors: Banghua Zhu, Jiantao Jiao, Jacob Steinhardt

Abstract: We analyze the performance of the Tukey median estimator under total variation (TV) distance corruptions. Previous results show that under Huber's additive corruption model, the breakdown point is 1/3 for high-dimensional halfspace-symmetric distributions. We show that under TV corruptions, the breakdown point reduces to 1/4 for the same set of distributions. We also show that a certain projection algorithm can attain the optimal breakdown point of 1/2. Both the Tukey median estimator and the projection algorithm achieve sample complexity linear in dimension.

Submitted 31 March, 2020; v1 submitted 21 January, 2020; originally announced January 2020.
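
For readers unfamiliar with the estimator in the abstract above, the Tukey median is a point of maximal halfspace (Tukey) depth, where the depth of a point is the smallest fraction of the sample lying in any closed halfspace whose boundary passes through it. The sketch below is a rough approximation, not the paper's procedure: it restricts candidates to the sample points and replaces the minimum over all directions with a minimum over randomly drawn directions (so it only upper-bounds the true depth). The function names, direction count, and seed are assumptions.

```python
import math
import random

def approx_tukey_depth(x, points, directions):
    """Upper bound on the halfspace depth of x using a finite set of directions."""
    n = len(points)
    depth = 1.0
    for u in directions:
        proj_x = sum(ui * xi for ui, xi in zip(u, x))
        # Fraction of the sample in the closed halfspace {p : <u, p> >= <u, x>}.
        above = sum(1 for p in points
                    if sum(ui * pi for ui, pi in zip(u, p)) >= proj_x)
        depth = min(depth, above / n)
    return depth

def approx_tukey_median(points, n_directions=200, seed=0):
    """Sample point with the largest approximate halfspace depth."""
    rng = random.Random(seed)
    dim = len(points[0])
    directions = []
    for _ in range(n_directions):
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in g))
        u = [v / norm for v in g]
        directions.append(u)
        directions.append([-v for v in u])  # include the opposite halfspace
    return max(points, key=lambda x: approx_tukey_depth(x, points, directions))

# A symmetric 5x5 grid of inliers plus three far-away corrupted points.
inliers = [(x, y) for x in range(-2, 3) for y in range(-2, 3)]
outliers = [(50, 50)] * 3
sample = inliers + outliers

median = approx_tukey_median(sample)
mean_x = sum(p[0] for p in sample) / len(sample)
```

In this toy sample the coordinate-wise mean is pulled toward the corrupted points (its first coordinate exceeds 5), whereas the depth-based estimate stays at the center of the inliers.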

arXiv:1909.08755 [pdf, ps, other]

Generalized Resilience and Robust Statistics

Authors: Banghua Zhu, Jiantao Jiao, Jacob Steinhardt

Abstract: Robust statistics traditionally focuses on outliers, or perturbations in total variation distance. However, a dataset could be corrupted in many other ways, such as systematic measurement errors and missing covariates. We generalize the robust statistics approach to consider perturbations under any Wasserstein distance, and show that robust estimation is possible whenever a distribution's population statistics are robust under a certain family of friendly perturbations. This generalizes a property called resilience previously employed in the special case of mean estimation with outliers. We justify the generalized resilience property by showing that it holds under moment or hypercontractive conditions. Even in the total variation case, these subsume conditions in the literature for mean estimation, regression, and covariance estimation; the resulting analysis simplifies and sometimes improves these known results in both population limit and finite-sample rate. Our robust estimators are based on minimum distance (MD) functionals (Donoho and Liu, 1988), which project onto a set of distributions under a discrepancy related to the perturbation. We present two approaches for designing MD estimators with good finite-sample rates: weakening the discrepancy and expanding the set of distributions. We also present connections to Gao et al. (2019)'s recent analysis of generative adversarial networks for robust estimation.

Submitted 13 December, 2020; v1 submitted 18 September, 2019; originally announced September 2019.

arXiv:1903.00906 [pdf, other]

Understanding Feature Selection and Feature Memorization in Recurrent Neural Networks

Authors: Bokang Zhu, Richong Zhang, Dingkun Long, Yongyi Mao

Abstract: In this paper, we propose a test, called Flagged-1-Bit (F1B) test, to study the intrinsic capability of recurrent neural networks in sequence learning. Four different recurrent network models are studied both analytically and experimentally using this test. Our results suggest that in general there exists a conflict between feature selection and feature memorization in sequence learning. Such a conflict can be resolved either using a gating mechanism as in LSTM, or by increasing the state dimension as in Vanilla RNN. Gated models resolve this conflict by adaptively adjusting their state-update equations, whereas Vanilla RNN resolves this conflict by assigning different dimensions different tasks. Insights into feature selection and memorization in recurrent networks are given.

Submitted 3 March, 2019; originally announced March 2019.

arXiv:1901.09465 [pdf, other]

Deconstructing Generative Adversarial Networks

Authors: Banghua Zhu, Jiantao Jiao, David Tse

Abstract: We deconstruct the performance of GANs into three components: 1. Formulation: we propose a perturbation view of the population target of GANs. Building on this interpretation, we show that GANs can be viewed as a generalization of the robust statistics framework, and propose a novel GAN architecture, termed as Cascade GANs, to provably recover meaningful low-dimensional generator approximations when the real distribution is high-dimensional and corrupted by outliers. 2. Generalization: given a population target of GANs, we design a systematic principle, projection under admissible distance, to design GANs to meet the population requirement using finite samples. We implement our principle in three cases to achieve polynomial and sometimes near-optimal sample complexities: (1) learning an arbitrary generator under an arbitrary pseudonorm; (2) learning a Gaussian location family under TV distance, where we utilize our principle to provide a new proof for the optimality of Tukey median viewed as GANs; (3) learning a low-dimensional Gaussian approximation of a high-dimensional arbitrary distribution under Wasserstein distance. We demonstrate a fundamental trade-off in the approximation error and statistical error in GANs, and show how to apply our principle with empirical samples to predict how many samples are sufficient for GANs in order not to suffer from the discriminator winning problem. 3. Optimization: we demonstrate alternating gradient descent is provably not locally asymptotically stable in optimizing the GAN formulation of PCA. We diagnose the problem as the minimax duality gap being non-zero, and propose a new GAN architecture whose duality gap is zero, where the value of the game is equal to the previous minimax value (not the maximin value). We prove the new GAN architecture is globally asymptotically stable in optimization under alternating gradient descent.

Submitted 19 May, 2019; v1 submitted 27 January, 2019; originally announced January 2019.

arXiv:1609.09272 [pdf, ps, other]

A New Algorithm for Circulant Rational Covariance Extension and Applications to Finite-interval Smoothing

Authors: Giorgio Picci, Bin Zhu

Abstract: The partial stochastic realization of periodic processes from finite covariance data has recently been solved by Lindquist and Picci based on convex optimization of a generalized entropy functional. The meaning and the role of this criterion have an unclear origin. In this paper we propose a solution based on a nonlinear generalization of the classical Yule-Walker type equations and on a new iterative algorithm which is shown to converge to the same (unique) solution of the variational problem. This provides a conceptual link to the variational principles and at the same time yields a robust algorithm which can for example be successfully applied to finite-interval smoothing problems providing a simpler procedure if compared with the classical Riccati-based calculations.

Submitted 29 September, 2016; originally announced September 2016.

Comments: Submitted

arXiv:1212.0181 [pdf, ps, other]

Stochastic Volatility Regression for Functional Data Dynamics

Authors: Bin Zhu, David B. Dunson

Abstract: Although there are many methods for functional data analysis (FDA), little emphasis is put on characterizing variability among volatilities of individual functions. In particular, certain individuals exhibit erratic swings in their trajectory while other individuals have more stable trajectories. There is evidence of such volatility heterogeneity in blood pressure trajectories during pregnancy, for example, and reason to suspect that volatility is a biologically important feature. Most FDA models implicitly assume similar or identical smoothness of the individual functions, and hence can lead to misleading inferences on volatility and an inadequate representation of the functions. We propose a novel class of FDA models characterized using hierarchical stochastic differential equations. We model the derivatives of a mean function and deviation functions using Gaussian processes, while also allowing covariate dependence including on the volatilities of the deviation functions. Following a Bayesian approach to inference, a Markov chain Monte Carlo algorithm is used for posterior computation. The methods are tested on simulated data and applied to blood pressure trajectories during pregnancy.

Submitted 1 December, 2012; originally announced December 2012.

arXiv:1201.5169 [pdf, ps, other]

Signal extraction and breakpoint identification for array CGH data using robust state space model

Authors: Bin Zhu, Jeremy M. G. Taylor, Peter X. -K. Song

Abstract: Array comparative genomic hybridization (CGH) is a high resolution technique to assess DNA copy number variation. Identifying breakpoints where copy number changes will enhance the understanding of the pathogenesis of human diseases, such as cancers. However, the biological variation and experimental errors contained in array CGH data may lead to false positive identification of breakpoints. We propose a robust state space model for array CGH data analysis. The model consists of two equations: an observation equation and a state equation, in which both the measurement error and evolution error are specified to follow t-distributions with small degrees of freedom. The completely unspecified CGH profiles are estimated by a Markov chain Monte Carlo (MCMC) algorithm. Breakpoints and outliers are identified by a novel backward selection procedure based on posterior draws of the CGH profiles. Compared to three other popular methods, our method demonstrates several desired features, including false positive rate control, robustness against outliers, and superior power of breakpoint detection. All these properties are illustrated using simulated and real datasets.

Submitted 24 January, 2012; originally announced January 2012.

arXiv:1201.4403 [pdf, ps, other]

Locally Adaptive Bayes Nonparametric Regression via Nested Gaussian Processes

Authors: Bin Zhu, David B. Dunson

Abstract: We propose a nested Gaussian process (nGP) as a locally adaptive prior for Bayesian nonparametric regression. Specified through a set of stochastic differential equations (SDEs), the nGP imposes a Gaussian process prior for the function's $m$th-order derivative. The nesting comes in through including a local instantaneous mean function, which is drawn from another Gaussian process inducing adaptivity to locally-varying smoothness. We discuss the support of the nGP prior in terms of the closure of a reproducing kernel Hilbert space, and consider theoretical properties of the posterior. The posterior mean under the nGP prior is shown to be equivalent to the minimizer of a nested penalized sum-of-squares involving penalties for both the global and local roughness of the function. Using highly-efficient Markov chain Monte Carlo for posterior inference, the proposed method performs well in simulation studies compared to several alternatives, and is scalable to massive data, illustrated through a proteomics application.

Submitted 20 January, 2012; originally announced January 2012.

arXiv:1111.5563 [pdf, ps, other]

Adverse Subpopulation Regression for Multivariate Outcomes with High-Dimensional Predictors

Authors: Bin Zhu, David B. Dunson, Allison E. Ashley-Koch

Abstract: Biomedical studies have a common interest in assessing relationships between multiple related health outcomes and high-dimensional predictors. For example, in reproductive epidemiology, one may collect pregnancy outcomes such as length of gestation and birth weight and predictors such as single nucleotide polymorphisms in multiple candidate genes and environmental exposures. In such settings, there is a need for simple yet flexible methods for selecting true predictors of adverse health responses from a high-dimensional set of candidate predictors. To address this problem, one may either consider linear regression models for the continuous outcomes or convert these outcomes into binary indicators of adverse responses using pre-defined cutoffs. The former strategy has the disadvantage of often leading to a poorly fitting model that does not predict risk well, while the latter approach can be very sensitive to the cutoff choice. As a simple yet flexible alternative, we propose a method for adverse subpopulation regression (ASPR), which relies on a two-component latent class model, with the dominant component corresponding to (presumed) healthy individuals and the risk of falling in the minority component characterized via a logistic regression. The logistic regression model is designed to accommodate high-dimensional predictors, as occur in studies with a large number of gene by environment interactions, through use of a flexible nonparametric multiple shrinkage approach. A Gibbs sampler is developed for posterior computation. The methods are evaluated using simulation studies and applied to a genetic epidemiology study of pregnancy outcomes.

Submitted 23 November, 2011; originally announced November 2011.

arXiv:1111.5551 [pdf, ps, other]

Generalized Admixture Mapping for Complex Traits

Authors: Bin Zhu, Allison E. Ashley-Koch, David B. Dunson

Abstract: Admixture mapping is a popular tool to identify regions of the genome associated with traits in a recently admixed population. Existing methods have been developed primarily for identification of a single locus influencing a dichotomous trait within a case-control study design. We propose a generalized admixture mapping (GLEAM) approach, a flexible and powerful regression method for both quantitative and qualitative traits, which is able to test for association between the trait and local ancestries in multiple loci simultaneously and adjust for covariates. The new method is based on the generalized linear model and utilizes a quadratic normal moment prior to incorporate admixture prior information. Through simulation, we demonstrate that GLEAM achieves lower type I error rate and higher power than existing methods both for qualitative traits and more significantly for quantitative traits. We applied GLEAM to genome-wide SNP data from the Illumina African American panel derived from a cohort of black women participating in the Healthy Pregnancy, Healthy Baby study and identified a locus on chromosome 2 associated with the averaged maternal mean arterial pressure during 24 to 28 weeks of pregnancy.

Submitted 23 November, 2011; originally announced November 2011.

Showing 1–41 of 41 results for author: Zhu, B