Search | arXiv e-print repository

arXiv:2406.19049 [pdf, other]

Accuracy on the wrong line: On the pitfalls of noisy data for out-of-distribution generalisation

Authors: Amartya Sanyal, Yaxi Hu, Yaodong Yu, Yian Ma, Yixin Wang, Bernhard Schölkopf

Abstract: "Accuracy-on-the-line" is a widely observed phenomenon in machine learning, where a model's accuracy on in-distribution (ID) and out-of-distribution (OOD) data is positively correlated across different hyperparameters and data configurations. But when does this useful relationship break down? In this work, we explore its robustness. The key observation is that noisy data and the presence of nuisance features can be sufficient to shatter the Accuracy-on-the-line phenomenon. In these cases, ID and OOD accuracy can become negatively correlated, leading to "Accuracy-on-the-wrong-line". This phenomenon can also occur in the presence of spurious (shortcut) features, which tend to overshadow the more complex signal (core, non-spurious) features, resulting in a large nuisance feature space. Moreover, scaling to larger datasets does not mitigate this undesirable behavior and may even exacerbate it. We formally prove a lower bound on Out-of-distribution (OOD) error in a linear classification model, characterizing the conditions on the noise and nuisance features for a large OOD error. We finally demonstrate this phenomenon across both synthetic and real datasets with noisy data and nuisance features.

Submitted 27 June, 2024; originally announced June 2024.

arXiv:2406.03296 [pdf, other]

Multi-relational Network Autoregression Model with Latent Group Structures

Authors: Yimeng Ren, Xuening Zhu, Ganggang Xu, Yanyuan Ma

Abstract: Multi-relational networks among entities are frequently observed in the era of big data. Quantifying the effects of multiple networks has attracted significant research interest recently. In this work, we model multiple network effects through an autoregressive framework for tensor-valued time series. To characterize the potential heterogeneity of the networks and handle the high dimensionality of the time series data simultaneously, we assume a separate group structure for entities in each network and estimate all group memberships in a data-driven fashion. Specifically, we propose a group tensor network autoregression (GTNAR) model, which assumes that within each network, entities in the same group share the same set of model parameters, and the parameters differ across networks. An iterative algorithm is developed to estimate the model parameters and the latent group memberships simultaneously. Theoretically, we show that the group-wise parameters and group memberships can be consistently estimated when the group numbers are correctly specified or possibly over-specified. An information criterion for group number estimation of each network is also provided to consistently select the group numbers. Lastly, we implement the method on a Yelp dataset to illustrate the usefulness of the method.

Submitted 5 June, 2024; originally announced June 2024.

Comments: arXiv admin note: text overlap with arXiv:2212.02107

arXiv:2406.00920 [pdf, ps, other]

Demystifying SGD with Doubly Stochastic Gradients

Authors: Kyurae Kim, Joohwan Ko, Yi-An Ma, Jacob R. Gardner

Abstract: Optimization objectives in the form of a sum of intractable expectations are rising in importance (e.g., diffusion models, variational autoencoders, and many more), a setting also known as "finite sum with infinite data." For these problems, a popular strategy is to employ SGD with doubly stochastic gradients (doubly SGD): the expectations are estimated using the gradient estimator of each component, while the sum is estimated by subsampling over these estimators. Despite its popularity, little is known about the convergence properties of doubly SGD, except under strong assumptions such as bounded variance. In this work, we establish the convergence of doubly SGD with independent minibatching and random reshuffling under general conditions, which encompasses dependent component gradient estimators. In particular, for dependent estimators, our analysis allows fine-grained analysis of the effect of correlations. As a result, under a per-iteration computational budget of $b \times m$, where $b$ is the minibatch size and $m$ is the number of Monte Carlo samples, our analysis suggests where one should invest most of the budget in general. Furthermore, we prove that random reshuffling (RR) improves the complexity dependence on the subsampling noise.

Submitted 2 June, 2024; originally announced June 2024.

Comments: Accepted to ICML'24

arXiv:2405.16734 [pdf, other]

Faster Sampling via Stochastic Gradient Proximal Sampler

Authors: Xunpeng Huang, Difan Zou, Yi-An Ma, Hanze Dong, Tong Zhang

Abstract: Stochastic gradients have been widely integrated into Langevin-based methods to improve their scalability and efficiency in solving large-scale sampling problems. However, the proximal sampler, which exhibits much faster convergence than Langevin-based algorithms in the deterministic setting (Lee et al., 2021), has yet to be explored in its stochastic variants. In this paper, we study the Stochastic Proximal Samplers (SPS) for sampling from non-log-concave distributions. We first establish a general framework for implementing stochastic proximal samplers and develop the convergence theory accordingly. We show that the convergence to the target distribution can be guaranteed as long as the second moment of the algorithm trajectory is bounded and restricted Gaussian oracles can be well approximated. We then provide two implementable variants based on Stochastic gradient Langevin dynamics (SGLD) and Metropolis-adjusted Langevin algorithm (MALA), giving rise to SPS-SGLD and SPS-MALA. We further show that SPS-SGLD and SPS-MALA can achieve $\epsilon$-sampling error in total variation (TV) distance within $\tilde{\mathcal{O}}(d\epsilon^{-2})$ and $\tilde{\mathcal{O}}(d^{1/2}\epsilon^{-2})$ gradient complexities, which outperform the best-known result by at least an $\tilde{\mathcal{O}}(d^{1/3})$ factor. This enhancement in performance is corroborated by our empirical studies on synthetic data with various dimensions, demonstrating the efficiency of our proposed algorithm.

Submitted 26 May, 2024; originally announced May 2024.

Comments: 48 pages, 2 figures, 5 tables

arXiv:2405.16387 [pdf, other]

Reverse Transition Kernel: A Flexible Framework to Accelerate Diffusion Inference

Authors: Xunpeng Huang, Difan Zou, Hanze Dong, Yi Zhang, Yi-An Ma, Tong Zhang

Abstract: To generate data from trained diffusion models, most inference algorithms, such as DDPM, DDIM, and other variants, rely on discretizing the reverse SDEs or their equivalent ODEs. In this paper, we view such approaches as decomposing the entire denoising diffusion process into several segments, each corresponding to a reverse transition kernel (RTK) sampling subproblem. Specifically, DDPM uses a Gaussian approximation for the RTK, resulting in low per-subproblem complexity but requiring a large number of segments (i.e., subproblems), which is conjectured to be inefficient. To address this, we develop a general RTK framework that enables a more balanced subproblem decomposition, resulting in $\tilde O(1)$ subproblems, each with strongly log-concave targets. We then propose leveraging two fast sampling algorithms, the Metropolis-Adjusted Langevin Algorithm (MALA) and Underdamped Langevin Dynamics (ULD), for solving these strongly log-concave subproblems. This gives rise to the RTK-MALA and RTK-ULD algorithms for diffusion inference. In theory, we further develop the convergence guarantees for RTK-MALA and RTK-ULD in total variation (TV) distance: RTK-ULD can achieve $\epsilon$ target error within $\tilde{\mathcal O}(d^{1/2}\epsilon^{-1})$ under mild conditions, and RTK-MALA enjoys a $\mathcal{O}(d^{2}\log(d/\epsilon))$ convergence rate under slightly stricter conditions. These theoretical results surpass the state-of-the-art convergence rates for diffusion inference and are well supported by numerical experiments.

Submitted 25 May, 2024; originally announced May 2024.

Comments: 68 pages, 2 figures

arXiv:2405.13481 [pdf, other]

Locally Private Estimation with Public Features

Authors: Yuheng Ma, Ke Jia, Hanfang Yang

Abstract: We initiate the study of locally differentially private (LDP) learning with public features. We define semi-feature LDP, where some features are publicly available while the remaining ones, along with the label, require protection under local differential privacy. Under semi-feature LDP, we demonstrate that the mini-max convergence rate for non-parametric regression is significantly reduced compared to that of classical LDP. Then we propose HistOfTree, an estimator that fully leverages the information contained in both public and private features. Theoretically, HistOfTree reaches the mini-max optimal convergence rate. Empirically, HistOfTree achieves superior performance on both synthetic and real data. We also explore scenarios where users have the flexibility to select features for protection manually. In such cases, we propose an estimator and a data-driven parameter tuning strategy, leading to analogous theoretical and empirical results.

Submitted 22 May, 2024; originally announced May 2024.

arXiv:2405.10461 [pdf, other]

Prediction in Measurement Error Models

Authors: Fei Jiang, Yanyuan Ma

Abstract: We study the well-known, difficult problem of prediction in measurement error models. By directly targeting the prediction interval instead of the point prediction, we construct a prediction interval by providing estimators of both the center and the length of the interval that achieves a pre-determined prediction level. The construction procedure requires a working model for the distribution of the variable prone to error. If the working model is correct, the prediction interval estimator obtains the smallest variability in terms of assessing the true center and length. If the working model is incorrect, the prediction interval estimation is still consistent. We further study how the length of the prediction interval depends on the choice of the true prediction interval center and provide guidance on obtaining minimal prediction interval length. Numerical experiments are conducted to illustrate the performance, and we apply our method to predict the concentration of Abeta1-12 in cerebrospinal fluid in an Alzheimer's disease data set.

Submitted 16 May, 2024; originally announced May 2024.

arXiv:2405.06889 [pdf, other]

Tuning parameter selection for the adaptive nuclear norm regularized trace regression

Authors: Pan Shang, Lingchen Kong, Yiting Ma

Abstract: Regularized models have been applied in many areas, particularly where high-dimensional data sets are common. Because the tuning parameter determines the theoretical performance and computational efficiency of regularized models, tuning parameter selection is a basic and important issue. We consider tuning parameter selection for adaptive nuclear norm regularized trace regression, which is achieved by the Bayesian information criterion (BIC). The proposed BIC is established with the help of an unbiased estimator of degrees of freedom. Under some regularity conditions, this BIC is proved to achieve the rank consistency of the tuning parameter selection. That is, the model solution under the selected tuning parameter converges to the true solution and has the same rank as the true solution in probability. Some numerical results are presented to evaluate the performance of the proposed BIC on tuning parameter selection.

Submitted 10 May, 2024; originally announced May 2024.

arXiv:2404.08913 [pdf, ps, other]

On the best approximation by finite Gaussian mixtures

Authors: Yun Ma, Yihong Wu, Pengkun Yang

Abstract: We consider the problem of approximating a general Gaussian location mixture by finite mixtures. The minimum order of finite mixtures that achieve a prescribed accuracy (measured by various $f$-divergences) is determined within constant factors for the family of mixing distributions with compact support or under appropriate assumptions on the tail probability, including subgaussian and subexponential. While the upper bound is achieved using the technique of local moment matching, the lower bound is established by relating the best approximation error to the low-rank approximation of certain trigonometric moment matrices, followed by a refined spectral analysis of their minimum eigenvalue. In the case of Gaussian mixing distributions, this result corrects a previous lower bound in [Allerton Conference 48 (2010) 620-628].

Submitted 13 April, 2024; originally announced April 2024.

arXiv:2404.02446 [pdf, other]

Masked Completion via Structured Diffusion with White-Box Transformers

Authors: Druv Pai, Ziyang Wu, Sam Buchanan, Yaodong Yu, Yi Ma

Abstract: Modern learning frameworks often train deep neural networks with massive amounts of unlabeled data to learn representations by solving simple pretext tasks, then use the representations as foundations for downstream tasks. These networks are empirically designed; as such, they are usually not interpretable, their representations are not structured, and their designs are potentially redundant. White-box deep networks, in which each layer explicitly identifies and transforms structures in the data, present a promising alternative. However, existing white-box architectures have only been shown to work at scale in supervised settings with labeled data, such as classification. In this work, we provide the first instantiation of the white-box design paradigm that can be applied to large-scale unsupervised representation learning. We do this by exploiting a fundamental connection between diffusion, compression, and (masked) completion, deriving a deep transformer-like masked autoencoder architecture, called CRATE-MAE, in which the role of each layer is mathematically fully interpretable: they transform the data distribution to and from a structured representation. Extensive empirical evaluations confirm our analytical insights. CRATE-MAE demonstrates highly promising performance on large-scale imagery datasets while using only ~30% of the parameters compared to the standard masked autoencoder with the same model configuration. The representations learned by CRATE-MAE have explicit structure and also contain semantic meaning.
Code is available at https://github.com/Ma-Lab-Berkeley/CRATE .

Submitted 3 April, 2024; originally announced April 2024.

Comments: To be published at ICLR 2024; 44 pages. arXiv admin note: substantial text overlap with arXiv:2311.13110

arXiv:2403.11163 [pdf, ps, other]

doi 10.1080/24754269.2024.2343151

A Selective Review on Statistical Methods for Massive Data Computation: Distributed Computing, Subsampling, and Minibatch Techniques

Authors: Xuetong Li, Yuan Gao, Hong Chang, Danyang Huang, Yingying Ma, Rui Pan, Haobo Qi, Feifei Wang, Shuyuan Wu, Ke Xu, **g Zhou, Xuening Zhu, Yingqiu Zhu, Hansheng Wang

Abstract: This paper presents a selective review of statistical computation methods for massive data analysis. A large number of statistical methods for massive data computation have been developed rapidly in the past decades. In this work, we focus on three categories of statistical computation methods: (1) distributed computing, (2) subsampling methods, and (3) minibatch gradient techniques. The first class of literature is about distributed computing and focuses on the situation where the dataset is too large to be comfortably handled by a single computer. In this case, a distributed computation system with multiple computers has to be utilized. The second class of literature is about subsampling methods and concerns the situation where the dataset is small enough to be stored on a single computer but too large to be easily processed in its memory as a whole. The last class of literature studies minibatch gradient-related optimization techniques, which have been extensively used for optimizing various deep learning models.

Submitted 17 March, 2024; originally announced March 2024.

arXiv:2402.18533 [pdf, other]

Constructing Bayesian Optimal Designs for Discrete Choice Experiments by Simulated Annealing

Authors: Yicheng Mao, Roselinde Kessels, Tom van der Zanden

Abstract: Discrete Choice Experiments (DCEs) investigate the attributes that influence individuals' choices when selecting among various options. To enhance the quality of the estimated choice models, researchers opt for Bayesian optimal designs that utilize existing information about the attributes' preferences. Given the nonlinear nature of choice models, the construction of an appropriate design requires efficient algorithms. Among these, the Coordinate-Exchange (CE) algorithm is most commonly employed for constructing designs based on the multinomial logit model. Since this is a hill-climbing algorithm, obtaining better designs necessitates multiple random starting designs. This approach increases the algorithm's run-time, but may not lead to a significant improvement in results. We propose the use of a Simulated Annealing (SA) algorithm to construct Bayesian D-optimal designs. This algorithm accepts both superior and inferior solutions, avoiding premature convergence and allowing a more thorough exploration of potential designs. Consequently, it ultimately obtains higher-quality choice designs within the same time-frame. Our work represents the first application of an SA algorithm in constructing Bayesian optimal designs for DCEs. Through computational experiments and a real-life case study, we demonstrate that the SA designs consistently outperform the CE designs in terms of Bayesian D-efficiency, especially when the prior preference information is highly uncertain.

Submitted 28 February, 2024; originally announced February 2024.

arXiv:2402.15086 [pdf, other]

A modified debiased inverse-variance weighted estimator in two-sample summary-data Mendelian randomization

Authors: Youpeng Su, Siqi Xu, Yilei Ma, ** Yin, Wing Kam Fung, Hongwei Jiang, Peng Wang

Abstract: Mendelian randomization uses genetic variants as instrumental variables to make causal inferences about the effects of modifiable risk factors on diseases from observational data. One of the major challenges in Mendelian randomization is that many genetic variants are only modestly or even weakly associated with the risk factor of interest, a setting known as many weak instruments. Many existing methods, such as the popular inverse-variance weighted (IVW) method, could be biased when the instrument strength is weak. To address this issue, the debiased IVW (dIVW) estimator, which is shown to be robust to many weak instruments, was recently proposed. However, this estimator still has non-ignorable bias when the effective sample size is small. In this paper, we propose a modified debiased IVW (mdIVW) estimator by multiplying the original dIVW estimator by a modification factor. After this simple correction, we show that the bias of the mdIVW estimator converges to zero at a faster rate than that of the dIVW estimator under some regularity conditions. Moreover, the mdIVW estimator has smaller variance than the dIVW estimator. We further extend the proposed method to account for the presence of instrumental variable selection and balanced horizontal pleiotropy. We demonstrate the improvement of the mdIVW estimator over the dIVW estimator through extensive simulation studies and real data analysis.

Submitted 18 March, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

Comments: 33 pages, 6 figures

arXiv:2402.03726 [pdf, other]

Learning Granger Causality from Instance-wise Self-attentive Hawkes Processes

Authors: Dongxia Wu, Tsuyoshi Idé, Aurélie Lozano, Georgios Kollias, Jiří Navrátil, Naoki Abe, Yi-An Ma, Rose Yu

Abstract: We address the problem of learning Granger causality from asynchronous, interdependent, multi-type event sequences. In particular, we are interested in discovering instance-level causal structures in an unsupervised manner. Instance-level causality identifies causal relationships among individual events, providing more fine-grained information for decision-making. Existing work in the literature either requires strong assumptions, such as linearity in the intensity function, or heuristically defined model parameters that do not necessarily meet the requirements of Granger causality. We propose Instance-wise Self-Attentive Hawkes Processes (ISAHP), a novel deep learning framework that can directly infer the Granger causality at the event instance level. ISAHP is the first neural point process model that meets the requirements of Granger causality. It leverages the self-attention mechanism of the transformer to align with the principles of Granger causality. We empirically demonstrate that ISAHP is capable of discovering complex instance-level causal structures that cannot be handled by classical models. We also show that ISAHP achieves state-of-the-art performance in proxy tasks involving type-level causal discovery and instance-level event type prediction.

Submitted 29 February, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

arXiv:2402.01887 [pdf, other]

On f-Divergence Principled Domain Adaptation: An Improved Framework

Authors: Ziqiao Wang, Yongyi Mao

Abstract: Unsupervised domain adaptation (UDA) plays a crucial role in addressing distribution shifts in machine learning. In this work, we improve the theoretical foundations of UDA proposed by Acuna et al. (2021) by refining their f-divergence-based discrepancy and additionally introducing a new measure, f-domain discrepancy (f-DD). By removing the absolute value function and incorporating a scaling parameter, f-DD yields novel target error and sample complexity bounds, allowing us to recover previous KL-based results and bridging the gap between algorithms and theory presented in Acuna et al. (2021). Leveraging a localization technique, we also develop a fast-rate generalization bound. Empirical results demonstrate the superior performance of f-DD-based domain learning algorithms over previous works in popular UDA benchmarks.

Submitted 2 February, 2024; originally announced February 2024.

arXiv:2402.01710 [pdf]

Exploring Educational Equity: A Machine Learning Approach to Unravel Achievement Disparities in Georgia

Authors: Yichen Ma, Dima Nazzal

Abstract: The COVID-19 pandemic has significantly exacerbated existing educational disparities in Georgia's K-12 system, particularly in terms of racial and ethnic achievement gaps. Utilizing machine learning methods, the study conducts a comprehensive analysis of student achievement rates across different demographics, regions, and subjects. The findings highlight a significant decline in proficiency in English and Math during the pandemic, with a noticeable contraction in score distribution and a greater impact on economically disadvantaged and Black students. Socio-economic status, as represented by the Directly Certified Percentage (the percentage of students eligible for free lunch), emerges as the most crucial factor, with additional insights drawn from faculty resources such as teacher salaries and expenditure on instruction. The study also identifies disparities in achievement rates between urban and rural settings, as well as variations across counties, underscoring the influence of geographical and socio-economic factors. The data suggests that targeted interventions and resource allocation, particularly in schools with higher percentages of economically disadvantaged students, are essential for mitigating educational disparities.

Submitted 25 January, 2024; originally announced February 2024.

arXiv:2401.11742 [pdf]

Knowledge Navigation: Inferring the Interlocking Map of Knowledge from Research Trajectories

Authors: Shibing Xiang, Xin Jiang, Bing Liu, Yurui Huang, Chaolin Tian, Yifang Ma

Abstract: "If I have seen further, it is by standing on the shoulders of giants." Isaac Newton's renowned statement hints that new knowledge builds upon existing foundations, which implies an interdependent relationship between pieces of knowledge that, though not yet uncovered, is implicit in the historical development of scientific systems over hundreds of years. By leveraging natural language processing techniques, this study introduces an innovative embedding scheme designed to infer the "knowledge interlocking map." This map, derived from the research trajectories of millions of scholars, reveals the intricate connections among knowledge. We validate that the inferred map effectively delineates disciplinary boundaries and captures the intricate relationships between diverse concepts. The utility of the interlocking map is showcased through multiple applications. Firstly, we demonstrate multi-step analogy inferences within the knowledge space and the functional connectivity between concepts in different disciplines. Secondly, we trace the evolution of knowledge across domains, observing trends such as shifts from "Theoretical" to "Applied" or "Chemistry" to "Biomedical" along predefined functional directions. Lastly, by analyzing the high-dimensional knowledge network structure, we find that knowledge is connected through shorter global pathways, and that interdisciplinary knowledge plays a critical role in the accessibility of the global knowledge network.
Our framework offers a novel approach to mining knowledge inheritance pathways in extensive scientific literature, which is of great significance for understanding scientific development patterns, tailoring scientific learning trajectories, and accelerating scientific progress.

Submitted 27 January, 2024; v1 submitted 22 January, 2024; originally announced January 2024.

Comments: 28 pages, 9 figures, 5 tables

arXiv:2401.06325 [pdf, other]

Faster Sampling without Isoperimetry via Diffusion-based Monte Carlo

Authors: Xunpeng Huang, Difan Zou, Hanze Dong, Yian Ma, Tong Zhang

Abstract: To sample from a general target distribution $p_*\propto e^{-f_*}$ beyond the isoperimetric condition, Huang et al. (2023) proposed to perform sampling through reverse diffusion, giving rise to Diffusion-based Monte Carlo (DMC). Specifically, DMC follows the reverse SDE of a diffusion process that transforms the target distribution to the standard Gaussian, utilizing a non-parametric score estimation. However, the original DMC algorithm encountered high gradient complexity, resulting in an exponential dependency on the error tolerance $ε$ of the obtained samples. In this paper, we demonstrate that the high complexity of DMC originates from its redundant design of score estimation, and propose a more efficient algorithm, called RS-DMC, based on a novel recursive score estimation method. In particular, we first divide the entire diffusion process into multiple segments and then formulate the score estimation step (at any time step) as a series of interconnected mean estimation and sampling subproblems accordingly, which are correlated in a recursive manner. Importantly, we show that with a proper design of the segment decomposition, all sampling subproblems will only need to tackle a strongly log-concave distribution, which can be very efficient to solve using the Langevin-based samplers with a provably rapid convergence rate. As a result, we prove that the gradient complexity of RS-DMC only has a quasi-polynomial dependency on $ε$, which significantly improves on the exponential gradient complexity in Huang et al. (2023). Furthermore, under commonly used dissipative conditions, our algorithm is provably much faster than the popular Langevin-based algorithms. Our algorithm design and theoretical framework illuminate a novel direction for addressing sampling problems, which could be of broader applicability in the community.

Submitted 11 January, 2024; originally announced January 2024.

Comments: 54 pages

arXiv:2312.02199 [pdf, other]

USat: A Unified Self-Supervised Encoder for Multi-Sensor Satellite Imagery

Authors: Jeremy Irvin, Lucas Tao, Joanne Zhou, Yuntao Ma, Langston Nashold, Benjamin Liu, Andrew Y. Ng

Abstract: Large, self-supervised vision models have led to substantial advancements for automatically interpreting natural images. Recent works have begun tailoring these methods to remote sensing data, which has rich structure with multi-sensor, multi-spectral, and temporal information providing massive amounts of self-labeled data that can be used for self-supervised pre-training. In this work, we develop a new encoder architecture called USat that can input multi-spectral data from multiple sensors for self-supervised pre-training. USat is a vision transformer with modified patch projection layers and positional encodings to model spectral bands with varying spatial scales from multiple sensors. We integrate USat into a Masked Autoencoder (MAE) self-supervised pre-training procedure and find that a pre-trained USat outperforms state-of-the-art self-supervised MAE models trained on remote sensing data on multiple remote sensing benchmark datasets (up to 8%) and leads to improvements in low data regimes (up to 7%). Code and pre-trained weights are available at https://github.com/stanfordmlgroup/USat .

Submitted 2 December, 2023; originally announced December 2023.

arXiv:2312.01046 [pdf, other]

Bagged Regularized $k$-Distances for Anomaly Detection

Authors: Yuchao Cai, Yuheng Ma, Hanfang Yang, Hanyuan Hang

Abstract: We consider the paradigm of unsupervised anomaly detection, which involves the identification of anomalies within a dataset in the absence of labeled examples. Though distance-based methods are top-performing for unsupervised anomaly detection, they suffer heavily from the sensitivity to the choice of the number of the nearest neighbors. In this paper, we propose a new distance-based algorithm called bagged regularized $k$-distances for anomaly detection (BRDAD), which converts the unsupervised anomaly detection problem into a convex optimization problem. Our BRDAD algorithm selects the weights by minimizing the surrogate risk, i.e., the finite sample bound of the empirical risk of the bagged weighted $k$-distances for density estimation (BWDDE). This approach enables us to successfully address the sensitivity challenge of the hyperparameter choice in distance-based algorithms. Moreover, when dealing with large-scale datasets, the efficiency issues can be addressed by the incorporated bagging technique in our BRDAD algorithm. On the theoretical side, we establish fast convergence rates of the AUC regret of our algorithm and demonstrate that the bagging technique significantly reduces the computational complexity. On the practical side, we conduct numerical experiments on anomaly detection benchmarks to illustrate the insensitivity of parameter selection of our algorithm compared with other state-of-the-art distance-based methods. Moreover, promising improvements are brought by applying the bagging technique in our algorithm on real-world datasets.

Submitted 13 February, 2024; v1 submitted 2 December, 2023; originally announced December 2023.

arXiv:2311.11369 [pdf, other]

Optimal Locally Private Nonparametric Classification with Public Data

Authors: Yuheng Ma, Hanfang Yang

Abstract: In this work, we investigate the problem of public data assisted non-interactive Local Differentially Private (LDP) learning with a focus on non-parametric classification. Under the posterior drift assumption, we derive, for the first time, the mini-max optimal convergence rate with LDP constraint. Then, we present a novel approach, the locally differentially private classification tree, which attains the mini-max optimal convergence rate. Furthermore, we design a data-driven pruning procedure that avoids parameter tuning and provides a fast converging estimator. Comprehensive experiments conducted on synthetic and real data sets show the superior performance of our proposed methods. Both our theoretical and experimental findings demonstrate the effectiveness of public data compared to private data, which leads to practical suggestions for prioritizing non-private data collection.

Submitted 2 June, 2024; v1 submitted 19 November, 2023; originally announced November 2023.

arXiv:2310.20102 [pdf, ps, other]

Sample-Conditioned Hypothesis Stability Sharpens Information-Theoretic Generalization Bounds

Authors: Ziqiao Wang, Yongyi Mao

Abstract: We present new information-theoretic generalization guarantees through a novel construction of the "neighboring-hypothesis" matrix and a new family of stability notions termed sample-conditioned hypothesis (SCH) stability. Our approach yields sharper bounds that improve upon previous information-theoretic bounds in various learning scenarios. Notably, these bounds address the limitations of existing information-theoretic bounds in the context of stochastic convex optimization (SCO) problems, as explored in the recent work by Haghifam et al. (2023).

Submitted 30 October, 2023; originally announced October 2023.

Comments: Accepted at NeurIPS 2023

arXiv:2310.18919 [pdf, other]

Posterior Sampling with Delayed Feedback for Reinforcement Learning with Linear Function Approximation

Authors: Nikki Li**g Kuang, Ming Yin, Mengdi Wang, Yu-Xiang Wang, Yi-An Ma

Abstract: Recent studies in reinforcement learning (RL) have made significant progress by leveraging function approximation to alleviate the sample complexity hurdle for better performance. Despite the success, existing provably efficient algorithms typically rely on the accessibility of immediate feedback upon taking actions. The failure to account for the impact of delay in observations can significantly degrade the performance of real-world systems due to the regret blow-up. In this work, we tackle the challenge of delayed feedback in RL with linear function approximation by employing posterior sampling, which has been shown to empirically outperform the popular UCB algorithms in a wide range of regimes. We first introduce Delayed-PSVI, an optimistic value-based algorithm that effectively explores the value function space via noise perturbation with posterior sampling. We provide the first analysis for posterior sampling algorithms with delayed feedback in RL and show our algorithm achieves $\widetilde{O}(\sqrt{d^3H^3 T} + d^2H^2 E[τ])$ worst-case regret in the presence of unknown stochastic delays. Here $E[τ]$ is the expected delay. To further improve its computational efficiency and to expand its applicability in high-dimensional RL problems, we incorporate a gradient-based approximate sampling scheme via Langevin dynamics for Delayed-LPSVI, which maintains the same order-optimal regret guarantee with $\widetilde{O}(dHK)$ computational cost. Empirical evaluations are performed to demonstrate the statistical and computational efficacy of our algorithms.

Submitted 3 November, 2023; v1 submitted 29 October, 2023; originally announced October 2023.

arXiv:2310.14661 [pdf, other]

Tractable MCMC for Private Learning with Pure and Gaussian Differential Privacy

Authors: Yingyu Lin, Yi-An Ma, Yu-Xiang Wang, Rachel Redberg, Zhiqi Bu

Abstract: Posterior sampling, i.e., exponential mechanism to sample from the posterior distribution, provides $\varepsilon$-pure differential privacy (DP) guarantees and does not suffer from potentially unbounded privacy breach introduced by $(\varepsilon,δ)$-approximate DP. In practice, however, one needs to apply approximate sampling methods such as Markov chain Monte Carlo (MCMC), thus re-introducing the unappealing $δ$-approximation error into the privacy guarantees. To bridge this gap, we propose the Approximate SAample Perturbation (abbr. ASAP) algorithm which perturbs an MCMC sample with noise proportional to its Wasserstein-infinity ($W_\infty$) distance from a reference distribution that satisfies pure DP or pure Gaussian DP (i.e., $δ=0$). We then leverage a Metropolis-Hastings algorithm to generate the sample and prove that the algorithm converges in $W_\infty$ distance. We show that by combining our new techniques with a localization step, we obtain the first nearly linear-time algorithm that achieves the optimal rates in the DP-ERM problem with strongly convex and smooth losses.

Submitted 1 May, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

arXiv:2310.10048 [pdf, other]

Evaluation of transplant benefits with the U.S. Scientific Registry of Transplant Recipients by semiparametric regression of mean residual life

Authors: Ge Zhao, Yanyuan Ma, Huazhen Lin, Yi Li

Abstract: Kidney transplantation is the most effective renal replacement therapy for end stage renal disease patients. With the severe shortage of kidney supplies and for the clinical effectiveness of transplantation, a patient's life expectancy post transplantation is used to prioritize patients for transplantation; however, severe comorbidity conditions and old age are the most dominant factors that negatively impact post-transplantation life expectancy, effectively precluding sick or old patients from receiving transplants. It would be crucial to design objective measures to quantify the transplantation benefit by comparing the mean residual life with and without a transplant, after adjusting for comorbidity and demographic conditions. To address this urgent need, we propose a new class of semiparametric covariate-dependent mean residual life models. Our method estimates covariate effects semiparametrically efficiently and the mean residual life function nonparametrically, enabling us to predict the residual life increment potential for any given patient. Our method potentially leads to a fairer system that prioritizes patients who would have the largest residual life gains. Our analysis of the kidney transplant data from the U.S. Scientific Registry of Transplant Recipients also suggests that a single index of covariates summarizes well the impacts of multiple covariates, which may facilitate interpretations of each covariate's effect. Our subgroup analysis further disclosed inequalities in survival gains across groups defined by race, gender and insurance type (reflecting socioeconomic status).

Submitted 17 October, 2023; v1 submitted 16 October, 2023; originally announced October 2023.

Comments: 68 pages, 13 figures. arXiv admin note: text overlap with arXiv:2011.04067

arXiv:2310.06312 [pdf, other]

Discovering Mixtures of Structural Causal Models from Time Series Data

Authors: Sumanth Varambally, Yi-An Ma, Rose Yu

Abstract: Discovering causal relationships from time series data is significant in fields such as finance, climate science, and neuroscience. However, contemporary techniques rely on the simplifying assumption that data originates from the same causal model, while in practice, data is heterogeneous and can stem from different causal models. In this work, we relax this assumption and perform causal discovery from time series data originating from a mixture of causal models. We propose a general variational inference-based framework called MCD to infer the underlying causal models as well as the mixing probability of each sample. Our approach employs an end-to-end training process that maximizes an evidence lower bound for the data likelihood. We present two variants: MCD-Linear for linear relationships and independent noise, and MCD-Nonlinear for nonlinear causal relationships and history-dependent noise. We demonstrate that our method surpasses state-of-the-art benchmarks in causal discovery tasks through extensive experimentation on synthetic and real-world datasets, particularly when the data emanates from diverse underlying causal graphs. Theoretically, we prove the identifiability of such a model under some mild assumptions.

Submitted 23 June, 2024; v1 submitted 10 October, 2023; originally announced October 2023.

arXiv:2309.08543 [pdf, other]

Fisher's combined probability test for cross-sectional independence in panel data models with serial correlation

Authors: Hongfei Wang, Binghui Liu, Long Feng, Yanyuan Ma

Abstract: Testing cross-sectional independence in panel data models is of fundamental importance in econometric analysis with high-dimensional panels. Recently, econometricians began to turn their attention to the problem in the presence of serial dependence. The existing procedure for testing cross-sectional independence with serial correlation is based on the sum of the sample cross-sectional correlations, which generally performs well when the alternative has dense cross-sectional correlations, but suffers from low power against sparse alternatives. To deal with sparse alternatives, we propose a test based on the maximum of the squared sample cross-sectional correlations. Furthermore, we propose a combined test to combine the p-values of the max based and sum based tests, which performs well under both dense and sparse alternatives. The combined test relies on the asymptotic independence of the max based and sum based test statistics, which we show rigorously. We show that the proposed max based and combined tests have attractive theoretical properties and demonstrate the superior performance via extensive simulation results. We apply the two new tests to analyze the weekly returns on the securities in the S&P 500 index under the Fama-French three-factor model, and confirm the usefulness of the proposed combined test in detecting cross-sectional independence.

Submitted 15 September, 2023; originally announced September 2023.
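
Illustrative note on the entry above: the title refers to Fisher's combined probability test, which merges independent p-values through the statistic $-2\sum_i \log p_i$; under the null hypothesis this statistic follows a chi-square distribution with $2k$ degrees of freedom for $k$ p-values. The sketch below shows only this textbook combination step, using the closed-form chi-square survival function for even degrees of freedom. The function name `fisher_combine` and the example p-values are assumptions made for illustration; they are not taken from the paper, and the paper's max based and sum based statistics are not reproduced here.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method (textbook form).

    Returns (statistic, combined_p). The name and interface are
    hypothetical; this is not the paper's implementation.
    """
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must be non-empty and lie in (0, 1]")

    k = len(p_values)
    # Fisher statistic: -2 * sum(log p_i), chi-square with 2k degrees
    # of freedom when the p-values are independent and uniform under H0.
    stat = -2.0 * sum(math.log(p) for p in p_values)

    # Survival function of chi-square with even df = 2k:
    #   P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = stat / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return stat, math.exp(-half) * total


if __name__ == "__main__":
    # Hypothetical p-values standing in for a max based and a sum based test.
    stat, p_combined = fisher_combine([0.04, 0.20])
    print(f"statistic = {stat:.4f}, combined p-value = {p_combined:.4f}")
```

The combination is only valid when the p-values are independent, which is why the abstract stresses the asymptotic independence of the two test statistics.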

arXiv:2307.14642 [pdf, ps, other]

Linear Convergence of Black-Box Variational Inference: Should We Stick the Landing?

Authors: Kyurae Kim, Yian Ma, Jacob R. Gardner

Abstract: We prove that black-box variational inference (BBVI) with control variates, particularly the sticking-the-landing (STL) estimator, converges at a geometric (traditionally called "linear") rate under perfect variational family specification. In particular, we prove a quadratic bound on the gradient variance of the STL estimator, one which encompasses misspecified variational families. Combined with previous works on the quadratic variance condition, this directly implies convergence of BBVI with the use of projected stochastic gradient descent. For the projection operator, we consider a domain with triangular scale matrices, onto which the projection is computable in $Θ(d)$ time, where $d$ is the dimensionality of the target posterior. We also improve existing analysis on the regular closed-form entropy gradient estimators, which enables comparison against the STL estimator, providing explicit non-asymptotic complexity guarantees for both.

Submitted 18 June, 2024; v1 submitted 27 July, 2023; originally announced July 2023.

Comments: Accepted to AISTATS'24; v5: fixed missing expectations in iteration complexity statements; v6: changed to an indexing-friendly bibliography style

arXiv:2307.13381 [pdf, other]

Scaff-PD: Communication Efficient Fair and Robust Federated Learning

Authors: Yaodong Yu, Sai Praneeth Karimireddy, Yi Ma, Michael I. Jordan

Abstract: We present Scaff-PD, a fast and communication-efficient algorithm for distributionally robust federated learning. Our approach improves fairness by optimizing a family of distributionally robust objectives tailored to heterogeneous clients. We leverage the special structure of these objectives, and design an accelerated primal dual (APD) algorithm which uses bias corrected local steps (as in Scaffold) to achieve significant gains in communication efficiency and convergence speed. We evaluate Scaff-PD on several benchmark datasets and demonstrate its effectiveness in improving fairness and robustness while maintaining competitive accuracy. Our results suggest that Scaff-PD is a promising approach for federated learning in resource-constrained and heterogeneous settings.

Submitted 25 July, 2023; originally announced July 2023.

MSC Class: 68W40; 68W15; 90C25; 90C06 ACM Class: G.1.6; F.2.1; E.4

arXiv:2307.04250 [pdf, ps, other]

Doubly Flexible Estimation under Label Shift

Authors: Seong-ho Lee, Yanyuan Ma, Jiwei Zhao

Abstract: In studies ranging from clinical medicine to policy research, complete data are usually available from a population $\mathscr{P}$, but the quantity of interest is often sought for a related but different population $\mathscr{Q}$ which only has partial data. In this paper, we consider the setting that both outcome $Y$ and covariate ${\bf X}$ are available from $\mathscr{P}$ whereas only ${\bf X}$ is available from $\mathscr{Q}$, under the so-called label shift assumption, i.e., the conditional distribution of ${\bf X}$ given $Y$ remains the same across the two populations. To estimate the parameter of interest in $\mathscr{Q}$ via leveraging the information from $\mathscr{P}$, the following three ingredients are essential: (a) the common conditional distribution of ${\bf X}$ given $Y$, (b) the regression model of $Y$ given ${\bf X}$ in $\mathscr{P}$, and (c) the density ratio of $Y$ between the two populations. We propose an estimation procedure that only needs standard nonparametric technique to approximate the conditional expectations with respect to (a), while by no means needs an estimate or model for (b) or (c); i.e., doubly flexible to the possible model misspecifications of both (b) and (c). This is conceptually different from the well-known doubly robust estimation in that double robustness allows at most one model to be misspecified whereas our proposal can allow both (b) and (c) to be misspecified. This is of particular interest in our setting because estimating (c) is difficult, if not impossible, by virtue of the absence of the $Y$-data in $\mathscr{Q}$. Furthermore, even though the estimation of (b) is sometimes off-the-shelf, it can face curse of dimensionality or computational challenges. We develop the large sample theory for the proposed estimator, and examine its finite-sample performance through simulation studies as well as an application to the MIMIC-III database.

Submitted 9 July, 2023; originally announced July 2023.

arXiv:2307.02037 [pdf, other]

Reverse Diffusion Monte Carlo

Authors: Xunpeng Huang, Hanze Dong, Yifan Hao, Yi-An Ma, Tong Zhang

Abstract: We propose a Monte Carlo sampler from the reverse diffusion process. Unlike the practice of diffusion models, where the intermediary updates -- the score functions -- are learned with a neural network, we transform the score matching problem into a mean estimation one. By estimating the means of the regularized posterior distributions, we derive a novel Monte Carlo sampling algorithm called reverse diffusion Monte Carlo (rdMC), which is distinct from the Markov chain Monte Carlo (MCMC) methods. We determine the sample size from the error tolerance and the properties of the posterior distribution to yield an algorithm that can approximately sample the target distribution with any desired accuracy. Additionally, we demonstrate and prove under suitable conditions that sampling with rdMC can be significantly faster than that with MCMC. For multi-modal target distributions such as those in Gaussian mixture models, rdMC greatly improves over the Langevin-style MCMC sampling methods both theoretically and in practice. The proposed rdMC method offers a new perspective and solution beyond classical MCMC algorithms for the challenging complex distributions.

Submitted 13 March, 2024; v1 submitted 5 July, 2023; originally announced July 2023.

Comments: 44 pages, 16 figures, ICLR 2024

arXiv:2306.08803 [pdf, other]

Langevin Thompson Sampling with Logarithmic Communication: Bandits and Reinforcement Learning

Authors: Amin Karbasi, Nikki Li**g Kuang, Yi-An Ma, Siddharth Mitra

Abstract: Thompson sampling (TS) is widely used in sequential decision making due to its ease of use and appealing empirical performance. However, many existing analytical and empirical results for TS rely on restrictive assumptions on reward distributions, such as belonging to conjugate families, which limits their applicability in realistic scenarios. Moreover, sequential decision making problems are often carried out in a batched manner, either due to the inherent nature of the problem or to serve the purpose of reducing communication and computation costs. In this work, we jointly study these problems in two popular settings, namely, stochastic multi-armed bandits (MABs) and infinite-horizon reinforcement learning (RL), where TS is used to learn the unknown reward distributions and transition dynamics, respectively. We propose batched $\textit{Langevin Thompson Sampling}$ algorithms that leverage MCMC methods to sample from approximate posteriors with only logarithmic communication costs in terms of batches. Our algorithms are computationally efficient and maintain the same order-optimal regret guarantees of $\mathcal{O}(\log T)$ for stochastic MABs, and $\mathcal{O}(\sqrt{T})$ for RL. We complement our theoretical findings with experimental results.

Submitted 14 June, 2023; originally announced June 2023.

Comments: ICML 2023

ACM Class: G.3; I.2.0

arXiv:2306.07549 [pdf, other]

Fixed-Budget Best-Arm Identification with Heterogeneous Reward Variances

Authors: Anusha Lalitha, Kousha Kalantari, Yifei Ma, Anoop Deoras, Branislav Kveton

Abstract: We study the problem of best-arm identification (BAI) in the fixed-budget setting with heterogeneous reward variances. We propose two variance-adaptive BAI algorithms for this setting: SHVar for known reward variances and SHAdaVar for unknown reward variances. Our algorithms rely on non-uniform budget allocations among the arms where the arms with higher reward variances are pulled more often than those with lower variances. The main algorithmic novelty is in the design of SHAdaVar, which allocates budget greedily based on overestimating the unknown reward variances. We bound probabilities of misidentifying the best arms in both SHVar and SHAdaVar. Our analyses rely on novel lower bounds on the number of pulls of an arm that do not require closed-form solutions to the budget allocation problem. Since one of our budget allocation problems is analogous to the optimal experiment design with unknown variances, we believe that our results are of broad interest. Our experiments validate our theory, and show that SHVar and SHAdaVar outperform algorithms from prior works with analytical guarantees.

Submitted 13 June, 2023; originally announced June 2023.

arXiv:2306.02601 [pdf, other]

Aiming towards the minimizers: fast convergence of SGD for overparametrized problems

Authors: Chaoyue Liu, Dmitriy Drusvyatskiy, Mikhail Belkin, Damek Davis, Yi-An Ma

Abstract: Modern machine learning paradigms, such as deep learning, occur in or close to the interpolation regime, wherein the number of model parameters is much larger than the number of data samples. In this work, we propose a regularity condition within the interpolation regime which endows the stochastic gradient method with the same worst-case iteration complexity as the deterministic gradient method, while using only a single sampled gradient (or a minibatch) in each iteration. In contrast, all existing guarantees require the stochastic gradient method to take small steps, thereby resulting in a much slower linear rate of convergence. Finally, we demonstrate that our condition holds when training sufficiently wide feedforward neural networks with a linear output layer.

Submitted 5 June, 2023; originally announced June 2023.

arXiv:2305.15349 [pdf, other]

On the Convergence of Black-Box Variational Inference

Authors: Kyurae Kim, Jisu Oh, Kaiwen Wu, Yi-An Ma, Jacob R. Gardner

Abstract: We provide the first convergence guarantee for full black-box variational inference (BBVI), also known as Monte Carlo variational inference. While preliminary investigations worked on simplified versions of BBVI (e.g., bounded domain, bounded support, only optimizing for the scale, and such), our setup does not need any such algorithmic modifications. Our results hold for log-smooth posterior densities with and without strong log-concavity and the location-scale variational family. Also, our analysis reveals that certain algorithm design choices commonly employed in practice, particularly, nonlinear parameterizations of the scale of the variational approximation, can result in suboptimal convergence rates. Fortunately, running BBVI with proximal stochastic gradient descent fixes these limitations, and thus achieves the strongest known convergence rate guarantees. We evaluate this theoretical insight by comparing proximal SGD against other standard implementations of BBVI on large-scale Bayesian inference problems.

Submitted 10 January, 2024; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: Accepted to NeurIPS'23; previous title: "Black-Box Variational Inference Converges"

arXiv:2304.08974 [pdf, ps, other]

Doubly Robust Estimators with Weak Overlap

Authors: Yukun Ma, Pedro H. C. Sant'Anna, Yuya Sasaki, Takuya Ura

Abstract: In this paper, we derive a new class of doubly robust estimators for treatment effect estimands that is also robust against weak covariate overlap. Our proposed estimator relies on trimming observations with extreme propensity scores and uses a bias correction device for trimming bias. Our framework accommodates many research designs, such as unconfoundedness, local treatment effects, and difference-in-differences. Simulation exercises illustrate that our proposed tools indeed have attractive finite sample properties, which are aligned with our theoretical asymptotic results.

Submitted 22 April, 2023; v1 submitted 18 April, 2023; originally announced April 2023.

arXiv:2303.14900 [pdf, other]

Nonparametric approaches for analyzing carbon emission: from statistical and machine learning perspectives

Authors: Yiming Ma, Hang Liu, Shanyong Wang

Abstract: Linear regression models, especially the extended STIRPAT model, are routinely applied for analyzing carbon emissions data. However, since the relationship between carbon emissions and the influencing factors is complex, fitting a simple parametric model may not be an ideal solution. This paper investigated various nonparametric approaches in statistics and machine learning (ML) for modeling carbon emissions data, including kernel regression, random forest and neural network. We selected data from ten Chinese cities from 2005 to 2019 for modeling studies. We found that the neural network had the best performance in both fitting and prediction accuracy, which implies its capability of expressing the complex relationships between carbon emissions and the influencing factors. This study provides a new means for quantitative modeling of carbon emissions research that helps to understand how to characterize urban carbon emissions and to propose policy recommendations for "carbon reduction". In addition, we used the carbon emissions data of Wuhu city as an example to illustrate how to use this new approach.

Submitted 26 March, 2023; originally announced March 2023.

arXiv:2303.11054 [pdf, other]

Some novel aspects of quantile regression: local stationarity, random forests and optimal transportation

Authors: Manon Felix, Davide La Vecchia, Hang Liu, Yiming Ma

Abstract: This paper is written for a Festschrift in honour of Professor Marc Hallin and it proposes some developments on quantile regression. We connect our investigation to Marc's scientific production and we present some theoretical and methodological advances for quantile estimation in nonstandard settings. We split our contributions in two parts. The first part is about conditional quantile estimation for nonstationary time series. The second part is about conditional quantile estimation for the analysis of multivariate independent data in the presence of possibly large dimensional covariates. Monte Carlo studies illustrate numerically the performance of our methods and compare them to some extant techniques.

Submitted 9 September, 2023; v1 submitted 20 March, 2023; originally announced March 2023.

arXiv:2302.07533 [pdf, ps, other]

Optimal Subsampling Bootstrap for Massive Data

Authors: Yingying Ma, Chenlei Leng, Hansheng Wang

Abstract: The bootstrap is a widely used procedure for statistical inference because of its simplicity and attractive statistical properties. However, the vanilla version of bootstrap is no longer feasible computationally for many modern massive datasets due to the need to repeatedly resample the entire data. Therefore, several improvements to the bootstrap method have been made in recent years, which assess the quality of estimators by subsampling the full dataset before resampling the subsamples. Naturally, the performance of these modern subsampling methods is influenced by tuning parameters such as the size of subsamples, the number of subsamples, and the number of resamples per subsample. In this paper, we develop a novel hyperparameter selection methodology for selecting these tuning parameters. Formulated as an optimization problem to find the optimal value of some measure of accuracy of an estimator subject to computational cost, our framework provides closed-form solutions for the optimal hyperparameter values for subsampled bootstrap, subsampled double bootstrap and bag of little bootstraps, at no or little extra time cost. Using the mean square errors as a proxy of the accuracy measure, we apply our methodology to study, compare and improve the performance of these modern versions of bootstrap developed for massive data through a simulation study. The results are promising.

Submitted 15 February, 2023; originally announced February 2023.

arXiv:2302.02768 [pdf, other]

Network Autoregression for Incomplete Matrix-Valued Time Series

Authors: Xuening Zhu, Feifei Wang, Zeng Li, Yanyuan Ma

Abstract: We study the dynamics of matrix-valued time series with observed network structures by proposing a matrix network autoregression model with row and column networks of the subjects. We incorporate covariate information and a low rank intercept matrix. We allow incomplete observations in the matrices and the missing mechanism can be covariate dependent. To estimate the model, a two-step estimation procedure is proposed. The first step aims to estimate the network autoregression coefficients, and the second step aims to estimate the regression parameters, which are matrices themselves. Theoretically, we first separately establish the asymptotic properties of the autoregression coefficients and the error bounds of the regression parameters. Subsequently, a bias reduction procedure is proposed to reduce the asymptotic bias and the theoretical property of the debiased estimator is studied. Lastly, we illustrate the usefulness of the proposed method through a number of numerical studies and an analysis of a Yelp data set.

Submitted 6 February, 2023; originally announced February 2023.

arXiv:2302.02432 [pdf, other]

Tighter Information-Theoretic Generalization Bounds from Supersamples

Authors: Ziqiao Wang, Yongyi Mao

Abstract: In this work, we present a variety of novel information-theoretic generalization bounds for learning algorithms, from the supersample setting of Steinke & Zakynthinou (2020), the setting of the "conditional mutual information" framework. Our development exploits projecting the loss pair (obtained from a training instance and a testing instance) down to a single number and correlating loss values with a Rademacher sequence (and its shifted variants). The presented bounds include square-root bounds, fast-rate bounds (including those based on variance and sharpness), and bounds for interpolating algorithms. We show theoretically or empirically that these bounds are tighter than all information-theoretic bounds known to date on the same supersample setting.

Submitted 15 June, 2023; v1 submitted 5 February, 2023; originally announced February 2023.

Comments: Accepted to ICML 2023, fixed some typos in the camera-ready version

arXiv:2301.06297 [pdf, other]

Inference via robust optimal transportation: theory and methods

Authors: Yiming Ma, Hang Liu, Davide La Vecchia, Matthieu Lerasle

Abstract: Optimal transportation theory and the related $p$-Wasserstein distance ($W_p$, $p\geq 1$) are widely applied in statistics and machine learning. In spite of their popularity, inference based on these tools has some issues. For instance, it is sensitive to outliers and it may not even be defined when the underlying model has infinite moments. To cope with these problems, first we consider a robust version of the primal transportation problem and show that it defines the robust Wasserstein distance, $W^{(λ)}$, depending on a tuning parameter $λ > 0$. Second, we illustrate the link between $W_1$ and $W^{(λ)}$ and study its key measure theoretic aspects. Third, we derive some concentration inequalities for $W^{(λ)}$. Fourth, we use $W^{(λ)}$ to define minimum distance estimators, we provide their statistical guarantees and we illustrate how to apply the derived concentration inequalities for a data driven selection of $λ$. Fifth, we provide the dual form of the robust optimal transportation problem and we apply it to machine learning problems (generative adversarial networks and domain adaptation). Numerical exercises provide evidence of the benefits yielded by our novel methods.

Submitted 29 February, 2024; v1 submitted 16 January, 2023; originally announced January 2023.

arXiv:2212.06338 [pdf, other]

Minimax Optimal Estimation of Stability Under Distribution Shift

Authors: Hongseok Namkoong, Yuanzhe Ma, Peter W. Glynn

Abstract: The performance of decision policies and prediction models often deteriorates when applied to environments different from the ones seen during training. To ensure reliable operation, we analyze the stability of a system under distribution shift, which is defined as the smallest change in the underlying environment that causes the system's performance to deteriorate beyond a permissible threshold. In contrast to standard tail risk measures and distributionally robust losses that require the specification of a plausible magnitude of distribution shift, the stability measure is defined in terms of a more intuitive quantity: the level of acceptable performance degradation. We develop a minimax optimal estimator of stability and analyze its convergence rate, which exhibits a fundamental phase shift behavior. Our characterization of the minimax convergence rate shows that evaluating stability against large performance degradation incurs a statistical cost. Empirically, we demonstrate the practical utility of our stability framework by using it to compare system designs on problems where robustness to distribution shift is critical.

Submitted 24 June, 2024; v1 submitted 12 December, 2022; originally announced December 2022.

arXiv:2212.02107 [pdf, other]

Matrix-valued Network Autoregression Model with Latent Group Structure

Authors: Yimeng Ren, Xuening Zhu, Yanyuan Ma

Abstract: Matrix-valued time series data are frequently observed in a broad range of areas and have attracted great attention recently. In this work, we model network effects for high dimensional matrix-valued time series data in a matrix autoregression framework. To characterize the potential heterogeneity of the subjects and handle the high dimensionality simultaneously, we assume that each subject has a latent group label, which enables us to cluster the subjects into the corresponding row and column groups. We propose a group matrix network autoregression (GMNAR) model, which assumes that the subjects in the same group share the same set of model parameters. To estimate the model, we develop an iterative algorithm. Theoretically, we show that the group-wise parameters and group memberships can be consistently estimated when the group numbers are correctly or possibly over-specified. An information criterion for group number estimation is also provided to consistently select the group numbers. Lastly, we implement the method on a Yelp dataset to illustrate the usefulness of the method.

Submitted 5 December, 2022; originally announced December 2022.

arXiv:2211.13549 [pdf, ps, other]

Online Regularized Learning Algorithm for Functional Data

Authors: Yuan Mao, Zheng-Chu Guo

Abstract: In recent years, functional linear models have attracted growing attention in statistics and machine learning, with the aim of recovering the slope function or its functional predictor. This paper considers an online regularized learning algorithm for functional linear models in reproducing kernel Hilbert spaces. Convergence analyses of excess prediction error and estimation error are provided with polynomially decaying step-size and constant step-size, respectively. Fast convergence rates can be derived via a capacity dependent analysis. By introducing an explicit regularization term, we uplift the saturation boundary of unregularized online learning algorithms when the step-size decays polynomially, and establish fast convergence rates of estimation error without capacity assumption. However, it remains an open problem to obtain capacity independent convergence rates for the estimation error of the unregularized online learning algorithm with decaying step-size. It also shows that convergence rates of both prediction error and estimation error with constant step-size are competitive with those in the literature.

Submitted 24 November, 2022; originally announced November 2022.

Comments: 32 pages

arXiv:2211.02964 [pdf, other]

Testing for high-dimensional white noise

Authors: Long Feng, Binghui Liu, Yanyuan Ma

Abstract: Testing for multi-dimensional white noise is an important subject in statistical inference. Such a test in the high-dimensional case becomes an open problem waiting to be solved, especially when the dimension of a time series is comparable to or even greater than the sample size. To detect an arbitrary form of departure from high-dimensional white noise, a few tests have been developed. Some of these tests are based on max-type statistics, while others are based on sum-type ones. Despite the progress, an urgent issue remains to be resolved: none of these tests is robust to the sparsity of the serial correlation structure. Motivated by this, we propose a Fisher's combination test by combining the max-type and the sum-type statistics, based on the established asymptotic independence between them. This combination test can achieve robustness to the sparsity of the serial correlation structure, and combine the advantages of the two types of tests. We demonstrate the advantages of the proposed test over some existing tests through extensive numerical results and an empirical analysis.

Submitted 5 November, 2022; originally announced November 2022.

Comments: 84 pages

MSC Class: 62H15
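
Note on the entry above: the abstract combines a max-type and a sum-type test through Fisher's combination. The generic combination step can be sketched as follows. This is a minimal illustration of Fisher's method for independent p-values, not the authors' test: the function name `fisher_combine` and the input p-values are hypothetical, and the paper's max-type and sum-type statistics are not reproduced here. Under independence, the statistic -2 * sum(log p_i) follows a chi-square distribution with 2k degrees of freedom, whose survival function has a closed form for even degrees of freedom.

```python
import math


def fisher_combine(p_values):
    """Combine k independent p-values with Fisher's method.

    Returns (statistic, combined_p). Hypothetical helper for illustration.
    """
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    # Fisher statistic: -2 * sum of log p-values.
    stat = -2.0 * sum(math.log(p) for p in p_values)
    # Chi-square survival function with 2k degrees of freedom:
    # exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    half = stat / 2.0
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return stat, math.exp(-half) * total


# Example with two assumed p-values (e.g. one from each type of test).
stat, p_comb = fisher_combine([0.05, 0.05])
```

With a single p-value the combined value equals the input, which is a quick sanity check on the closed form.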

arXiv:2209.15261 [pdf, other]

Minimalistic Unsupervised Learning with the Sparse Manifold Transform

Authors: Yubei Chen, Zeyu Yun, Yi Ma, Bruno Olshausen, Yann LeCun

Abstract: We describe a minimalistic and interpretable method for unsupervised learning, without resorting to data augmentation, hyperparameter tuning, or other engineering designs, that achieves performance close to the SOTA SSL methods. Our approach leverages the sparse manifold transform, which unifies sparse coding, manifold learning, and slow feature analysis. With a one-layer deterministic sparse manifold transform, one can achieve 99.3% KNN top-1 accuracy on MNIST, 81.1% KNN top-1 accuracy on CIFAR-10 and 53.2% on CIFAR-100. With a simple gray-scale augmentation, the model gets 83.2% KNN top-1 accuracy on CIFAR-10 and 57% on CIFAR-100. These results significantly close the gap between simplistic "white-box" methods and the SOTA methods. Additionally, we provide visualization to explain how an unsupervised representation transform is formed. The proposed method is closely connected to latent-embedding self-supervised methods and can be treated as the simplest form of VICReg. Though there remains a small performance gap between our simple constructive model and SOTA methods, the evidence points to this as a promising direction for achieving a principled and white-box approach to unsupervised learning.

Submitted 27 April, 2023; v1 submitted 30 September, 2022; originally announced September 2022.

Comments: This paper is published at ICLR 2023

Journal ref: The Eleventh International Conference on Learning Representations (2023)

arXiv:2208.12427 [pdf, ps, other]

Coefficient-based Regularized Distribution Regression

Authors: Yuan Mao, Lei Shi, Zheng-Chu Guo

Abstract: In this paper, we consider the coefficient-based regularized distribution regression which aims to regress from probability measures to real-valued responses over a reproducing kernel Hilbert space (RKHS), where the regularization is put on the coefficients and kernels are assumed to be indefinite. The algorithm involves two stages of sampling: the first stage sample consists of distributions and the second stage sample is obtained from these distributions. Asymptotic behaviors of the algorithm in different regularity ranges of the regression function are comprehensively studied and learning rates are derived via integral operator techniques. We get the optimal rates under some mild conditions, which match the one-stage sampled minimax optimal rate. Compared with the kernel methods for distribution regression in the literature, the algorithm under consideration does not require the kernel to be symmetric and positive semi-definite and hence provides a simple paradigm for designing indefinite kernel methods, which enriches the theme of the distribution regression. To the best of our knowledge, this is the first result for distribution regression with indefinite kernels, and our algorithm can improve the saturation effect.

Submitted 25 August, 2022; originally announced August 2022.

arXiv:2207.11208 [pdf, other]

Statistical and Computational Trade-offs in Variational Inference: A Case Study in Inferential Model Selection

Authors: Kush Bhatia, Nikki Lijing Kuang, Yi-An Ma, Yixin Wang

Abstract: Variational inference has recently emerged as a popular alternative to the classical Markov chain Monte Carlo (MCMC) in large-scale Bayesian inference. The core idea is to trade statistical accuracy for computational efficiency. In this work, we study these statistical and computational trade-offs in variational inference via a case study in inferential model selection. Focusing on Gaussian inferential models (or variational approximating families) with diagonal plus low-rank precision matrices, we initiate a theoretical study of the trade-offs in two aspects, Bayesian posterior inference error and frequentist uncertainty quantification error. From the Bayesian posterior inference perspective, we characterize the error of the variational posterior relative to the exact posterior. We prove that, given a fixed computation budget, a lower-rank inferential model produces variational posteriors with a higher statistical approximation error, but a lower computational error; it reduces variance in stochastic optimization and, in turn, accelerates convergence. From the frequentist uncertainty quantification perspective, we consider the precision matrix of the variational posterior as an uncertainty estimate, which involves an additional statistical error originating from the sampling uncertainty of the data. As a consequence, for small datasets, the inferential model need not be full-rank to achieve optimal estimation error (even with unlimited computation budget).

Submitted 6 August, 2023; v1 submitted 22 July, 2022; originally announced July 2022.

Comments: 57 pages, 8 figures

arXiv:2207.06343 [pdf, other]

TCT: Convexifying Federated Learning using Bootstrapped Neural Tangent Kernels

Authors: Yaodong Yu, Alexander Wei, Sai Praneeth Karimireddy, Yi Ma, Michael I. Jordan

Abstract: State-of-the-art federated learning methods can perform far worse than their centralized counterparts when clients have dissimilar data distributions. For neural networks, even when centralized SGD easily finds a solution that is simultaneously performant for all clients, current federated optimization methods fail to converge to a comparable solution. We show that this performance disparity can largely be attributed to optimization challenges presented by nonconvexity. Specifically, we find that the early layers of the network do learn useful features, but the final layers fail to make use of them. That is, federated optimization applied to this non-convex problem distorts the learning of the final layers. Leveraging this observation, we propose a Train-Convexify-Train (TCT) procedure to sidestep this issue: first, learn features using off-the-shelf methods (e.g., FedAvg); then, optimize a convexified problem obtained from the network's empirical neural tangent kernel approximation. Our technique yields accuracy improvements of up to +36% on FMNIST and +37% on CIFAR10 when clients have dissimilar data.

Submitted 5 October, 2022; v1 submitted 13 July, 2022; originally announced July 2022.

Comments: Accepted at Neural Information Processing Systems (NeurIPS) 2022. V2 releases code

MSC Class: 68W40; 68W15; 90C25; 90C06 ACM Class: G.1.6; F.2.1; E.4

Showing 1–50 of 191 results for author: Mao, Y