Search | arXiv e-print repository

Markovian Gaussian Process: A Universal State-Space Representation for Stationary Temporal Gaussian Process

Authors: Weihan Li, Yule Wang, Chengrui Li, Anqi Wu

Abstract: Gaussian Processes (GPs) and Linear Dynamical Systems (LDSs) are essential time series and dynamic system modeling tools. GPs can handle complex, nonlinear dynamics but are computationally demanding, while LDSs offer efficient computation but lack the expressive power of GPs. To combine their benefits, we introduce a universal method that allows an LDS to mirror stationary temporal GPs. This state… ▽ More Gaussian Processes (GPs) and Linear Dynamical Systems (LDSs) are essential time series and dynamic system modeling tools. GPs can handle complex, nonlinear dynamics but are computationally demanding, while LDSs offer efficient computation but lack the expressive power of GPs. To combine their benefits, we introduce a universal method that allows an LDS to mirror stationary temporal GPs. This state-space representation, known as the Markovian Gaussian Process (Markovian GP), leverages the flexibility of kernel functions while maintaining efficient linear computation. Unlike existing GP-LDS conversion methods, which require separability for most multi-output kernels, our approach works universally for single- and multi-output stationary temporal kernels. We evaluate our method by computing covariance, performing regression tasks, and applying it to a neuroscience application, demonstrating that our method provides an accurate state-space representation for stationary temporal GPs. △ Less

Submitted 29 June, 2024; originally announced July 2024.

arXiv:2406.15523 [pdf, other]

Unifying Unsupervised Graph-Level Anomaly Detection and Out-of-Distribution Detection: A Benchmark

Authors: Yili Wang, Yixin Liu, Xu Shen, Chenyu Li, Kaize Ding, Rui Miao, Ying Wang, Shirui Pan, Xin Wang

Abstract: To build safe and reliable graph machine learning systems, unsupervised graph-level anomaly detection (GLAD) and unsupervised graph-level out-of-distribution (OOD) detection (GLOD) have received significant attention in recent years. Though those two lines of research indeed share the same objective, they have been studied independently in the community due to distinct evaluation setups, creating… ▽ More To build safe and reliable graph machine learning systems, unsupervised graph-level anomaly detection (GLAD) and unsupervised graph-level out-of-distribution (OOD) detection (GLOD) have received significant attention in recent years. Though those two lines of research indeed share the same objective, they have been studied independently in the community due to distinct evaluation setups, creating a gap that hinders the application and evaluation of methods from one to the other. To bridge the gap, in this work, we present a Unified Benchmark for unsupervised Graph-level OOD and anomaly Detection (our method), a comprehensive evaluation framework that unifies GLAD and GLOD under the concept of generalized graph-level OOD detection. Our benchmark encompasses 35 datasets spanning four practical anomaly and OOD detection scenarios, facilitating the comparison of 16 representative GLAD/GLOD methods. We conduct multi-dimensional analyses to explore the effectiveness, generalizability, robustness, and efficiency of existing methods, shedding light on their strengths and limitations. Furthermore, we provide an open-source codebase (https://github.com/UB-GOLD/UB-GOLD) of our method to foster reproducible research and outline potential directions for future investigations based on our insights. △ Less

Submitted 21 June, 2024; originally announced June 2024.

arXiv:2406.13036 [pdf, other]

Sharp detection of low-dimensional structure in probability measures via dimensional logarithmic Sobolev inequalities

Authors: Matthew T. C. Li, Tiangang Cui, Fengyi Li, Youssef Marzouk, Olivier Zahm

Abstract: Identifying low-dimensional structure in high-dimensional probability measures is an essential pre-processing step for efficient sampling. We introduce a method for identifying and approximating a target measure $π$ as a perturbation of a given reference measure $μ$ along a few significant directions of $\mathbb{R}^{d}$. The reference measure can be a Gaussian or a nonlinear transformation of a Ga… ▽ More Identifying low-dimensional structure in high-dimensional probability measures is an essential pre-processing step for efficient sampling. We introduce a method for identifying and approximating a target measure $π$ as a perturbation of a given reference measure $μ$ along a few significant directions of $\mathbb{R}^{d}$. The reference measure can be a Gaussian or a nonlinear transformation of a Gaussian, as commonly arising in generative modeling. Our method extends prior work on minimizing majorizations of the Kullback--Leibler divergence to identify optimal approximations within this class of measures. Our main contribution unveils a connection between the \emph{dimensional} logarithmic Sobolev inequality (LSI) and approximations with this ansatz. Specifically, when the target and reference are both Gaussian, we show that minimizing the dimensional LSI is equivalent to minimizing the KL divergence restricted to this ansatz. For general non-Gaussian measures, the dimensional LSI produces majorants that uniformly improve on previous majorants for gradient-based dimension reduction. We further demonstrate the applicability of this analysis to the squared Hellinger distance, where analogous reasoning shows that the dimensional Poincaré inequality offers improved bounds. △ Less

Submitted 21 June, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

arXiv:2405.20782 [pdf, other]

Universal Exact Compression of Differentially Private Mechanisms

Authors: Yanxiao Liu, Wei-Ning Chen, Ayfer Özgür, Cheuk Ting Li

Abstract: To reduce the communication cost of differential privacy mechanisms, we introduce a novel construction, called Poisson private representation (PPR), designed to compress and simulate any local randomizer while ensuring local differential privacy. Unlike previous simulation-based local differential privacy mechanisms, PPR exactly preserves the joint distribution of the data and the output of the or… ▽ More To reduce the communication cost of differential privacy mechanisms, we introduce a novel construction, called Poisson private representation (PPR), designed to compress and simulate any local randomizer while ensuring local differential privacy. Unlike previous simulation-based local differential privacy mechanisms, PPR exactly preserves the joint distribution of the data and the output of the original local randomizer. Hence, the PPR-compressed privacy mechanism retains all desirable statistical properties of the original privacy mechanism such as unbiasedness and Gaussianity. Moreover, PPR achieves a compression size within a logarithmic gap from the theoretical lower bound. Using the PPR, we give a new order-wise trade-off between communication, accuracy, central and local differential privacy for distributed mean estimation. Experiment results on distributed mean estimation show that PPR consistently gives a better trade-off between communication, accuracy and central differential privacy compared to the coordinate subsampled Gaussian mechanism, while also providing local differential privacy. △ Less

Submitted 28 May, 2024; originally announced May 2024.

Comments: 30 pages, 3 figures

arXiv:2405.16845 [pdf, other]

On Mesa-Optimization in Autoregressively Trained Transformers: Emergence and Capability

Authors: Chenyu Zheng, Wei Huang, Rongzhen Wang, Guoqiang Wu, Jun Zhu, Chongxuan Li

Abstract: Autoregressively trained transformers have brought a profound revolution to the world, especially with their in-context learning (ICL) ability to address downstream tasks. Recently, several studies suggest that transformers learn a mesa-optimizer during autoregressive (AR) pretraining to implement ICL. Namely, the forward pass of the trained transformer is equivalent to optimizing an inner objecti… ▽ More Autoregressively trained transformers have brought a profound revolution to the world, especially with their in-context learning (ICL) ability to address downstream tasks. Recently, several studies suggest that transformers learn a mesa-optimizer during autoregressive (AR) pretraining to implement ICL. Namely, the forward pass of the trained transformer is equivalent to optimizing an inner objective function in-context. However, whether the practical non-convex training dynamics will converge to the ideal mesa-optimizer is still unclear. Towards filling this gap, we investigate the non-convex dynamics of a one-layer linear causal self-attention model autoregressively trained by gradient flow, where the sequences are generated by an AR process $x_{t+1} = W x_t$. First, under a certain condition of data distribution, we prove that an autoregressively trained transformer learns $W$ by implementing one step of gradient descent to minimize an ordinary least squares (OLS) problem in-context. It then applies the learned $\widehat{W}$ for next-token prediction, thereby verifying the mesa-optimization hypothesis. Next, under the same data conditions, we explore the capability limitations of the obtained mesa-optimizer. We show that a stronger assumption related to the moments of data is the sufficient and necessary condition that the learned mesa-optimizer recovers the distribution. Besides, we conduct exploratory analyses beyond the first data condition and prove that generally, the trained transformer will not perform vanilla gradient descent for the OLS problem. Finally, our simulation results verify the theoretical results. △ Less

Submitted 27 May, 2024; originally announced May 2024.

Comments: 37pages

arXiv:2405.09485 [pdf, other]

Predicting Future Change-points in Time Series

Authors: Chak Fung Choi, Chunxue Li, Chun Yip Yau, Zifeng Zhao

Abstract: Change-point detection and estimation procedures have been widely developed in the literature. However, commonly used approaches in change-point analysis have mainly been focusing on detecting change-points within an entire time series (off-line methods), or quickest detection of change-points in sequentially observed data (on-line methods). Both classes of methods are concerned with change-points… ▽ More Change-point detection and estimation procedures have been widely developed in the literature. However, commonly used approaches in change-point analysis have mainly been focusing on detecting change-points within an entire time series (off-line methods), or quickest detection of change-points in sequentially observed data (on-line methods). Both classes of methods are concerned with change-points that have already occurred. The arguably more important question of when future change-points may occur, remains largely unexplored. In this paper, we develop a novel statistical model that describes the mechanism of change-point occurrence. Specifically, the model assumes a latent process in the form of a random walk driven by non-negative innovations, and an observed process which behaves differently when the latent process belongs to different regimes. By construction, an occurrence of a change-point is equivalent to hitting a regime threshold by the latent process. Therefore, by predicting when the latent process will hit the next regime threshold, future change-points can be forecasted. The probabilistic properties of the model such as stationarity and ergodicity are established. A composite likelihood-based approach is developed for parameter estimation and model selection. Moreover, we construct the predictor and prediction interval for future change points based on the estimated model. △ Less

Submitted 23 May, 2024; v1 submitted 15 May, 2024; originally announced May 2024.

Comments: 37 pages, 4 figures

MSC Class: 62M10

arXiv:2405.04566 [pdf, other]

Fast Decentralized Gradient Tracking for Federated Minimax Optimization with Local Updates

Authors: Chris Junchi Li

Abstract: Federated learning (FL) for minimax optimization has emerged as a powerful paradigm for training models across distributed nodes/clients while preserving data privacy and model robustness on data heterogeneity. In this work, we delve into the decentralized implementation of federated minimax optimization by proposing \texttt{K-GT-Minimax}, a novel decentralized minimax optimization algorithm that… ▽ More Federated learning (FL) for minimax optimization has emerged as a powerful paradigm for training models across distributed nodes/clients while preserving data privacy and model robustness on data heterogeneity. In this work, we delve into the decentralized implementation of federated minimax optimization by proposing \texttt{K-GT-Minimax}, a novel decentralized minimax optimization algorithm that combines local updates and gradient tracking techniques. Our analysis showcases the algorithm's communication efficiency and convergence rate for nonconvex-strongly-concave (NC-SC) minimax optimization, demonstrating a superior convergence rate compared to existing methods. \texttt{K-GT-Minimax}'s ability to handle data heterogeneity and ensure robustness underscores its significance in advancing federated learning research and applications. △ Less

Submitted 7 May, 2024; originally announced May 2024.

arXiv:2405.01275 [pdf, other]

Variable Selection in Ultra-high Dimensional Feature Space for the Cox Model with Interval-Censored Data

Authors: Daewoo Pak, Jianrui Zhang, Di Wu, Haolei Weng, Chenxi Li

Abstract: We develop a set of variable selection methods for the Cox model under interval censoring, in the ultra-high dimensional setting where the dimensionality can grow exponentially with the sample size. The methods select covariates via a penalized nonparametric maximum likelihood estimation with some popular penalty functions, including lasso, adaptive lasso, SCAD, and MCP. We prove that our penalize… ▽ More We develop a set of variable selection methods for the Cox model under interval censoring, in the ultra-high dimensional setting where the dimensionality can grow exponentially with the sample size. The methods select covariates via a penalized nonparametric maximum likelihood estimation with some popular penalty functions, including lasso, adaptive lasso, SCAD, and MCP. We prove that our penalized variable selection methods with folded concave penalties or adaptive lasso penalty enjoy the oracle property. Extensive numerical experiments show that the proposed methods have satisfactory empirical performance under various scenarios. The utility of the methods is illustrated through an application to a genome-wide association study of age to early childhood caries. △ Less

Submitted 2 May, 2024; originally announced May 2024.

arXiv:2405.00914 [pdf, other]

Accelerated Fully First-Order Methods for Bilevel and Minimax Optimization

Authors: Chris Junchi Li

Abstract: This paper presents a new algorithm member for accelerating first-order methods for bilevel optimization, namely the \emph{(Perturbed) Restarted Accelerated Fully First-order methods for Bilevel Approximation}, abbreviated as \texttt{(P)RAF${}^2$BA}. The algorithm leverages \emph{fully} first-order oracles and seeks approximate stationary points in nonconvex-strongly-convex bilevel optimization, e… ▽ More This paper presents a new algorithm member for accelerating first-order methods for bilevel optimization, namely the \emph{(Perturbed) Restarted Accelerated Fully First-order methods for Bilevel Approximation}, abbreviated as \texttt{(P)RAF${}^2$BA}. The algorithm leverages \emph{fully} first-order oracles and seeks approximate stationary points in nonconvex-strongly-convex bilevel optimization, enhancing oracle complexity for efficient optimization. Theoretical guarantees for finding approximate first-order stationary points and second-order stationary points at the state-of-the-art query complexities are established, showcasing their effectiveness in solving complex optimization tasks. Empirical studies for real-world problems are provided to further validate the outperformance of our proposed algorithms. The significance of \texttt{(P)RAF${}^2$BA} in optimizing nonconvex-strongly-convex bilevel optimization problems is underscored by its state-of-the-art convergence rates and computational efficiency. △ Less

Submitted 3 May, 2024; v1 submitted 1 May, 2024; originally announced May 2024.

Comments: Minor typographical updates. arXiv admin note: text overlap with arXiv:2307.00126

arXiv:2405.00742 [pdf, other]

Federated Graph Learning for EV Charging Demand Forecasting with Personalization Against Cyberattacks

Authors: Yi Li, Renyou Xie, Chaojie Li, Yi Wang, Zhaoyang Dong

Abstract: Mitigating cybersecurity risk in electric vehicle (EV) charging demand forecasting plays a crucial role in the safe operation of collective EV chargings, the stability of the power grid, and the cost-effective infrastructure expansion. However, existing methods either suffer from the data privacy issue and the susceptibility to cyberattacks or fail to consider the spatial correlation among differe… ▽ More Mitigating cybersecurity risk in electric vehicle (EV) charging demand forecasting plays a crucial role in the safe operation of collective EV chargings, the stability of the power grid, and the cost-effective infrastructure expansion. However, existing methods either suffer from the data privacy issue and the susceptibility to cyberattacks or fail to consider the spatial correlation among different stations. To address these challenges, a federated graph learning approach involving multiple charging stations is proposed to collaboratively train a more generalized deep learning model for demand forecasting while capturing spatial correlations among various stations and enhancing robustness against potential attacks. Firstly, for better model performance, a Graph Neural Network (GNN) model is leveraged to characterize the geographic correlation among different charging stations in a federated manner. Secondly, to ensure robustness and deal with the data heterogeneity in a federated setting, a message passing that utilizes a global attention mechanism to aggregate personalized models for each client is proposed. Thirdly, by concerning cyberattacks, a special credit-based function is designed to mitigate potential threats from malicious clients or unwanted attacks. Extensive experiments on a public EV charging dataset are conducted using various deep learning techniques and federated learning methods to demonstrate the prediction accuracy and robustness of the proposed approach. △ Less

Submitted 30 April, 2024; originally announced May 2024.

Comments: 11 pages,4 figures

arXiv:2402.17287 [pdf, other]

An Interpretable Evaluation of Entropy-based Novelty of Generative Models

Authors: **gwei Zhang, Cheuk Ting Li, Farzan Farnia

Abstract: The massive developments of generative model frameworks require principled methods for the evaluation of a model's novelty compared to a reference dataset. While the literature has extensively studied the evaluation of the quality, diversity, and generalizability of generative models, the assessment of a model's novelty compared to a reference model has not been adequately explored in the machine… ▽ More The massive developments of generative model frameworks require principled methods for the evaluation of a model's novelty compared to a reference dataset. While the literature has extensively studied the evaluation of the quality, diversity, and generalizability of generative models, the assessment of a model's novelty compared to a reference model has not been adequately explored in the machine learning community. In this work, we focus on the novelty assessment for multi-modal distributions and attempt to address the following differential clustering task: Given samples of a generative model $P_\mathcal{G}$ and a reference model $P_\mathrm{ref}$, how can we discover the sample types expressed by $P_\mathcal{G}$ more frequently than in $P_\mathrm{ref}$? We introduce a spectral approach to the differential clustering task and propose the Kernel-based Entropic Novelty (KEN) score to quantify the mode-based novelty of $P_\mathcal{G}$ with respect to $P_\mathrm{ref}$. We analyze the KEN score for mixture distributions with well-separable components and develop a kernel-based method to compute the KEN score from empirical data. We support the KEN framework by presenting numerical results on synthetic and real image datasets, indicating the framework's effectiveness in detecting novel modes and comparing generative models. The paper's code is available at: www.github.com/buyeah1109/KEN △ Less

Submitted 13 June, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

arXiv:2402.14402 [pdf, other]

Global Safe Sequential Learning via Efficient Knowledge Transfer

Authors: Cen-You Li, Olaf Duennbier, Marc Toussaint, Barbara Rakitsch, Christoph Zimmer

Abstract: Sequential learning methods such as active learning and Bayesian optimization select the most informative data to learn about a task. In many medical or engineering applications, the data selection is constrained by a priori unknown safety conditions. A promissing line of safe learning methods utilize Gaussian processes (GPs) to model the safety probability and perform data selection in areas with… ▽ More Sequential learning methods such as active learning and Bayesian optimization select the most informative data to learn about a task. In many medical or engineering applications, the data selection is constrained by a priori unknown safety conditions. A promissing line of safe learning methods utilize Gaussian processes (GPs) to model the safety probability and perform data selection in areas with high safety confidence. However, accurate safety modeling requires prior knowledge or consumes data. In addition, the safety confidence centers around the given observations which leads to local exploration. As transferable source knowledge is often available in safety critical experiments, we propose to consider transfer safe sequential learning to accelerate the learning of safety. We further consider a pre-computation of source components to reduce the additional computational load that is introduced by incorporating source data. In this paper, we theoretically analyze the maximum explorable safe regions of conventional safe learning methods. Furthermore, we empirically demonstrate that our approach 1) learns a task with lower data consumption, 2) globally explores multiple disjoint safe regions under guidance of the source knowledge, and 3) operates with computation comparable to conventional safe learning methods. △ Less

Submitted 15 April, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

arXiv:2402.11341 [pdf, other]

Between- and Within-Cluster Spearman Rank Correlations

Authors: Shengxin Tu, Chun Li, Bryan E. Shepherd

Abstract: Clustered data are common in practice. Clustering arises when subjects are measured repeatedly, or subjects are nested in groups (e.g., households, schools). It is often of interest to evaluate the correlation between two variables with clustered data. There are three commonly used Pearson correlation coefficients (total, between-, and within-cluster), which together provide an enriched perspectiv… ▽ More Clustered data are common in practice. Clustering arises when subjects are measured repeatedly, or subjects are nested in groups (e.g., households, schools). It is often of interest to evaluate the correlation between two variables with clustered data. There are three commonly used Pearson correlation coefficients (total, between-, and within-cluster), which together provide an enriched perspective of the correlation. However, these Pearson correlation coefficients are sensitive to extreme values and skewed distributions. They also depend on the scale of the data and are not applicable to ordered categorical data. Current non-parametric measures for clustered data are only for the total correlation. Here we define population parameters for the between- and within-cluster Spearman rank correlations. The definitions are natural extensions of the Pearson between- and within-cluster correlations to the rank scale. We show that the total Spearman rank correlation approximates a weighted sum of the between- and within-cluster Spearman rank correlations, where the weights are functions of rank intraclass correlations of the two random variables. We also discuss the equivalence between the within-cluster Spearman rank correlation and the covariate-adjusted partial Spearman rank correlation. Furthermore, we describe estimation and inference for the three Spearman rank correlations, conduct simulations to evaluate the performance of our estimators, and illustrate their use with data from a longitudinal biomarker study and a clustered randomized trial. △ Less

Submitted 17 February, 2024; originally announced February 2024.

arXiv:2402.09469 [pdf, other]

Fourier Circuits in Neural Networks: Unlocking the Potential of Large Language Models in Mathematical Reasoning and Modular Arithmetic

Authors: Jiuxiang Gu, Chenyang Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Tianyi Zhou

Abstract: In the evolving landscape of machine learning, a pivotal challenge lies in deciphering the internal representations harnessed by neural networks and Transformers. Building on recent progress toward comprehending how networks execute distinct target functions, our study embarks on an exploration of the underlying reasons behind networks adopting specific computational strategies. We direct our focu… ▽ More In the evolving landscape of machine learning, a pivotal challenge lies in deciphering the internal representations harnessed by neural networks and Transformers. Building on recent progress toward comprehending how networks execute distinct target functions, our study embarks on an exploration of the underlying reasons behind networks adopting specific computational strategies. We direct our focus to the complex algebraic learning task of modular addition involving $k$ inputs. Our research presents a thorough analytical characterization of the features learned by stylized one-hidden layer neural networks and one-layer Transformers in addressing this task. A cornerstone of our theoretical framework is the elucidation of how the principle of margin maximization shapes the features adopted by one-hidden layer neural networks. Let $p$ denote the modulus, $D_p$ denote the dataset of modular arithmetic with $k$ inputs and $m$ denote the network width. We demonstrate that a neuron count of $ m \geq 2^{2k-2} \cdot (p-1) $, these networks attain a maximum $ L_{2,k+1} $-margin on the dataset $ D_p $. Furthermore, we establish that each hidden-layer neuron aligns with a specific Fourier spectrum, integral to solving modular addition problems. By correlating our findings with the empirical observations of similar studies, we contribute to a deeper comprehension of the intrinsic computational mechanisms of neural networks. Furthermore, we observe similar computational mechanisms in the attention matrix of the one-layer Transformer. This research stands as a significant stride in unraveling their operation complexities, particularly in the realm of complex algebraic tasks. △ Less

Submitted 24 May, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

Comments: Update Section 5.3; clean up problem setup

arXiv:2402.02459 [pdf, other]

On Minimum Trace Factor Analysis -- An Old Song Sung to a New Tune

Authors: C. Li, A. Shkolnik

Abstract: Dimensionality reduction methods, such as principal component analysis (PCA) and factor analysis, are central to many problems in data science. There are, however, serious and well-understood challenges to finding robust low dimensional approximations for data with significant heteroskedastic noise. This paper introduces a relaxed version of Minimum Trace Factor Analysis (MTFA), a convex optimizat… ▽ More Dimensionality reduction methods, such as principal component analysis (PCA) and factor analysis, are central to many problems in data science. There are, however, serious and well-understood challenges to finding robust low dimensional approximations for data with significant heteroskedastic noise. This paper introduces a relaxed version of Minimum Trace Factor Analysis (MTFA), a convex optimization method with roots dating back to the work of Ledermann in 1940. This relaxation is particularly effective at not overfitting to heteroskedastic perturbations and addresses the commonly cited Heywood cases in factor analysis and the recently identified "curse of ill-conditioning" for existing spectral methods. We provide theoretical guarantees on the accuracy of the resulting low rank subspace and the convergence rate of the proposed algorithm to compute that matrix. We develop a number of interesting connections to existing methods, including HeteroPCA, Lasso, and Soft-Impute, to fill an important gap in the already large literature on low rank matrix estimation. Numerical experiments benchmark our results against several recent proposals for dealing with heteroskedastic noise. △ Less

Submitted 4 February, 2024; originally announced February 2024.

arXiv:2402.02368 [pdf, other]

Timer: Generative Pre-trained Transformers Are Large Time Series Models

Authors: Yong Liu, Haoran Zhang, Chenyu Li, Xiangdong Huang, Jianmin Wang, Mingsheng Long

Abstract: Deep learning has contributed remarkably to the advancement of time series analysis. Still, deep models can encounter performance bottlenecks in real-world data-scarce scenarios, which can be concealed due to the performance saturation with small models on current benchmarks. Meanwhile, large models have demonstrated great powers in these scenarios through large-scale pre-training. Continuous prog… ▽ More Deep learning has contributed remarkably to the advancement of time series analysis. Still, deep models can encounter performance bottlenecks in real-world data-scarce scenarios, which can be concealed due to the performance saturation with small models on current benchmarks. Meanwhile, large models have demonstrated great powers in these scenarios through large-scale pre-training. Continuous progress has been achieved with the emergence of large language models, exhibiting unprecedented abilities such as few-shot generalization, scalability, and task generality, which are however absent in small deep models. To change the status quo of training scenario-specific small models from scratch, this paper aims at the early development of large time series models (LTSM). During pre-training, we curate large-scale datasets with up to 1 billion time points, unify heterogeneous time series into single-series sequence (S3) format, and develop the GPT-style architecture toward LTSMs. To meet diverse application needs, we convert forecasting, imputation, and anomaly detection of time series into a unified generative task. The outcome of this study is a Time Series Transformer (Timer), which is generative pre-trained by next token prediction and adapted to various downstream tasks with promising capabilities as an LTSM. Code and datasets are available at: https://github.com/thuml/Large-Time-Series-Model. △ Less

Submitted 4 June, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

arXiv:2401.09719 [pdf, ps, other]

Kernel-based multi-marker tests of association based on the accelerated failure time model

Authors: Chenxi Li, Di Wu, Qing Lu

Abstract: Kernel-based multi-marker tests for survival outcomes use primarily the Cox model to adjust for covariates. The proportional hazards assumption made by the Cox model could be unrealistic, especially in the long-term follow-up. We develop a suite of novel multi-marker survival tests for genetic association based on the accelerated failure time model, which is a popular alternative to the Cox model… ▽ More Kernel-based multi-marker tests for survival outcomes use primarily the Cox model to adjust for covariates. The proportional hazards assumption made by the Cox model could be unrealistic, especially in the long-term follow-up. We develop a suite of novel multi-marker survival tests for genetic association based on the accelerated failure time model, which is a popular alternative to the Cox model due to its direct physical interpretation. The tests are based on the asymptotic distributions of their test statistics and are thus computationally efficient. The association tests can account for the heterogeneity of genetic effects across sub-populations/individuals to increase the power. All the new tests can deal with competing risks and left truncation. Moreover, we develop small-sample corrections to the tests to improve their accuracy under small samples. Extensive numerical experiments show that the new tests perform very well in various scenarios. An application to a genetic dataset of Alzheimer's disease illustrates the tests' practical utility. △ Less

Submitted 17 January, 2024; originally announced January 2024.

arXiv:2312.16607 [pdf, other]

A Polarization and Radiomics Feature Fusion Network for the Classification of Hepatocellular Carcinoma and Intrahepatic Cholangiocarcinoma

Authors: Jia Dong, Yao Yao, Liyan Lin, Yang Dong, Jiachen Wan, Ran Peng, Chao Li, Hui Ma

Abstract: Classifying hepatocellular carcinoma (HCC) and intrahepatic cholangiocarcinoma (ICC) is a critical step in treatment selection and prognosis evaluation for patients with liver diseases. Traditional histopathological diagnosis poses challenges in this context. In this study, we introduce a novel polarization and radiomics feature fusion network, which combines polarization features obtained from Mu… ▽ More Classifying hepatocellular carcinoma (HCC) and intrahepatic cholangiocarcinoma (ICC) is a critical step in treatment selection and prognosis evaluation for patients with liver diseases. Traditional histopathological diagnosis poses challenges in this context. In this study, we introduce a novel polarization and radiomics feature fusion network, which combines polarization features obtained from Mueller matrix images of liver pathological samples with radiomics features derived from corresponding pathological images to classify HCC and ICC. Our fusion network integrates a two-tier fusion approach, comprising early feature-level fusion and late classification-level fusion. By harnessing the strengths of polarization imaging techniques and image feature-based machine learning, our proposed fusion network significantly enhances classification accuracy. Notably, even at reduced imaging resolutions, the fusion network maintains robust performance due to the additional information provided by polarization features, which may not align with human visual perception. Our experimental results underscore the potential of this fusion network as a powerful tool for computer-aided diagnosis of HCC and ICC, showcasing the benefits and prospects of integrating polarization imaging techniques into the current image-intensive digital pathological diagnosis. We aim to contribute this innovative approach to top-tier journals, offering fresh insights and valuable tools in the fields of medical imaging and cancer diagnosis. By introducing polarization imaging into liver cancer classification, we demonstrate its interdisciplinary potential in addressing challenges in medical image analysis, promising advancements in medical imaging and cancer diagnosis. △ Less

Submitted 27 December, 2023; originally announced December 2023.

arXiv:2312.14165 [pdf, other]

Spatiotemporal risk prediction for infectious disease spread and mortality

Authors: Catherine Li, Daniel Lazarev

Abstract: With the outbreak of the COVID-19 pandemic, various studies have focused on predicting the trajectory and risk factors of the virus and its variants. Building on previous work that addressed this problem using genetic and epidemiological data, we introduce a method, Geo Score, that also incorporates geographic, socioeconomic, and demographic data to estimate infection and mortality risk by region… ▽ More With the outbreak of the COVID-19 pandemic, various studies have focused on predicting the trajectory and risk factors of the virus and its variants. Building on previous work that addressed this problem using genetic and epidemiological data, we introduce a method, Geo Score, that also incorporates geographic, socioeconomic, and demographic data to estimate infection and mortality risk by region and time. We employ gradient descent to find the optimal weights of the factors' significance in determining risk. Such spatiotemporal risk prediction is important for informed public health decision-making so that individuals are aware of the risks of travel during an epidemic or pandemic, and, perhaps more importantly, so that policymakers know how to triage limited resources during a crisis. We apply our method to New York City COVID-19 data from 2020, predicting ZIP code-level COVID-19 risk for 2021. △ Less

Submitted 5 December, 2023; originally announced December 2023.

Comments: 16 pages, 19 figures, completed as a part of MIT PRIMES program, to be presented in Joint Mathematics Meetings 2024

arXiv:2312.10958 [pdf, ps, other]

Large-sample properties of multiple imputation estimators for parameters of logistic regression with covariates missing at random separately or simultaneously

Authors: Phuoc-Loc Tran, Shen-Ming Lee, Truong-Nhat Le, Chin-Shang Li

Abstract: We consider logistic regression including two sets of discrete or categorical covariates that are missing at random (MAR) separately or simultaneously. We examine the asymptotic properties of two multiple imputation (MI) estimators, given in the study of Lee at al. (2023), for the parameters of the logistic regression model with both sets of discrete or categorical covariates that are MAR separate… ▽ More We consider logistic regression including two sets of discrete or categorical covariates that are missing at random (MAR) separately or simultaneously. We examine the asymptotic properties of two multiple imputation (MI) estimators, given in the study of Lee at al. (2023), for the parameters of the logistic regression model with both sets of discrete or categorical covariates that are MAR separately or simultaneously. The proposed estimated asymptotic variances of the two MI estimators address a limitation observed with Rubin's type estimated variances, which lead to underestimate the variances of the two MI estimators (Rubin, 1987). Simulation results demonstrate that our two proposed MI methods outperform the complete-case, semiparametric inverse probability weighting, random forest MI using chained equations, and stochastic approximation of expectation-maximization methods. To illustrate the methodology's practical application, we provide a real data example from a survey conducted in the Feng Chia night market in Taichung City, Taiwan. △ Less

Submitted 18 December, 2023; originally announced December 2023.

arXiv:2312.05802 [pdf, other]

Enhancing Scalability in Bayesian Nonparametric Factor Analysis of Spatiotemporal Data

Authors: Yifan Cheng, Cheng Li

Abstract: This manuscript puts forward novel practicable spatiotemporal Bayesian factor analysis frameworks computationally feasible for moderate to large data. Our models exhibit significantly enhanced computational scalability and storage efficiency, deliver high overall modeling performances, and possess powerful inferential capabilities for adequately predicting outcomes at future time points or new spa… ▽ More This manuscript puts forward novel practicable spatiotemporal Bayesian factor analysis frameworks computationally feasible for moderate to large data. Our models exhibit significantly enhanced computational scalability and storage efficiency, deliver high overall modeling performances, and possess powerful inferential capabilities for adequately predicting outcomes at future time points or new spatial locations and satisfactorily clustering spatial locations into regions with similar temporal trajectories, a frequently encountered crucial task. We integrate on top of a baseline separable factor model with temporally dependent latent factors and spatially dependent factor loadings under a probit stick breaking process (PSBP) prior a new slice sampling algorithm that permits unknown varying numbers of spatial mixture components across all factors and guarantees them to be non-increasing through the MCMC iterations, thus considerably enhancing model flexibility, efficiency, and scalability. We further introduce a novel spatial latent nearest-neighbor Gaussian process (NNGP) prior and new sequential updating algorithms for the spatially varying latent variables in the PSBP prior, thereby attaining high spatial scalability. The markedly accelerated posterior sampling and spatial prediction as well as the great modeling and inferential performances of our models are substantiated by our simulation experiments. △ Less

Submitted 2 June, 2024; v1 submitted 10 December, 2023; originally announced December 2023.

Comments: Added summaries of computational complexity and memory improvements brought about by our three novelties (the new Appendix H)

arXiv:2312.05515 [pdf, other]

doi 10.1109/TPWRS.2023.3341458

Exploration of Superposition Theorem in Spectrum Space for Composite Event Analysis in an ADN

Authors: Xing He, Qian Ai, Yuezhong Tang, Robert Qiu, Canbing Li

Abstract: This study presents a formulation of the Superposition Theorem (ST) in the spectrum space, tailored for the analysis of composite events in an active distribution network (ADN). Our formulated ST enables a quantitative analysis on a composite event, uncovering the property of additivity among independent atom events in the spectrum space. This contribution is a significant addition to the existing… ▽ More This study presents a formulation of the Superposition Theorem (ST) in the spectrum space, tailored for the analysis of composite events in an active distribution network (ADN). Our formulated ST enables a quantitative analysis on a composite event, uncovering the property of additivity among independent atom events in the spectrum space. This contribution is a significant addition to the existing literature and has profound implications in various application scenarios. To accomplish this, we leverage random matrix theory (RMT), specifically the asymptotic empirical spectral distribution, Stieltjes transform, and R transform. These mathematical tools establish a nonlinear, model-free, and unsupervised addition operation in the spectrum space. Comprehensive details, including a related roadmap,theorems, deductions, and proofs, are provided in this work. Case studies, utilizing field data, validate our newly derived ST formulation by demonstrating a remarkable performance. Our ST formulation is model-free, non-linear, non-supervised, theory-guided, and uncertainty-insensitive, making it a valuable asset in the realm of composite event analysis in ADN. △ Less

Submitted 9 December, 2023; originally announced December 2023.

Comments: 12 pages. Accepted by IEEE TPWRS

arXiv:2311.17326 [pdf, other]

Mostly Beneficial Clustering: Aggregating Data for Operational Decision Making

Authors: Chengzhang Li, Zhenkang Peng, Ying Rong

Abstract: With increasingly volatile market conditions and rapid product innovations, operational decision-making for large-scale systems entails solving thousands of problems with limited data. Data aggregation is proposed to combine the data across problems to improve the decisions obtained by solving those problems individually. We propose a novel cluster-based Shrunken-SAA approach that can exploit the… ▽ More With increasingly volatile market conditions and rapid product innovations, operational decision-making for large-scale systems entails solving thousands of problems with limited data. Data aggregation is proposed to combine the data across problems to improve the decisions obtained by solving those problems individually. We propose a novel cluster-based Shrunken-SAA approach that can exploit the cluster structure among problems when implementing the data aggregation approaches. We prove that, as the number of problems grows, leveraging the given cluster structure among problems yields additional benefits over the data aggregation approaches that neglect such structure. When the cluster structure is unknown, we show that unveiling the cluster structure, even at the cost of a few data points, can be beneficial, especially when the distance between clusters of problems is substantial. Our proposed approach can be extended to general cost functions under mild conditions. When the number of problems gets large, the optimality gap of our proposed approach decreases exponentially in the distance between the clusters. We explore the performance of the proposed approach through the application of managing newsvendor systems via numerical experiments. We investigate the impacts of distance metrics between problem instances on the performance of the cluster-based Shrunken-SAA approach with synthetic data. We further validate our proposed approach with real data and highlight the advantages of cluster-based data aggregation, especially in the small-data large-scale regime, compared to the existing approaches. △ Less

Submitted 17 December, 2023; v1 submitted 28 November, 2023; originally announced November 2023.

arXiv:2311.14652 [pdf, other]

One Pass Streaming Algorithm for Super Long Token Attention Approximation in Sublinear Space

Authors: Raghav Addanki, Chenyang Li, Zhao Song, Chiwun Yang

Abstract: Attention computation takes both the time complexity of $O(n^2)$ and the space complexity of $O(n^2)$ simultaneously, which makes deploying Large Language Models (LLMs) in streaming applications that involve long contexts requiring substantial computational resources. In recent OpenAI DevDay (Nov 6, 2023), OpenAI released a new model that is able to support a 128K-long document, in our paper, we f… ▽ More Attention computation takes both the time complexity of $O(n^2)$ and the space complexity of $O(n^2)$ simultaneously, which makes deploying Large Language Models (LLMs) in streaming applications that involve long contexts requiring substantial computational resources. In recent OpenAI DevDay (Nov 6, 2023), OpenAI released a new model that is able to support a 128K-long document, in our paper, we focus on the memory-efficient issue when context length $n$ is much greater than 128K ($n \gg 2^d$). Considering a single-layer self-attention with Query, Key, and Value matrices $Q, K, V \in \mathbb{R}^{n \times d}$, the polynomial method approximates the attention output $T \in \mathbb{R}^{n \times d}$. It accomplishes this by constructing $U_1, U_2 \in \mathbb{R}^{n \times t}$ to expedite attention ${\sf Attn}(Q, K, V)$ computation within $n^{1+o(1)}$ time executions. Despite this, computing the approximated attention matrix $U_1U_2^\top \in \mathbb{R}^{n \times n}$ still necessitates $O(n^2)$ space, leading to significant memory usage. In response to these challenges, we introduce a new algorithm that only reads one pass of the data in a streaming fashion. This method employs sublinear space $o(n)$ to store three sketch matrices, alleviating the need for exact $K, V$ storage. Notably, our algorithm exhibits exceptional memory-efficient performance with super-long tokens. As the token length $n$ increases, our error guarantee diminishes while the memory usage remains nearly constant. This unique attribute underscores the potential of our technique in efficiently handling LLMs in streaming applications. △ Less

Submitted 5 February, 2024; v1 submitted 24 November, 2023; originally announced November 2023.

arXiv:2311.02516 [pdf, other]

Forward $χ^2$ Divergence Based Variational Importance Sampling

Authors: Chengrui Li, Yule Wang, Weihan Li, Anqi Wu

Abstract: Maximizing the log-likelihood is a crucial aspect of learning latent variable models, and variational inference (VI) stands as the commonly adopted method. However, VI can encounter challenges in achieving a high log-likelihood when dealing with complicated posterior distributions. In response to this limitation, we introduce a novel variational importance sampling (VIS) approach that directly est… ▽ More Maximizing the log-likelihood is a crucial aspect of learning latent variable models, and variational inference (VI) stands as the commonly adopted method. However, VI can encounter challenges in achieving a high log-likelihood when dealing with complicated posterior distributions. In response to this limitation, we introduce a novel variational importance sampling (VIS) approach that directly estimates and maximizes the log-likelihood. VIS leverages the optimal proposal distribution, achieved by minimizing the forward $χ^2$ divergence, to enhance log-likelihood estimation. We apply VIS to various popular latent variable models, including mixture models, variational auto-encoders, and partially observable generalized linear models. Results demonstrate that our approach consistently outperforms state-of-the-art baselines, both in terms of log-likelihood and model parameter estimation. △ Less

Submitted 2 February, 2024; v1 submitted 4 November, 2023; originally announced November 2023.

arXiv:2311.00923 [pdf, other]

A Review and Roadmap of Deep Causal Model from Different Causal Structures and Representations

Authors: Hang Chen, Keqing Du, Chenguang Li, Xinyu Yang

Abstract: The fusion of causal models with deep learning introducing increasingly intricate data sets, such as the causal associations within images or between textual components, has surfaced as a focal research area. Nonetheless, the broadening of original causal concepts and theories to such complex, non-statistical data has been met with serious challenges. In response, our study proposes redefinitions… ▽ More The fusion of causal models with deep learning introducing increasingly intricate data sets, such as the causal associations within images or between textual components, has surfaced as a focal research area. Nonetheless, the broadening of original causal concepts and theories to such complex, non-statistical data has been met with serious challenges. In response, our study proposes redefinitions of causal data into three distinct categories from the standpoint of causal structure and representation: definite data, semi-definite data, and indefinite data. Definite data chiefly pertains to statistical data used in conventional causal scenarios, while semi-definite data refers to a spectrum of data formats germane to deep learning, including time-series, images, text, and others. Indefinite data is an emergent research sphere inferred from the progression of data forms by us. To comprehensively present these three data paradigms, we elaborate on their formal definitions, differences manifested in datasets, resolution pathways, and development of research. We summarize key tasks and achievements pertaining to definite and semi-definite data from myriad research undertakings, present a roadmap for indefinite data, beginning with its current research conundrums. Lastly, we classify and scrutinize the key datasets presently utilized within these three paradigms. △ Less

Submitted 1 November, 2023; originally announced November 2023.

Comments: under review

arXiv:2310.00729 [pdf, other]

Spectral Neural Networks: Approximation Theory and Optimization Landscape

Authors: Chenghui Li, Rishi Sonthalia, Nicolas Garcia Trillos

Abstract: There is a large variety of machine learning methodologies that are based on the extraction of spectral geometric information from data. However, the implementations of many of these methods often depend on traditional eigensolvers, which present limitations when applied in practical online big data scenarios. To address some of these challenges, researchers have proposed different strategies for… ▽ More There is a large variety of machine learning methodologies that are based on the extraction of spectral geometric information from data. However, the implementations of many of these methods often depend on traditional eigensolvers, which present limitations when applied in practical online big data scenarios. To address some of these challenges, researchers have proposed different strategies for training neural networks as alternatives to traditional eigensolvers, with one such approach known as Spectral Neural Network (SNN). In this paper, we investigate key theoretical aspects of SNN. First, we present quantitative insights into the tradeoff between the number of neurons and the amount of spectral geometric information a neural network learns. Second, we initiate a theoretical exploration of the optimization landscape of SNN's objective to shed light on the training dynamics of SNN. Unlike typical studies of convergence to global solutions of NN training dynamics, SNN presents an additional complexity due to its non-convex ambient loss function. △ Less

Submitted 1 October, 2023; originally announced October 2023.

arXiv:2309.15699 [pdf, other]

STRAW: Structure-Adaptive Weighting Procedure for Large-Scale Spatial Multiple Testing

Authors: Pengfei Wang, Pengyu Yan, Canhui Li

Abstract: The problem of large-scale spatial multiple testing is often encountered in various scientific research fields, where the signals are usually enriched on some regions while sparse on others. To integrate spatial structure information from nearby locations, we propose a novel approach, called {\bf STR}ucture-{\bf A}daptive {\bf W}eighting (STRAW) procedure, for large-scale spatial multiple testing.… ▽ More The problem of large-scale spatial multiple testing is often encountered in various scientific research fields, where the signals are usually enriched on some regions while sparse on others. To integrate spatial structure information from nearby locations, we propose a novel approach, called {\bf STR}ucture-{\bf A}daptive {\bf W}eighting (STRAW) procedure, for large-scale spatial multiple testing. The STRAW procedure is capable of handling a broad range of spatial settings by leveraging a class of weighted p-values and is fully data-driven. Theoretical results show that the proposed method controls the false discovery rate (FDR) at the pre-specified level under some mild conditions. In practice, the local sparsity level, defined as the probability of the null hypothesis being not true, is commonly unknown. To address this issue, we develop a new method for estimating the local sparsity level by employing the kernel-smooth local false discovery rate (Lfdr) statistic. The superior numerical performance of the STRAW procedure is demonstrated by performing extensive simulation studies and a real data analysis. △ Less

Submitted 27 September, 2023; originally announced September 2023.

arXiv:2309.09420 [pdf, other]

doi 10.1080/01621459.2023.2261658

Discovery and inference of a causal network with hidden confounding

Authors: Li Chen, Chunlin Li, Xiaotong Shen, Wei Pan

Abstract: This article proposes a novel causal discovery and inference method called GrIVET for a Gaussian directed acyclic graph with unmeasured confounders. GrIVET consists of an order-based causal discovery method and a likelihood-based inferential procedure. For causal discovery, we generalize the existing peeling algorithm to estimate the ancestral relations and candidate instruments in the presence of… ▽ More This article proposes a novel causal discovery and inference method called GrIVET for a Gaussian directed acyclic graph with unmeasured confounders. GrIVET consists of an order-based causal discovery method and a likelihood-based inferential procedure. For causal discovery, we generalize the existing peeling algorithm to estimate the ancestral relations and candidate instruments in the presence of hidden confounders. Based on this, we propose a new procedure for instrumental variable estimation of each direct effect by separating it from any mediation effects. For inference, we develop a new likelihood ratio test of multiple causal effects that is able to account for the unmeasured confounders. Theoretically, we prove that the proposed method has desirable guarantees, including robustness to invalid instruments and uncertain interventions, estimation consistency, low-order polynomial time complexity, and validity of asymptotic inference. Numerically, GrIVET performs well and compares favorably against state-of-the-art competitors. Furthermore, we demonstrate the utility and effectiveness of the proposed method through an application inferring regulatory pathways from Alzheimer's disease gene expression data. △ Less

Submitted 17 September, 2023; originally announced September 2023.

Comments: 27 pages, 4 figures, 3 tables. The manuscript is accepted by Journal of the American Statistical Association

arXiv:2307.08496 [pdf, other]

Can We Trust Race Prediction?

Authors: Cangyuan Li

Abstract: In the absence of sensitive race and ethnicity data, researchers, regulators, and firms alike turn to proxies. In this paper, I train a Bidirectional Long Short-Term Memory (BiLSTM) model on a novel dataset of voter registration data from all 50 US states and create an ensemble that achieves up to 36.8% higher out of sample (OOS) F1 scores than the best performing machine learning models in the li… ▽ More In the absence of sensitive race and ethnicity data, researchers, regulators, and firms alike turn to proxies. In this paper, I train a Bidirectional Long Short-Term Memory (BiLSTM) model on a novel dataset of voter registration data from all 50 US states and create an ensemble that achieves up to 36.8% higher out of sample (OOS) F1 scores than the best performing machine learning models in the literature. Additionally, I construct the most comprehensive database of first and surname distributions in the US in order to improve the coverage and accuracy of Bayesian Improved Surname Geocoding (BISG) and Bayesian Improved Firstname Surname Geocoding (BIFSG). Finally, I provide the first high-quality benchmark dataset in order to fairly compare existing models and aid future model developers. △ Less

Submitted 7 August, 2023; v1 submitted 17 July, 2023; originally announced July 2023.

arXiv:2307.00467 [pdf, other]

MissDiff: Training Diffusion Models on Tabular Data with Missing Values

Authors: Yidong Ouyang, Liyan Xie, Chongxuan Li, Guang Cheng

Abstract: The diffusion model has shown remarkable performance in modeling data distributions and synthesizing data. However, the vanilla diffusion model requires complete or fully observed data for training. Incomplete data is a common issue in various real-world applications, including healthcare and finance, particularly when dealing with tabular datasets. This work presents a unified and principled diff… ▽ More The diffusion model has shown remarkable performance in modeling data distributions and synthesizing data. However, the vanilla diffusion model requires complete or fully observed data for training. Incomplete data is a common issue in various real-world applications, including healthcare and finance, particularly when dealing with tabular datasets. This work presents a unified and principled diffusion-based framework for learning from data with missing values under various missing mechanisms. We first observe that the widely adopted "impute-then-generate" pipeline may lead to a biased learning objective. Then we propose to mask the regression loss of Denoising Score Matching in the training phase. We prove the proposed method is consistent in learning the score of data distributions, and the proposed training objective serves as an upper bound for the negative likelihood in certain cases. The proposed framework is evaluated on multiple tabular datasets using realistic and efficacious metrics and is demonstrated to outperform state-of-the-art diffusion model on tabular data with "impute-then-generate" pipeline by a large margin. △ Less

Submitted 1 July, 2023; originally announced July 2023.

Comments: 22 pages, short version is accepted by ICML workshop on Structured Probabilistic Inference & Generative Modeling 2023

Report number: 22

arXiv:2307.00126 [pdf, other]

Accelerating Inexact HyperGradient Descent for Bilevel Optimization

Authors: Haikuo Yang, Luo Luo, Chris Junchi Li, Michael I. Jordan

Abstract: We present a method for solving general nonconvex-strongly-convex bilevel optimization problems. Our method -- the \emph{Restarted Accelerated HyperGradient Descent} (\texttt{RAHGD}) method -- finds an $ε$-first-order stationary point of the objective with $\tilde{\mathcal{O}}(κ^{3.25}ε^{-1.75})$ oracle complexity, where $κ$ is the condition number of the lower-level objective and $ε$ is the desir… ▽ More We present a method for solving general nonconvex-strongly-convex bilevel optimization problems. Our method -- the \emph{Restarted Accelerated HyperGradient Descent} (\texttt{RAHGD}) method -- finds an $ε$-first-order stationary point of the objective with $\tilde{\mathcal{O}}(κ^{3.25}ε^{-1.75})$ oracle complexity, where $κ$ is the condition number of the lower-level objective and $ε$ is the desired accuracy. We also propose a perturbed variant of \texttt{RAHGD} for finding an $\big(ε,\mathcal{O}(κ^{2.5}\sqrtε\,)\big)$-second-order stationary point within the same order of oracle complexity. Our results achieve the best-known theoretical guarantees for finding stationary points in bilevel optimization and also improve upon the existing upper complexity bound for finding second-order stationary points in nonconvex-strongly-concave minimax optimization problems, setting a new state-of-the-art benchmark. Empirical studies are conducted to validate the theoretical results in this paper. △ Less

Submitted 30 June, 2023; originally announced July 2023.

arXiv:2306.17759 [pdf, other]

The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit

Authors: Lorenzo Noci, Chuning Li, Mufan Bill Li, Bobby He, Thomas Hofmann, Chris Maddison, Daniel M. Roy

Abstract: In deep learning theory, the covariance matrix of the representations serves as a proxy to examine the network's trainability. Motivated by the success of Transformers, we study the covariance matrix of a modified Softmax-based attention model with skip connections in the proportional limit of infinite-depth-and-width. We show that at initialization the limiting distribution can be described by a… ▽ More In deep learning theory, the covariance matrix of the representations serves as a proxy to examine the network's trainability. Motivated by the success of Transformers, we study the covariance matrix of a modified Softmax-based attention model with skip connections in the proportional limit of infinite-depth-and-width. We show that at initialization the limiting distribution can be described by a stochastic differential equation (SDE) indexed by the depth-to-width ratio. To achieve a well-defined stochastic limit, the Transformer's attention mechanism is modified by centering the Softmax output at identity, and scaling the Softmax logits by a width-dependent temperature parameter. We examine the stability of the network through the corresponding SDE, showing how the scale of both the drift and diffusion can be elegantly controlled with the aid of residual connections. The existence of a stable SDE implies that the covariance structure is well-behaved, even for very large depth and width, thus preventing the notorious issues of rank degeneracy in deep attention models. Finally, we show, through simulations, that the SDE provides a surprisingly good description of the corresponding finite-size model. We coin the name shaped Transformer for these architectural modifications. △ Less

Submitted 9 December, 2023; v1 submitted 30 June, 2023; originally announced June 2023.

arXiv:2306.13870 [pdf, ps, other]

Post-Selection Inference for the Cox Model with Interval-Censored Data

Authors: Jianrui Zhang, Chenxi Li, Haolei Weng

Abstract: We develop a post-selection inference method for the Cox proportional hazards model with interval-censored data, which provides asymptotically valid p-values and confidence intervals conditional on the model selected by lasso. The method is based on a pivotal quantity that is shown to converge to a uniform distribution under local alternatives. The proof can be adapted to many other regression mod… ▽ More We develop a post-selection inference method for the Cox proportional hazards model with interval-censored data, which provides asymptotically valid p-values and confidence intervals conditional on the model selected by lasso. The method is based on a pivotal quantity that is shown to converge to a uniform distribution under local alternatives. The proof can be adapted to many other regression models, which is illustrated by the extension to generalized linear models and the Cox model with right-censored data. Our method involves estimation of the efficient information matrix, for which several approaches are proposed with proof of their consistency. Thorough simulation studies show that our method has satisfactory performance in samples of modest sizes. The utility of the method is illustrated via an application to an Alzheimer's disease study. △ Less

Submitted 30 December, 2023; v1 submitted 24 June, 2023; originally announced June 2023.

Comments: 46 pages, 14 figures

arXiv:2305.17476 [pdf, other]

Toward Understanding Generative Data Augmentation

Authors: Chenyu Zheng, Guoqiang Wu, Chongxuan Li

Abstract: Generative data augmentation, which scales datasets by obtaining fake labeled examples from a trained conditional generative model, boosts classification performance in various learning tasks including (semi-)supervised learning, few-shot learning, and adversarially robust learning. However, little work has theoretically investigated the effect of generative data augmentation. To fill this gap, we… ▽ More Generative data augmentation, which scales datasets by obtaining fake labeled examples from a trained conditional generative model, boosts classification performance in various learning tasks including (semi-)supervised learning, few-shot learning, and adversarially robust learning. However, little work has theoretically investigated the effect of generative data augmentation. To fill this gap, we establish a general stability bound in this not independently and identically distributed (non-i.i.d.) setting, where the learned distribution is dependent on the original train set and generally not the same as the true distribution. Our theoretical result includes the divergence between the learned distribution and the true distribution. It shows that generative data augmentation can enjoy a faster learning rate when the order of divergence term is $o(\max\left( \log(m)β_m, 1 / \sqrt{m})\right)$, where $m$ is the train set size and $β_m$ is the corresponding stability constant. We further specify the learning setup to the Gaussian mixture model and generative adversarial nets. We prove that in both cases, though generative data augmentation does not enjoy a faster learning rate, it can improve the learning guarantees at a constant level when the train set is small, which is significant when the awful overfitting occurs. Simulation results on the Gaussian mixture model and empirical results on generative adversarial nets support our theoretical conclusions. Our code is available at https://github.com/ML-GSAI/Understanding-GDA. △ Less

Submitted 27 May, 2023; originally announced May 2023.

Comments: 39 pages

arXiv:2305.06172 [pdf, other]

Principal Feature Detection via $Φ$-Sobolev Inequalities

Authors: Matthew T. C. Li, Youssef Marzouk, Olivier Zahm

Abstract: We investigate the approximation of high-dimensional target measures as low-dimensional updates of a dominating reference measure. This approximation class replaces the associated density with the composition of: (i) a feature map that identifies the leading principal components or features of the target measure, relative to the reference, and (ii) a low-dimensional profile function. When the refe… ▽ More We investigate the approximation of high-dimensional target measures as low-dimensional updates of a dominating reference measure. This approximation class replaces the associated density with the composition of: (i) a feature map that identifies the leading principal components or features of the target measure, relative to the reference, and (ii) a low-dimensional profile function. When the reference measure satisfies a subspace $φ$-Sobolev inequality, we construct a computationally tractable approximation that yields certifiable error guarantees with respect to the Amari $α$-divergences. Our construction proceeds in two stages. First, for any feature map and any $α$-divergence, we obtain an analytical expression for the optimal profile function. Second, for linear feature maps, the principal features are obtained from eigenvectors of a matrix involving gradients of the log-density. Neither step requires explicit access to normalizing constants. Notably, by leveraging the $φ$-Sobolev inequalities, we demonstrate that these features universally certify approximation errors across the range of $α$-divergences $α\in (0,1]$. We then propose an application to Bayesian inverse problems and provide an analogous construction with approximation guarantees that hold in expectation over the data. We conclude with an extension of the proposed dimension reduction strategy to nonlinear feature maps. △ Less

Submitted 16 January, 2024; v1 submitted 10 May, 2023; originally announced May 2023.

Comments: To appear in Bernoulli, but this version contains both the main file and the supplementary material

arXiv:2305.05248 [pdf, ps, other]

Towards Understanding Generalization of Macro-AUC in Multi-label Learning

Authors: Guoqiang Wu, Chongxuan Li, Yilong Yin

Abstract: Macro-AUC is the arithmetic mean of the class-wise AUCs in multi-label learning and is commonly used in practice. However, its theoretical understanding is far lacking. Toward solving it, we characterize the generalization properties of various learning algorithms based on the corresponding surrogate losses w.r.t. Macro-AUC. We theoretically identify a critical factor of the dataset affecting the… ▽ More Macro-AUC is the arithmetic mean of the class-wise AUCs in multi-label learning and is commonly used in practice. However, its theoretical understanding is far lacking. Toward solving it, we characterize the generalization properties of various learning algorithms based on the corresponding surrogate losses w.r.t. Macro-AUC. We theoretically identify a critical factor of the dataset affecting the generalization bounds: \emph{the label-wise class imbalance}. Our results on the imbalance-aware error bounds show that the widely-used univariate loss-based algorithm is more sensitive to the label-wise class imbalance than the proposed pairwise and reweighted loss-based ones, which probably implies its worse performance. Moreover, empirical results on various datasets corroborate our theory findings. To establish it, technically, we propose a new (and more general) McDiarmid-type concentration inequality, which may be of independent interest. △ Less

Submitted 2 June, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

Comments: ICML 2023

arXiv:2305.01143 [pdf, other]

Understanding the Generalization Ability of Deep Learning Algorithms: A Kernelized Renyi's Entropy Perspective

Authors: Yuxin Dong, Tieliang Gong, Hong Chen, Chen Li

Abstract: Recently, information theoretic analysis has become a popular framework for understanding the generalization behavior of deep neural networks. It allows a direct analysis for stochastic gradient/Langevin descent (SGD/SGLD) learning algorithms without strong assumptions such as Lipschitz or convexity conditions. However, the current generalization error bounds within this framework are still far fr… ▽ More Recently, information theoretic analysis has become a popular framework for understanding the generalization behavior of deep neural networks. It allows a direct analysis for stochastic gradient/Langevin descent (SGD/SGLD) learning algorithms without strong assumptions such as Lipschitz or convexity conditions. However, the current generalization error bounds within this framework are still far from optimal, while substantial improvements on these bounds are quite challenging due to the intractability of high-dimensional information quantities. To address this issue, we first propose a novel information theoretical measure: kernelized Renyi's entropy, by utilizing operator representation in Hilbert space. It inherits the properties of Shannon's entropy and can be effectively calculated via simple random sampling, while remaining independent of the input dimension. We then establish the generalization error bounds for SGD/SGLD under kernelized Renyi's entropy, where the mutual information quantities can be directly calculated, enabling evaluation of the tightness of each intermediate step. We show that our information-theoretical bounds depend on the statistics of the stochastic gradients evaluated along with the iterates, and are rigorously tighter than the current state-of-the-art (SOTA) results. The theoretical findings are also supported by large-scale empirical studies1. △ Less

Submitted 1 May, 2023; originally announced May 2023.

arXiv:2303.09519 [pdf]

doi 10.21105/joss.05428

PyVBMC: Efficient Bayesian inference in Python

Authors: Bobby Huggins, Chengkun Li, Marlon Tobaben, Mikko J. Aarnos, Luigi Acerbi

Abstract: PyVBMC is a Python implementation of the Variational Bayesian Monte Carlo (VBMC) algorithm for posterior and model inference for black-box computational models (Acerbi, 2018, 2020). VBMC is an approximate inference method designed for efficient parameter estimation and model assessment when model evaluations are mildly-to-very expensive (e.g., a second or more) and/or noisy. Specifically, VBMC com… ▽ More PyVBMC is a Python implementation of the Variational Bayesian Monte Carlo (VBMC) algorithm for posterior and model inference for black-box computational models (Acerbi, 2018, 2020). VBMC is an approximate inference method designed for efficient parameter estimation and model assessment when model evaluations are mildly-to-very expensive (e.g., a second or more) and/or noisy. Specifically, VBMC computes: - a flexible (non-Gaussian) approximate posterior distribution of the model parameters, from which statistics and posterior samples can be easily extracted; - an approximation of the model evidence or marginal likelihood, a metric used for Bayesian model selection. PyVBMC can be applied to any computational or statistical model with up to roughly 10-15 continuous parameters, with the only requirement that the user can provide a Python function that computes the target log likelihood of the model, or an approximation thereof (e.g., an estimate of the likelihood obtained via simulation or Monte Carlo methods). PyVBMC is particularly effective when the model takes more than about a second per evaluation, with dramatic speed-ups of 1-2 orders of magnitude when compared to traditional approximate inference methods. Extensive benchmarks on both artificial test problems and a large number of real models from the computational sciences, particularly computational and cognitive neuroscience, show that VBMC generally - and often vastly - outperforms alternative methods for sample-efficient Bayesian inference, and is applicable to both exact and simulator-based models (Acerbi, 2018, 2019, 2020). PyVBMC brings this state-of-the-art inference algorithm to Python, along with an easy-to-use Pythonic interface for running the algorithm and manipulating and visualizing its results. △ Less

Submitted 27 June, 2023; v1 submitted 16 March, 2023; originally announced March 2023.

Comments: 6 pages, 1 figure. Published in The Journal of Open Source Software. Documentation is available at https://acerbilab.github.io/pyvbmc and source code is available at https://github.com/acerbilab/pyvbmc

Journal ref: The Journal of Open Source Software, 8(86), 2023, 5428

arXiv:2303.06815 [pdf, other]

On Model Compression for Neural Networks: Framework, Algorithm, and Convergence Guarantee

Authors: Chenyang Li, Jihoon Chung, Biao Cai, Haimin Wang, Xianlian Zhou, Bo Shen

Abstract: Model compression is a crucial part of deploying neural networks (NNs), especially when the memory and storage of computing devices are limited in many applications. This paper focuses on two model compression techniques: low-rank approximation and weight pruning in neural networks, which are very popular nowadays. However, training NN with low-rank approximation and weight pruning always suffers… ▽ More Model compression is a crucial part of deploying neural networks (NNs), especially when the memory and storage of computing devices are limited in many applications. This paper focuses on two model compression techniques: low-rank approximation and weight pruning in neural networks, which are very popular nowadays. However, training NN with low-rank approximation and weight pruning always suffers significant accuracy loss and convergence issues. In this paper, a holistic framework is proposed for model compression from a novel perspective of nonconvex optimization by designing an appropriate objective function. Then, we introduce NN-BCD, a block coordinate descent (BCD) algorithm to solve the nonconvex optimization. One advantage of our algorithm is that an efficient iteration scheme can be derived with closed-form, which is gradient-free. Therefore, our algorithm will not suffer from vanishing/exploding gradient problems. Furthermore, with the Kurdyka-Łojasiewicz (KŁ) property of our objective function, we show that our algorithm globally converges to a critical point at the rate of O(1/k), where k denotes the number of iterations. Lastly, extensive experiments with tensor train decomposition and weight pruning demonstrate the efficiency and superior performance of the proposed framework. Our code implementation is available at https://github.com/ChenyangLi-97/NN-BCD △ Less

Submitted 4 January, 2024; v1 submitted 12 March, 2023; originally announced March 2023.

Comments: 43 pages

arXiv:2303.05263 [pdf, other]

Fast post-process Bayesian inference with Variational Sparse Bayesian Quadrature

Authors: Chengkun Li, Grégoire Clarté, Martin Jørgensen, Luigi Acerbi

Abstract: In applied Bayesian inference scenarios, users may have access to a large number of pre-existing model evaluations, for example from maximum-a-posteriori (MAP) optimization runs. However, traditional approximate inference techniques make little to no use of this available information. We propose the framework of post-process Bayesian inference as a means to obtain a quick posterior approximation f… ▽ More In applied Bayesian inference scenarios, users may have access to a large number of pre-existing model evaluations, for example from maximum-a-posteriori (MAP) optimization runs. However, traditional approximate inference techniques make little to no use of this available information. We propose the framework of post-process Bayesian inference as a means to obtain a quick posterior approximation from existing target density evaluations, with no further model calls. Within this framework, we introduce Variational Sparse Bayesian Quadrature (VSBQ), a method for post-process approximate inference for models with black-box and potentially noisy likelihoods. VSBQ reuses existing target density evaluations to build a sparse Gaussian process (GP) surrogate model of the log posterior density function. Subsequently, we leverage sparse-GP Bayesian quadrature combined with variational inference to achieve fast approximate posterior inference over the surrogate. We validate our method on challenging synthetic scenarios and real-world applications from computational neuroscience. The experiments show that VSBQ builds high-quality posterior approximations by post-processing existing optimization traces, with no further model evaluations. △ Less

Submitted 18 June, 2024; v1 submitted 9 March, 2023; originally announced March 2023.

Comments: 52 pages, 14 figures

arXiv:2303.04880 [pdf, other]

doi 10.1002/sim.9864

Rank Intraclass Correlation for Clustered Data

Authors: Shengxin Tu, Chun Li, Donglin Zeng, Bryan E. Shepherd

Abstract: Clustered data are common in biomedical research. Observations in the same cluster are often more similar to each other than to observations from other clusters. The intraclass correlation coefficient (ICC), first introduced by R. A. Fisher, is frequently used to measure this degree of similarity. However, the ICC is sensitive to extreme values and skewed distributions, and depends on the scale of… ▽ More Clustered data are common in biomedical research. Observations in the same cluster are often more similar to each other than to observations from other clusters. The intraclass correlation coefficient (ICC), first introduced by R. A. Fisher, is frequently used to measure this degree of similarity. However, the ICC is sensitive to extreme values and skewed distributions, and depends on the scale of the data. It is also not applicable to ordered categorical data. We define the rank ICC as a natural extension of Fisher's ICC to the rank scale, and describe its corresponding population parameter. The rank ICC is simply interpreted as the rank correlation between a random pair of observations from the same cluster. We also extend the definition when the underlying distribution has more than two hierarchies. We describe estimation and inference procedures, show the asymptotic properties of our estimator, conduct simulations to evaluate its performance, and illustrate our method in three real data examples with skewed data, count data, and three-level ordered categorical data. △ Less

Submitted 28 July, 2023; v1 submitted 8 March, 2023; originally announced March 2023.

Journal ref: Statistics in Medicine. 2023; 42(24): 4333-4348

arXiv:2302.08097 [pdf, ps, other]

New $\sqrt{n}$-consistent, numerically stable higher-order influence function estimators

Authors: Lin Liu, Chang Li

Abstract: Higher-Order Influence Functions (HOIFs) provide a unified theory for constructing rate-optimal estimators for a large class of low-dimensional (smooth) statistical functionals/parameters (and sometimes even infinite-dimensional functions) that arise in substantive fields including epidemiology, economics, and the social sciences. Since the introduction of HOIFs by Robins et al. (2008), they have… ▽ More Higher-Order Influence Functions (HOIFs) provide a unified theory for constructing rate-optimal estimators for a large class of low-dimensional (smooth) statistical functionals/parameters (and sometimes even infinite-dimensional functions) that arise in substantive fields including epidemiology, economics, and the social sciences. Since the introduction of HOIFs by Robins et al. (2008), they have been viewed mostly as a theoretical benchmark rather than a useful tool for statistical practice. Works aimed to flip the script are scant, but a few recent papers Liu et al. (2017, 2021b) make some partial progress. In this paper, we take a fresh attempt at achieving this goal by constructing new, numerically stable HOIF estimators (or sHOIF estimators for short with ``s'' standing for ``stable'') with provable statistical, numerical, and computational guarantees. This new class of sHOIF estimators (up to the 2nd order) was foreshadowed in synthetic experiments conducted by Liu et al. (2020a). △ Less

Submitted 16 February, 2023; originally announced February 2023.

arXiv:2302.03178 [pdf, other]

doi 10.1080/01621459.2023.2179490

Nonlinear Causal Discovery with Confounders

Authors: Chunlin Li, Xiaotong Shen, Wei Pan

Abstract: This article introduces a causal discovery method to learn nonlinear relationships in a directed acyclic graph with correlated Gaussian errors due to confounding. First, we derive model identifiability under the sublinear growth assumption. Then, we propose a novel method, named the Deconfounded Functional Structure Estimation (DeFuSE), consisting of a deconfounding adjustment to remove the confou… ▽ More This article introduces a causal discovery method to learn nonlinear relationships in a directed acyclic graph with correlated Gaussian errors due to confounding. First, we derive model identifiability under the sublinear growth assumption. Then, we propose a novel method, named the Deconfounded Functional Structure Estimation (DeFuSE), consisting of a deconfounding adjustment to remove the confounding effects and a sequential procedure to estimate the causal order of variables. We implement DeFuSE via feedforward neural networks for scalable computation. Moreover, we establish the consistency of DeFuSE under an assumption called the strong causal minimality. In simulations, DeFuSE compares favorably against state-of-the-art competitors that ignore confounding or nonlinearity. Finally, we demonstrate the utility and effectiveness of the proposed approach with an application to gene regulatory network analysis. The Python implementation is available at https://github.com/chunlinli/defuse. △ Less

Submitted 13 February, 2023; v1 submitted 6 February, 2023; originally announced February 2023.

Comments: 28 pages, 4 figures, 3 tables

Journal ref: Journal of the American Statistical Association, 2023

arXiv:2302.02334 [pdf, other]

Revisiting Discriminative vs. Generative Classifiers: Theory and Implications

Authors: Chenyu Zheng, Guoqiang Wu, Fan Bao, Yue Cao, Chongxuan Li, Jun Zhu

Abstract: A large-scale deep model pre-trained on massive labeled or unlabeled data transfers well to downstream tasks. Linear evaluation freezes parameters in the pre-trained model and trains a linear classifier separately, which is efficient and attractive for transfer. However, little work has investigated the classifier in linear evaluation except for the default logistic regression. Inspired by the sta… ▽ More A large-scale deep model pre-trained on massive labeled or unlabeled data transfers well to downstream tasks. Linear evaluation freezes parameters in the pre-trained model and trains a linear classifier separately, which is efficient and attractive for transfer. However, little work has investigated the classifier in linear evaluation except for the default logistic regression. Inspired by the statistical efficiency of naive Bayes, the paper revisits the classical topic on discriminative vs. generative classifiers. Theoretically, the paper considers the surrogate loss instead of the zero-one loss in analyses and generalizes the classical results from binary cases to multiclass ones. We show that, under mild assumptions, multiclass naive Bayes requires $O(\log n)$ samples to approach its asymptotic error while the corresponding multiclass logistic regression requires $O(n)$ samples, where $n$ is the feature dimension. To establish it, we present a multiclass $\mathcal{H}$-consistency bound framework and an explicit bound for logistic loss, which are of independent interests. Simulation results on a mixture of Gaussian validate our theoretical findings. Experiments on various pre-trained deep vision models show that naive Bayes consistently converges faster as the number of data increases. Besides, naive Bayes shows promise in few-shot cases and we observe the "two regimes" phenomenon in pre-trained supervised models. Our code is available at https://github.com/ML-GSAI/Revisiting-Dis-vs-Gen-Classifiers. △ Less

Submitted 29 May, 2023; v1 submitted 5 February, 2023; originally announced February 2023.

Comments: Accepted by ICML 2023, 58 pages

arXiv:2301.04857 [pdf, other]

Neural Spline Search for Quantile Probabilistic Modeling

Authors: Ruoxi Sun, Chun-Liang Li, Sercan O. Arik, Michael W. Dusenberry, Chen-Yu Lee, Tomas Pfister

Abstract: Accurate estimation of output quantiles is crucial in many use cases, where it is desired to model the range of possibility. Modeling target distribution at arbitrary quantile levels and at arbitrary input attribute levels are important to offer a comprehensive picture of the data, and requires the quantile function to be expressive enough. The quantile function describing the target distribution… ▽ More Accurate estimation of output quantiles is crucial in many use cases, where it is desired to model the range of possibility. Modeling target distribution at arbitrary quantile levels and at arbitrary input attribute levels are important to offer a comprehensive picture of the data, and requires the quantile function to be expressive enough. The quantile function describing the target distribution using quantile levels is critical for quantile regression. Although various parametric forms for the distributions (that the quantile function specifies) can be adopted, an everlasting problem is selecting the most appropriate one that can properly approximate the data distributions. In this paper, we propose a non-parametric and data-driven approach, Neural Spline Search (NSS), to represent the observed data distribution without parametric assumptions. NSS is flexible and expressive for modeling data distributions by transforming the inputs with a series of monotonic spline regressions guided by symbolic operators. We demonstrate that NSS outperforms previous methods on synthetic, real-world regression and time-series forecasting tasks. △ Less

Submitted 12 January, 2023; originally announced January 2023.

arXiv:2212.09126 [pdf, other]

Pigeonhole Stochastic Gradient Langevin Dynamics for Large Crossed Mixed Effects Models

Authors: Xinyu Zhang, Cheng Li

Abstract: Large crossed mixed effects models with imbalanced structures and missing data pose major computational challenges for standard Bayesian posterior sampling algorithms, as the computational complexity is usually superlinear in the number of observations. We propose two efficient subset-based stochastic gradient MCMC algorithms for such crossed mixed effects model, which facilitate scalable inferenc… ▽ More Large crossed mixed effects models with imbalanced structures and missing data pose major computational challenges for standard Bayesian posterior sampling algorithms, as the computational complexity is usually superlinear in the number of observations. We propose two efficient subset-based stochastic gradient MCMC algorithms for such crossed mixed effects model, which facilitate scalable inference on both the variance components and regression coefficients. The first algorithm is developed for balanced design without missing observations, where we leverage the closed-form expression of precision matrix for the full data matrix. The second algorithm, which we call the pigeonhole stochastic gradient Langevin dynamics (PSGLD), is developed for both balanced and unbalanced designs with potentially a large proportion of missing observations. Our PSGLD algorithm imputes the latent crossed random effects by running short Markov chains and then samples the model parameters of variance components and regression coefficients at each MCMC iteration. We provide theoretical guarantee by showing the convergence of the output distribution from the proposed algorithms to the target non-log-concave posterior distribution. A variety of numerical experiments based on both synthetic and real data demonstrate that the proposed algorithms can significantly reduce the computational cost of the standard MCMC algorithms and better balance the approximation accuracy and computational efficiency. △ Less

Submitted 18 December, 2022; originally announced December 2022.

arXiv:2212.03122 [pdf, ps, other]

Robust convex biclustering with a tuning-free method

Authors: Yifan Chen, Chunyin Lei, Chuanquan Li, Haiqiang Ma, Ningyuan Hu

Abstract: Biclustering is widely used in different kinds of fields including gene information analysis, text mining, and recommendation system by effectively discovering the local correlation between samples and features. However, many biclustering algorithms will collapse when facing heavy-tailed data. In this paper, we propose a robust version of convex biclustering algorithm with Huber loss. Yet, the new… ▽ More Biclustering is widely used in different kinds of fields including gene information analysis, text mining, and recommendation system by effectively discovering the local correlation between samples and features. However, many biclustering algorithms will collapse when facing heavy-tailed data. In this paper, we propose a robust version of convex biclustering algorithm with Huber loss. Yet, the newly introduced robustification parameter brings an extra burden to selecting the optimal parameters. Therefore, we propose a tuning-free method for automatically selecting the optimal robustification parameter with high efficiency. The simulation study demonstrates the more fabulous performance of our proposed method than traditional biclustering methods when encountering heavy-tailed noise. A real-life biomedical application is also presented. The R package RcvxBiclustr is available at https://github.com/YifanChen3/RcvxBiclustr. △ Less

Submitted 6 October, 2023; v1 submitted 6 December, 2022; originally announced December 2022.

Comments: 17 pages, 4 figures

arXiv:2211.14752 [pdf, other]

Differentiable Meta Multigraph Search with Partial Message Propagation on Heterogeneous Information Networks

Authors: Chao Li, Hao Xu, Kun He

Abstract: Heterogeneous information networks (HINs) are widely employed for describing real-world data with intricate entities and relationships. To automatically utilize their semantic information, graph neural architecture search has recently been developed on various tasks of HINs. Existing works, on the other hand, show weaknesses in instability and inflexibility. To address these issues, we propose a n… ▽ More Heterogeneous information networks (HINs) are widely employed for describing real-world data with intricate entities and relationships. To automatically utilize their semantic information, graph neural architecture search has recently been developed on various tasks of HINs. Existing works, on the other hand, show weaknesses in instability and inflexibility. To address these issues, we propose a novel method called Partial Message Meta Multigraph search (PMMM) to automatically optimize the neural architecture design on HINs. Specifically, to learn how graph neural networks (GNNs) propagate messages along various types of edges, PMMM adopts an efficient differentiable framework to search for a meaningful meta multigraph, which can capture more flexible and complex semantic relations than a meta graph. The differentiable search typically suffers from performance instability, so we further propose a stable algorithm called partial message search to ensure that the searched meta multigraph consistently surpasses the manually designed meta-structures, i.e., meta-paths. Extensive experiments on six benchmark datasets over two representative tasks, including node classification and recommendation, demonstrate the effectiveness of the proposed method. Our approach outperforms the state-of-the-art heterogeneous GNNs, finds out meaningful meta multigraphs, and is significantly more stable. △ Less

Submitted 27 November, 2022; originally announced November 2022.

Comments: 12 pages, 7 figures, 8 tables, accepted by AAAI 2023 conference

arXiv:2211.14692 [pdf, other]

Radial Neighbors for Provably Accurate Scalable Approximations of Gaussian Processes

Authors: Yichen Zhu, Michele Peruzzi, Cheng Li, David B. Dunson

Abstract: In geostatistical problems with massive sample size, Gaussian processes can be approximated using sparse directed acyclic graphs to achieve scalable $O(n)$ computational complexity. In these models, data at each location are typically assumed conditionally dependent on a small set of parents which usually include a subset of the nearest neighbors. These methodologies often exhibit excellent empiri… ▽ More In geostatistical problems with massive sample size, Gaussian processes can be approximated using sparse directed acyclic graphs to achieve scalable $O(n)$ computational complexity. In these models, data at each location are typically assumed conditionally dependent on a small set of parents which usually include a subset of the nearest neighbors. These methodologies often exhibit excellent empirical performance, but the lack of theoretical validation leads to unclear guidance in specifying the underlying graphical model and sensitivity to graph choice. We address these issues by introducing radial neighbors Gaussian processes (RadGP), a class of Gaussian processes based on directed acyclic graphs in which directed edges connect every location to all of its neighbors within a predetermined radius. We prove that any radial neighbors Gaussian process can accurately approximate the corresponding unrestricted Gaussian process in Wasserstein-2 distance, with an error rate determined by the approximation radius, the spatial covariance function, and the spatial dispersion of samples. We offer further empirical validation of our approach via applications on simulated and real world data showing excellent performance in both prior and posterior approximations to the original Gaussian process. △ Less

Submitted 20 June, 2024; v1 submitted 26 November, 2022; originally announced November 2022.

Showing 1–50 of 326 results for author: Li, C