Search | arXiv e-print repository

Distributionally Robust Safe Sample Screening

Authors: Hiroyuki Hanada, Aoyama Tatsuya, Akahane Satoshi, Tomonari Tanaka, Yoshito Okura, Yu Inatsu, Noriaki Hashimoto, Shion Takeno, Taro Murayama, Hanju Lee, Shinya Kojima, Ichiro Takeuchi

Abstract: In this study, we propose a machine learning method called Distributionally Robust Safe Sample Screening (DRSSS). DRSSS aims to identify unnecessary training samples, even when the distribution of the training samples changes in the future. To achieve this, we effectively combine the distributionally robust (DR) paradigm, which aims to enhance model robustness against variations in data distribution, with the safe sample screening (SSS), which identifies unnecessary training samples prior to model training. Since we need to consider an infinite number of scenarios regarding changes in the distribution, we applied SSS because it does not require model training after the change of the distribution. In this paper, we employed the covariate shift framework to represent the distribution of training samples and reformulated the DR covariate-shift problem as a weighted empirical risk minimization problem, where the weights are subject to uncertainty within a predetermined range. By extending the existing SSS technique to accommodate this weight uncertainty, the DRSSS method is capable of reliably identifying unnecessary samples under any future distribution within a specified range. We provide a theoretical guarantee for the DRSSS method and validate its performance through numerical experiments on both synthetic and real-world datasets.

Submitted 9 June, 2024; originally announced June 2024.

arXiv:2406.02847 [pdf, other]

Exact Conversion of In-Context Learning to Model Weights in Linearized-Attention Transformers

Authors: Brian K Chen, Tianyang Hu, Hui **, Hwee Kuan Lee, Kenji Kawaguchi

Abstract: In-Context Learning (ICL) has been a powerful emergent property of large language models that has attracted increasing attention in recent years. In contrast to regular gradient-based learning, ICL is highly interpretable and does not require parameter updates. In this paper, we show that, for linearized transformer networks, ICL can be made explicit and permanent through the inclusion of bias terms. We mathematically demonstrate the equivalence between a model with ICL demonstration prompts and the same model with the additional bias terms. Our algorithm (ICLCA) allows for exact conversion in an inexpensive manner. Existing methods are not exact and require expensive parameter updates. We demonstrate the efficacy of our approach through experiments that show the exact incorporation of ICL tokens into a linear transformer. We further suggest how our method can be adapted to achieve cheap approximate conversion of ICL tokens, even in regular transformer networks that are not linearized. Our experiments on GPT-2 show that, even though the conversion is only approximate, the model still gains valuable context from the included bias terms.

Submitted 6 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

Comments: Accepted to ICML 2024
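
The equivalence described in this abstract can be illustrated on a toy layer. The sketch below is not the authors' ICLCA algorithm; it assumes a single unnormalized causal linear-attention layer with an identity feature map, where the hypothetical helper `linear_attention` and the `state_bias` argument are names introduced here for illustration. Under those assumptions, the prompt tokens contribute only an additive key-value matrix, so folding that matrix into a bias reproduces the outputs obtained with the prompt present.

```python
import numpy as np

def linear_attention(queries, keys, values, state_bias=None):
    """Unnormalized causal linear attention with an identity feature map.

    out_t = q_t @ (state_bias + sum_{i <= t} outer(k_i, v_i))
    """
    d_k, d_v = keys.shape[1], values.shape[1]
    # The running key-value state starts at zero, or at the supplied bias.
    state = np.zeros((d_k, d_v)) if state_bias is None else state_bias.copy()
    outputs = []
    for q, k, v in zip(queries, keys, values):
        state = state + np.outer(k, v)
        outputs.append(q @ state)
    return np.array(outputs)

rng = np.random.default_rng(0)
d_k, d_v, n_ctx, n_query = 4, 3, 5, 2

# Hypothetical projections of the in-context demonstration tokens.
Q_ctx = rng.normal(size=(n_ctx, d_k))
K_ctx = rng.normal(size=(n_ctx, d_k))
V_ctx = rng.normal(size=(n_ctx, d_v))

# Hypothetical projections of the actual query tokens.
Q = rng.normal(size=(n_query, d_k))
K = rng.normal(size=(n_query, d_k))
V = rng.normal(size=(n_query, d_v))

# (a) Run with the demonstration prompt prepended; keep only query outputs.
out_with_prompt = linear_attention(
    np.vstack([Q_ctx, Q]), np.vstack([K_ctx, K]), np.vstack([V_ctx, V])
)[n_ctx:]

# (b) Drop the prompt and fold its contribution into a bias on the state.
context_bias = K_ctx.T @ V_ctx  # equals the sum of outer(k_i, v_i) over the prompt
out_with_bias = linear_attention(Q, K, V, state_bias=context_bias)

max_gap = np.abs(out_with_prompt - out_with_bias).max()
print("largest absolute difference:", max_gap)
```

With softmax attention the prompt's contribution is no longer a fixed additive matrix, which is consistent with the abstract's remark that conversion is only approximate for transformers that are not linearized.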

arXiv:2406.00823 [pdf, other]

Lasso Bandit with Compatibility Condition on Optimal Arm

Authors: Harin Lee, Taehyun Hwang, Min-hwan Oh

Abstract: We consider a stochastic sparse linear bandit problem where only a sparse subset of context features affects the expected reward function, i.e., the unknown reward parameter has sparse structure. In the existing Lasso bandit literature, the compatibility conditions together with additional diversity conditions on the context features are imposed to achieve regret bounds that only depend logarithmically on the ambient dimension $d$. In this paper, we demonstrate that even without the additional diversity assumptions, the compatibility condition only on the optimal arm is sufficient to derive a regret bound that depends logarithmically on $d$, and our assumption is strictly weaker than those used in the lasso bandit literature under the single parameter setting. We propose an algorithm that adapts the forced-sampling technique and prove that the proposed algorithm achieves $O(\text{poly}\log dT)$ regret under the margin condition. To our knowledge, the proposed algorithm requires the weakest assumptions among Lasso bandit algorithms under a single parameter setting that achieve $O(\text{poly}\log dT)$ regret. Through the numerical experiments, we confirm the superior performance of our proposed algorithm.

Submitted 2 June, 2024; originally announced June 2024.

arXiv:2405.19553 [pdf, ps, other]

Convergence Bounds for Sequential Monte Carlo on Multimodal Distributions using Soft Decomposition

Authors: Holden Lee, Matheau Santana-Gijzen

Abstract: We prove bounds on the variance of a function $f$ under the empirical measure of the samples obtained by the Sequential Monte Carlo (SMC) algorithm, with time complexity depending on local rather than global Markov chain mixing dynamics. SMC is a Markov Chain Monte Carlo (MCMC) method, which starts by drawing $N$ particles from a known distribution, and then, through a sequence of distributions, re-weights and re-samples the particles, at each instance applying a Markov chain for smoothing. In principle, SMC tries to alleviate problems from multi-modality. However, most theoretical guarantees for SMC are obtained by assuming global mixing time bounds, which are only efficient in the uni-modal setting. We show that bounds can be obtained in the truly multi-modal setting, with mixing times that depend only on local MCMC dynamics.

Submitted 29 May, 2024; originally announced May 2024.
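
The re-weight, re-sample, and Markov-chain steps named in this abstract can be sketched on a one-dimensional bimodal target. This is a generic tempered SMC sampler, not the construction analyzed in the paper; the target mixture, the initial distribution N(0, 3^2), the linear temperature schedule, and the random-walk Metropolis move are all assumptions chosen for illustration.

```python
import numpy as np

def log_target(x):
    # Unnormalized log-density of an equal mixture of N(-2, 0.5^2) and N(2, 0.5^2).
    return np.logaddexp(-0.5 * ((x + 2.0) / 0.5) ** 2,
                        -0.5 * ((x - 2.0) / 0.5) ** 2)

def log_initial(x):
    # Unnormalized log-density of the known starting distribution N(0, 3^2).
    return -0.5 * (x / 3.0) ** 2

def smc_sample(n_particles=2000, n_steps=10, seed=0):
    rng = np.random.default_rng(seed)
    # Step 0: draw N particles from the known distribution.
    particles = rng.normal(0.0, 3.0, size=n_particles)
    betas = np.linspace(0.0, 1.0, n_steps + 1)

    for beta_prev, beta in zip(betas[:-1], betas[1:]):
        # Intermediate distribution: pi_beta proportional to initial^(1-beta) * target^beta.
        def log_pi(x, beta=beta):
            return (1.0 - beta) * log_initial(x) + beta * log_target(x)

        # Re-weight: incremental importance weights pi_beta / pi_beta_prev.
        log_w = (beta - beta_prev) * (log_target(particles) - log_initial(particles))
        weights = np.exp(log_w - log_w.max())
        weights /= weights.sum()

        # Re-sample: multinomial resampling according to the weights.
        idx = rng.choice(n_particles, size=n_particles, p=weights)
        particles = particles[idx]

        # Move: one random-walk Metropolis step that leaves pi_beta invariant.
        proposal = particles + rng.normal(0.0, 0.5, size=n_particles)
        log_accept = log_pi(proposal) - log_pi(particles)
        accept = np.log(rng.uniform(size=n_particles)) < log_accept
        particles = np.where(accept, proposal, particles)

    return particles

particles = smc_sample()
print("fraction of particles in the right-hand mode:", (particles > 0).mean())
```

The Metropolis step here only needs to mix within each mode, because the resampling step is what allocates particles across modes. That division of labor is the intuition behind bounds that depend on local rather than global mixing.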

arXiv:2405.15950 [pdf, ps, other]

A Systematic Bias of Machine Learning Regression Models and Its Correction: an Application to Imaging-based Brain Age Prediction

Authors: Hwiyoung Lee, Shuo Chen

Abstract: Machine learning models for continuous outcomes often yield systematically biased predictions, particularly for values that largely deviate from the mean. Specifically, predictions for large-valued outcomes tend to be negatively biased, while those for small-valued outcomes are positively biased. We refer to this linear central tendency warped bias as the "systematic bias of machine learning regression". In this paper, we first demonstrate that this issue persists across various machine learning models, and then delve into its theoretical underpinnings. We propose a general constrained optimization approach designed to correct this bias and develop a computationally efficient algorithm to implement our method. Our simulation results indicate that our correction method effectively eliminates the bias from the predicted outcomes. We apply the proposed approach to the prediction of brain age using neuroimaging data. In comparison to competing machine learning models, our method effectively addresses the longstanding issue of "systematic bias of machine learning regression" in neuroimaging-based brain age calculation, yielding unbiased predictions of brain age.

Submitted 24 May, 2024; originally announced May 2024.
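
The bias pattern described in this abstract can be reproduced with a small simulation. The sketch below only demonstrates the phenomenon for ordinary least squares on synthetic data; it does not implement the authors' constrained-optimization correction, and the data-generating model (slope 2, noise standard deviation 2) is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic data: a linear signal plus noise of comparable magnitude.
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=2.0, size=n)

# Fit ordinary least squares with an intercept and predict in-sample.
slope, intercept = np.polyfit(x, y, deg=1)
y_pred = slope * x + intercept

# Prediction error as a function of the true outcome.
error = y_pred - y
bias_slope = np.polyfit(y, error, deg=1)[0]

# Mean error in the top and bottom deciles of the true outcome.
high = y > np.quantile(y, 0.9)
low = y < np.quantile(y, 0.1)
mean_error_high = error[high].mean()
mean_error_low = error[low].mean()

print("slope of error against true outcome:", bias_slope)
print("mean error, top decile of y:", mean_error_high)
print("mean error, bottom decile of y:", mean_error_low)
```

For least squares with an intercept, the in-sample covariance between the error and the outcome equals minus the residual variance, so the fitted slope of error against outcome is negative whenever the fit is imperfect. Large outcomes are under-predicted and small outcomes are over-predicted, matching the direction stated in the abstract.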

arXiv:2405.10925 [pdf]

High-dimensional multiple imputation (HDMI) for partially observed confounders including natural language processing-derived auxiliary covariates

Authors: Janick Weberpals, Pamela A. Shaw, Kueiyu Joshua Lin, Richard Wyss, Joseph M Plasek, Li Zhou, Kerry Ngan, Thomas DeRamus, Sudha R. Raman, Bradley G. Hammill, Hana Lee, Sengwee Toh, John G. Connolly, Kimberly J. Dandreo, Fang Tian, Wei Liu, Jie Li, José J. Hernández-Muñoz, Sebastian Schneeweiss, Rishi J. Desai

Abstract: Multiple imputation (MI) models can be improved by including auxiliary covariates (AC), but their performance in high-dimensional data is not well understood. We aimed to develop and compare high-dimensional MI (HDMI) approaches using structured and natural language processing (NLP)-derived AC in studies with partially observed confounders. We conducted a plasmode simulation study using data from opioid vs. non-steroidal anti-inflammatory drug (NSAID) initiators (X) with observed serum creatinine labs (Z2) and time-to-acute kidney injury as outcome. We simulated 100 cohorts with a null treatment effect, including X, Z2, atrial fibrillation (U), and 13 other investigator-derived confounders (Z1) in the outcome generation. We then imposed missingness (MZ2) on 50% of Z2 measurements as a function of Z2 and U and created different HDMI candidate AC using structured and NLP-derived features. We mimicked scenarios where U was unobserved by omitting it from all AC candidate sets. Using LASSO, we data-adaptively selected HDMI covariates associated with Z2 and MZ2 for MI, and with U to include in propensity score models. The treatment effect was estimated following propensity score matching in MI datasets and we benchmarked HDMI approaches against a baseline imputation and complete case analysis with Z1 only. HDMI using claims data showed the lowest bias (0.072). Combining claims and sentence embeddings led to an improvement in the efficiency displaying the lowest root-mean-squared-error (0.173) and coverage (94%). NLP-derived AC alone did not perform better than baseline MI. HDMI approaches may decrease bias in studies with partially observed confounders where missingness depends on unobserved factors.

Submitted 17 May, 2024; originally announced May 2024.

arXiv:2404.18869 [pdf, ps, other]

Learning Mixtures of Gaussians Using Diffusion Models

Authors: Khashayar Gatmiry, Jonathan Kelner, Holden Lee

Abstract: We give a new algorithm for learning mixtures of $k$ Gaussians (with identity covariance in $\mathbb{R}^n$) to TV error $\varepsilon$, with quasi-polynomial ($O(n^{\text{poly log}\left(\frac{n+k}{\varepsilon}\right)})$) time and sample complexity, under a minimum weight assumption. Unlike previous approaches, most of which are algebraic in nature, our approach is analytic and relies on the framework of diffusion models. Diffusion models are a modern paradigm for generative modeling, which typically rely on learning the score function (gradient log-pdf) along a process transforming a pure noise distribution, in our case a Gaussian, to the data distribution. Despite their dazzling performance in tasks such as image generation, there are few end-to-end theoretical guarantees that they can efficiently learn nontrivial families of distributions; we give some of the first such guarantees. We proceed by deriving higher-order Gaussian noise sensitivity bounds for the score functions for a Gaussian mixture to show that they can be inductively learned using piecewise polynomial regression (up to poly-logarithmic degree), and combine this with known convergence results for diffusion models. Our results extend to continuous mixtures of Gaussians where the mixing distribution is supported on a union of $k$ balls of constant radius. In particular, this applies to the case of Gaussian convolutions of distributions on low-dimensional manifolds, or more generally sets with small covering number.

Submitted 29 April, 2024; originally announced April 2024.

arXiv:2404.17563 [pdf, other]

An exactly solvable model for emergence and scaling laws

Authors: Yoonsoo Nam, Nayara Fonseca, Seok Hyeong Lee, Ard Louis

Abstract: Deep learning models can exhibit what appears to be a sudden ability to solve a new problem as training time ($T$), training data ($D$), or model size ($N$) increases, a phenomenon known as emergence. In this paper, we present a framework where each new ability (a skill) is represented as a basis function. We solve a simple multi-linear model in this skill-basis, finding analytic expressions for the emergence of new skills, as well as for scaling laws of the loss with training time, data size, model size, and optimal compute ($C$). We compare our detailed calculations to direct simulations of a two-layer neural network trained on multitask sparse parity, where the tasks in the dataset are distributed according to a power-law. Our simple model captures, using a single fit parameter, the sigmoidal emergence of multiple new skills as training time, data size or model size increases in the neural network.

Submitted 26 April, 2024; originally announced April 2024.

arXiv:2404.10884 [pdf, other]

Modeling Interconnected Modules in Multivariate Outcomes: Evaluating the Impact of Alcohol Intake on Plasma Metabolomics

Authors: Yifan Yang, Chixiang Chen, Hwiyoung Lee, Ming Wang, Shuo Chen

Abstract: Alcohol consumption has been shown to influence cardiovascular mechanisms in humans, leading to observable alterations in the plasma metabolomic profile. Regression models are commonly employed to investigate these effects, treating metabolomics features as the outcomes and alcohol intake as the exposure. Given the latent dependence structure among the numerous metabolomic features (e.g., co-expression networks with interconnected modules), modeling this structure is crucial for accurately identifying metabolomic features associated with alcohol intake. However, integrating dependence structures into regression models remains difficult in both estimation and inference procedures due to their large or high dimensionality. To bridge this gap, we propose an innovative multivariate regression model that accounts for correlations among outcome features by incorporating an interconnected community structure. Furthermore, we derive closed-form and likelihood-based estimators, accompanied by explicit exact and explicit asymptotic covariance matrix estimators, respectively. Simulation analysis demonstrates that our approach provides accurate estimation of both dependence and regression coefficients, and enhances sensitivity while maintaining a well-controlled discovery rate, as evidenced through benchmarking against existing regression models. Finally, we apply our approach to assess the impact of alcohol intake on $249$ metabolomic biomarkers measured using nuclear magnetic resonance spectroscopy. The results indicate that alcohol intake can elevate high-density lipoprotein levels by enhancing the transport rate of Apolipoproteins A1.

Submitted 16 April, 2024; originally announced April 2024.

Comments: 25 pages, 5 figures

arXiv:2402.15705 [pdf, other]

A Variational Approach for Modeling High-dimensional Spatial Generalized Linear Mixed Models

Authors: ** Hyung Lee, Ben Seiyon Lee

Abstract: Gaussian and discrete non-Gaussian spatial datasets are prevalent across many fields such as public health, ecology, geosciences, and social sciences. Bayesian spatial generalized linear mixed models (SGLMMs) are a flexible class of models designed for these data, but SGLMMs do not scale well, even to moderately large datasets. State-of-the-art scalable SGLMMs (i.e., basis representations or sparse covariance/precision matrices) require posterior sampling via Markov chain Monte Carlo (MCMC), which can be prohibitive for large datasets. While variational Bayes (VB) methods have been extended to SGLMMs, their focus has primarily been on smaller spatial datasets. In this study, we propose two computationally efficient VB approaches for modeling moderate-sized and massive (millions of locations) Gaussian and discrete non-Gaussian spatial data. Our scalable VB method embeds semi-parametric approximations for the latent spatial random processes and parallel computing offered by modern high-performance computing systems. Our approaches deliver nearly identical inferential and predictive performance compared to 'gold standard' methods but achieve computational speedups of up to 1000x. We demonstrate our approaches through a comparative numerical study as well as applications to two real-world datasets. Our proposed VB methodology enables practitioners to model millions of non-Gaussian spatial observations using a standard laptop within a short timeframe.

Submitted 17 March, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

Comments: 34 Pages for the main paper, 72 pages for the supplemental information, 4 tables, 5 figures

arXiv:2402.06992 [pdf, other]

A Rational Analysis of the Speech-to-Song Illusion

Authors: Raja Marjieh, Pol van Rijn, Ilia Sucholutsky, Harin Lee, Thomas L. Griffiths, Nori Jacoby

Abstract: The speech-to-song illusion is a robust psychological phenomenon whereby a spoken sentence sounds increasingly more musical as it is repeated. Despite decades of research, a complete formal account of this transformation is still lacking, and some of its nuanced characteristics, namely, that certain phrases appear to transform while others do not, are not well understood. Here we provide a formal account of this phenomenon, by recasting it as a statistical inference whereby a rational agent attempts to decide whether a sequence of utterances is more likely to have been produced in a song or speech. Using this approach and analyzing song and speech corpora, we further introduce a novel prose-to-lyrics illusion that is purely text-based. In this illusion, simply duplicating written sentences makes them appear more like song lyrics. We provide robust evidence for this new illusion in both human participants and large language models.

Submitted 10 February, 2024; originally announced February 2024.

Comments: 7 pages, 5 figures

arXiv:2402.02128 [pdf, other]

Adaptive Accelerated Failure Time modeling with a Semiparametric Skewed Error Distribution

Authors: Sangkon Oh, Hyunjae Lee, Sangwook Kang, Byungtae Seo

Abstract: The accelerated failure time (AFT) model is widely used to analyze relationships between variables in the presence of censored observations. However, this model relies on some assumptions such as the error distribution, which can lead to biased or inefficient estimates if these assumptions are violated. In order to overcome this challenge, we propose a novel approach that incorporates a semiparametric skew-normal scale mixture distribution for the error term in the AFT model. By allowing for more flexibility and robustness, this approach reduces the risk of misspecification and improves the accuracy of parameter estimation. We investigate the identifiability and consistency of the proposed model and develop a practical estimation algorithm. To evaluate the performance of our approach, we conduct extensive simulation studies and real data analyses. The results demonstrate the effectiveness of our method in providing robust and accurate estimates in various scenarios.

Submitted 3 February, 2024; originally announced February 2024.

arXiv:2401.08175 [pdf, other]

Bayesian Kriging Approaches for Spatial Functional Data

Authors: Heesang Lee, Dagun Oh, Sunhwa Choi, Jaewoo Park

Abstract: Functional kriging approaches have been developed to predict the curves at unobserved spatial locations. However, most existing approaches are based on variogram fittings rather than constructing hierarchical statistical models. Therefore, it is challenging to analyze the relationships between functional variables, and uncertainty quantification of the model is not trivial. In this manuscript, we propose a Bayesian framework for spatial function-on-function regression. However, inference for the proposed model has computational and inferential challenges because the model needs to account for within and between-curve dependencies. Furthermore, high-dimensional and spatially correlated parameters can lead to the slow mixing of Markov chain Monte Carlo algorithms. To address these issues, we first utilize a basis transformation approach to simplify the covariance and apply projection methods for dimension reduction. We also develop a simultaneous band score for the proposed model to detect the significant region in the regression function. We apply the methods to simulated and real datasets, including data on particulate matter in Japan and mobility data in South Korea. The proposed method is computationally efficient and provides accurate estimations and predictions.

Submitted 16 January, 2024; originally announced January 2024.

arXiv:2401.04832 [pdf, other]

Group lasso priors for Bayesian accelerated failure time models with left-truncated and interval-censored data

Authors: Harrison T. Reeder, Sebastien Haneuse, Kyu Ha Lee

Abstract: An important task in health research is to characterize time-to-event outcomes such as disease onset or mortality in terms of a potentially high-dimensional set of risk factors. For example, prospective cohort studies of Alzheimer's disease typically enroll older adults for observation over several decades to assess the long-term impact of genetic and other factors on cognitive decline and mortality. The accelerated failure time model is particularly well-suited to such studies, structuring covariate effects as 'horizontal' changes to the survival quantiles that conceptually reflect shifts in the outcome distribution due to lifelong exposures. However, this modeling task is complicated by the enrollment of adults at differing ages, and intermittent followup visits leading to interval censored outcome information. Moreover, genetic and clinical risk factors are not only high-dimensional, but characterized by underlying grouping structure, such as by function or gene location. Such grouped high-dimensional covariates require shrinkage methods that directly acknowledge this structure to facilitate variable selection and estimation. In this paper, we address these considerations directly by proposing a Bayesian accelerated failure time model with a group-structured lasso penalty, designed for left-truncated and interval-censored time-to-event data. We develop a custom Markov chain Monte Carlo sampler for efficient estimation, and investigate the impact of various methods of penalty tuning and thresholding for variable selection. We present a simulation study examining the performance of this method relative to models with an ordinary lasso penalty, and apply the proposed method to identify groups of predictive genetic and clinical risk factors for Alzheimer's disease in the Religious Orders Study and Memory and Aging Project (ROSMAP) prospective cohort studies of AD and dementia.

Submitted 11 January, 2024; v1 submitted 9 January, 2024; originally announced January 2024.

arXiv:2401.00104 [pdf, other]

Causal State Distillation for Explainable Reinforcement Learning

Authors: Wenhao Lu, Xufeng Zhao, Thilo Fryen, Jae Hee Lee, Mengdi Li, Sven Magg, Stefan Wermter

Abstract: Reinforcement learning (RL) is a powerful technique for training intelligent agents, but understanding why these agents make specific decisions can be quite challenging. This lack of transparency in RL models has been a long-standing problem, making it difficult for users to grasp the reasons behind an agent's behaviour. Various approaches have been explored to address this problem, with one promising avenue being reward decomposition (RD). RD is appealing as it sidesteps some of the concerns associated with other methods that attempt to rationalize an agent's behaviour in a post-hoc manner. RD works by exposing various facets of the rewards that contribute to the agent's objectives during training. However, RD alone has limitations as it primarily offers insights based on sub-rewards and does not delve into the intricate cause-and-effect relationships that occur within an RL agent's neural model. In this paper, we present an extension of RD that goes beyond sub-rewards to provide more informative explanations. Our approach is centred on a causal learning framework that leverages information-theoretic measures for explanation objectives that encourage three crucial properties of causal factors: causal sufficiency, sparseness, and orthogonality. These properties help us distill the cause-and-effect relationships between the agent's states and actions or rewards, allowing for a deeper understanding of its decision-making processes. Our framework is designed to generate local explanations and can be applied to a wide range of RL tasks with multiple reward channels. Through a series of experiments, we demonstrate that our approach offers more meaningful and insightful explanations for the agent's action selections.

Submitted 1 April, 2024; v1 submitted 29 December, 2023; originally announced January 2024.

Comments: https://lukaswill.github.io/; Accepted as oral by CLeaR 2024

arXiv:2312.11769 [pdf, other]

Clustering Mixtures of Bounded Covariance Distributions Under Optimal Separation

Authors: Ilias Diakonikolas, Daniel M. Kane, Jasper C. H. Lee, Thanasis Pittas

Abstract: We study the clustering problem for mixtures of bounded covariance distributions, under a fine-grained separation assumption. Specifically, given samples from a $k$-component mixture distribution $D = \sum_{i =1}^k w_i P_i$, where each $w_i \ge α$ for some known parameter $α$, and each $P_i$ has unknown covariance $Σ_i \preceq σ^2_i \cdot I_d$ for some unknown $σ_i$, the goal is to cluster the samples assuming a pairwise mean separation in the order of $(σ_i+σ_j)/\sqrtα$ between every pair of components $P_i$ and $P_j$. Our contributions are as follows: For the special case of nearly uniform mixtures, we give the first poly-time algorithm for this clustering task. Prior work either required separation scaling with the maximum cluster standard deviation (i.e. $\max_i σ_i$) [DKK+22b] or required both additional structural assumptions and mean separation scaling as a large degree polynomial in $1/α$ [BKK22]. For general-weight mixtures, we point out that accurate clustering is information-theoretically impossible under our fine-grained mean separation assumptions. We introduce the notion of a clustering refinement -- a list of not-too-small subsets satisfying a similar separation, and which can be merged into a clustering approximating the ground truth -- and show that it is possible to efficiently compute an accurate clustering refinement of the samples. Furthermore, under a variant of the "no large sub-cluster" condition from prior work [BKK22], we show that our algorithm outputs an accurate clustering, not just a refinement, even for general-weight mixtures. As a corollary, we obtain efficient clustering algorithms for mixtures of well-conditioned high-dimensional log-concave distributions. Moreover, our algorithm is robust to $Ω(α)$-fraction of adversarial outliers.

Submitted 18 December, 2023; originally announced December 2023.

arXiv:2312.10675 [pdf, other]

Visualization and Assessment of Copula Symmetry

Authors: Cristian F. Jimenez-Varon, Hao Lee, Marc G. Genton, Ying Sun

Abstract: Visualization and assessment of copula structures are crucial for accurately understanding and modeling the dependencies in multivariate data analysis. In this paper, we introduce an innovative method that employs functional boxplots and rank-based testing procedures to evaluate copula symmetry. This approach is specifically designed to assess key characteristics such as reflection symmetry, radial symmetry, and joint symmetry. We first construct test functions for each specific property and then investigate the asymptotic properties of their empirical estimators. We demonstrate that the functional boxplot of these sample test functions serves as an informative visualization tool of a given copula structure, effectively measuring the departure from zero of the test function. Furthermore, we introduce a nonparametric testing procedure to assess the significance of deviations from symmetry, ensuring the accuracy and reliability of our visualization method. Through extensive simulation studies involving various copula models, we demonstrate the effectiveness of our testing approach. Finally, we apply our visualization and testing techniques to two real-world datasets: a nutritional habits survey with five variables and wind speed data from three locations in Saudi Arabia.

Submitted 17 December, 2023; originally announced December 2023.

arXiv:2312.01133 [pdf, other]

$t^3$-Variational Autoencoder: Learning Heavy-tailed Data with Student's t and Power Divergence

Authors: Juno Kim, Jaehyuk Kwon, Mincheol Cho, Hyunjong Lee, Joong-Ho Won

Abstract: The variational autoencoder (VAE) typically employs a standard normal prior as a regularizer for the probabilistic latent encoder. However, the Gaussian tail often decays too quickly to effectively accommodate the encoded points, failing to preserve crucial structures hidden in the data. In this paper, we explore the use of heavy-tailed models to combat over-regularization. Drawing upon insights from information geometry, we propose $t^3$VAE, a modified VAE framework that incorporates Student's t-distributions for the prior, encoder, and decoder. This results in a joint model distribution of a power form which we argue can better fit real-world datasets. We derive a new objective by reformulating the evidence lower bound as joint optimization of KL divergence between two statistical manifolds and replacing with $γ$-power divergence, a natural alternative for power families. $t^3$VAE demonstrates superior generation of low-density regions when trained on heavy-tailed synthetic data. Furthermore, we show that $t^3$VAE significantly outperforms other models on CelebA and imbalanced CIFAR-100 datasets.

Submitted 3 March, 2024; v1 submitted 2 December, 2023; originally announced December 2023.

Comments: ICLR 2024; 27 pages, 7 figures, 8 tables

arXiv:2311.12784 [pdf, ps, other]

Optimality in Mean Estimation: Beyond Worst-Case, Beyond Sub-Gaussian, and Beyond $1+α$ Moments

Authors: Trung Dang, Jasper C. H. Lee, Maoyuan Song, Paul Valiant

Abstract: There is growing interest in improving our algorithmic understanding of fundamental statistical problems such as mean estimation, driven by the goal of understanding the limits of what we can extract from valuable data. The state of the art results for mean estimation in $\mathbb{R}$ are 1) the optimal sub-Gaussian mean estimator by [LV22], with the tight sub-Gaussian constant for all distributions with finite but unknown variance, and 2) the analysis of the median-of-means algorithm by [BCL13] and a lower bound by [DLLO16], characterizing the big-O optimal errors for distributions for which only a $1+α$ moment exists for $α\in (0,1)$. Both results, however, are optimal only in the worst case. We initiate the fine-grained study of the mean estimation problem: Can algorithms leverage useful features of the input distribution to beat the sub-Gaussian rate, without explicit knowledge of such features? We resolve this question with an unexpectedly nuanced answer: "Yes in limited regimes, but in general no". For any distribution $p$ with a finite mean, we construct a distribution $q$ whose mean is well-separated from $p$'s, yet $p$ and $q$ are not distinguishable with high probability, and $q$ further preserves $p$'s moments up to constants. The main consequence is that no reasonable estimator can asymptotically achieve better than the sub-Gaussian error rate for any distribution, matching the worst-case result of [LV22].
More generally, we introduce a new definitional framework to analyze the fine-grained optimality of algorithms, which we call "neighborhood optimality", interpolating between the unattainably strong "instance optimality" and the trivially weak "admissibility" definitions. Applying the new framework, we show that median-of-means is neighborhood optimal, up to constant factors. It is open to find a neighborhood-optimal estimator without constant factor slackness.

Submitted 21 November, 2023; originally announced November 2023.

Comments: 27 pages, to appear in NeurIPS 2023. Abstract shortened to fit arXiv limit

arXiv:2311.10792 [pdf]

Enhancing Data Efficiency and Feature Identification for Lithium-Ion Battery Lifespan Prediction by Deciphering Interpretation of Temporal Patterns and Cyclic Variability Using Attention-Based Models

Authors: Jaewook Lee, Seongmin Heo, Jay H. Lee

Abstract: Accurately predicting the lifespan of lithium-ion batteries is crucial for optimizing operational strategies and mitigating risks. While numerous studies have aimed at predicting battery lifespan, few have examined the interpretability of their models or how such insights could improve predictions. Addressing this gap, we introduce three innovative models that integrate shallow attention layers into a foundational model from our previous work, which combined elements of recurrent and convolutional neural networks. Utilizing a well-known public dataset, we showcase our methodology's effectiveness. Temporal attention is applied to identify critical timesteps and highlight differences among test cell batches, particularly underscoring the significance of the "rest" phase. Furthermore, by applying cyclic attention via self-attention to context vectors, our approach effectively identifies key cycles, enabling us to strategically decrease the input size for quicker predictions. Employing both single- and multi-head attention mechanisms, we have systematically minimized the required input from 100 to 50 and then to 30 cycles, refining this process based on cyclic attention scores. Our refined model exhibits strong regression capabilities, accurately forecasting the initiation of rapid capacity fade with an average deviation of only 58 cycles by analyzing just the initial 30 cycles of easily accessible input data.

Submitted 11 April, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

arXiv:2310.16136 [pdf, other]

Analyzing Disparity and Temporal Progression of Internet Quality through Crowdsourced Measurements with Bias-Correction

Authors: Hyeongseong Lee, Udit Paul, Arpit Gupta, Elizabeth Belding, Mengyang Gu

Abstract: Crowdsourced speedtest measurements are an important tool for studying internet performance from the end user perspective. Nevertheless, despite the accuracy of individual measurements, simplistic aggregation of these data points is problematic due to their intrinsic sampling bias. In this work, we utilize a dataset of nearly 1 million individual Ookla Speedtest measurements, correlate each datapoint with 2019 Census demographic data, and develop new methods to present a novel analysis to quantify regional sampling bias and the relationship of internet performance to demographic profile. We find that the crowdsourced Ookla Speedtest data points contain significant sampling bias across different census block groups based on a statistical test of homogeneity. We introduce two methods to correct the regional bias by the population of each census block group. Whereas the sampling bias leads to a small discrepancy in the overall cumulative distribution function of internet speed in a city between estimation from original samples and bias-corrected estimation, the discrepancy is much smaller compared to the size of the sampling heterogeneity across regions. Further, we show that the sampling bias is strongly associated with a few demographic variables, such as income, education level, age, and ethnic distribution.
Through regression analysis, we find that regions with higher income, younger populations, and lower representation of Hispanic residents tend to measure faster internet speeds along with substantial collinearity amongst socioeconomic attributes and ethnic composition. Finally, we find that average internet speed increases over time based on both linear and nonlinear analysis from state space models, though the regional sampling bias may result in a small overestimation of the temporal increase of internet speed.

Submitted 7 December, 2023; v1 submitted 24 October, 2023; originally announced October 2023.

arXiv:2310.11654 [pdf, other]

Subject-specific Deep Neural Networks for Count Data with High-cardinality Categorical Features

Authors: Hangbin Lee, Il Do Ha, Changha Hwang, Youngjo Lee

Abstract: There is a growing interest in subject-specific predictions using deep neural networks (DNNs) because real-world data often exhibit correlations, which has been typically overlooked in traditional DNN frameworks. In this paper, we propose a novel hierarchical likelihood learning framework for introducing gamma random effects into the Poisson DNN, so as to improve the prediction performance by capturing both nonlinear effects of input variables and subject-specific cluster effects. The proposed method simultaneously yields maximum likelihood estimators for fixed parameters and best unbiased predictors for random effects by optimizing a single objective function. This approach enables a fast end-to-end algorithm for handling clustered count data, which often involve high-cardinality categorical features. Furthermore, state-of-the-art network architectures can be easily implemented into the proposed h-likelihood framework. As an example, we introduce multi-head attention layer and a sparsemax function, which allows feature selection in high-dimensional settings. To enhance practical performance and learning efficiency, we present an adjustment procedure for prediction of random parameters and a method-of-moments estimator for pretraining of variance component. Various experiential studies and real data analyses confirm the advantages of our proposed methods.

Submitted 17 October, 2023; originally announced October 2023.

arXiv:2310.10614 [pdf, ps, other]

Understanding an Acquisition Function Family for Bayesian Optimization

Authors: Jiajie Kong, Tony Pourmohamad, Herbert K. H. Lee

Abstract: Bayesian optimization (BO) developed as an approach for the efficient optimization of expensive black-box functions without gradient information. A typical BO paper introduces a new approach and compares it to some alternatives on simulated and possibly real examples to show its efficacy. Yet on a different example, this new algorithm might not be as effective as the alternatives. This paper looks at a broader family of approaches to explain the strengths and weaknesses of algorithms in the family, with guidance on what choices might work best on different classes of problems.

Submitted 16 October, 2023; originally announced October 2023.

arXiv:2310.09960 [pdf, other]

Point Mass in the Confidence Distribution: Is it a Drawback or an Advantage?

Authors: Hangbin Lee, Youngjo Lee

Abstract: Stein's (1959) problem highlights the phenomenon called the probability dilution in high dimensional cases, which is known as a fundamental deficiency in probabilistic inference. The satellite conjunction problem also suffers from probability dilution that poor-quality data can lead to a dilution of collision probability. Though various methods have been proposed, such as generalized fiducial distribution and the reference posterior, they could not maintain the coverage probability of confidence intervals (CIs) in both problems. On the other hand, the confidence distribution (CD) has a point mass at zero, which has been interpreted as paradoxical. However, we show that this point mass is an advantage rather than a drawback, because it gives a way to maintain the coverage probability of CIs. More recently, `false confidence theorem' was presented as another deficiency in probabilistic inferences, called the false confidence. It was further claimed that the use of consonant belief can mitigate this deficiency. However, we show that the false confidence theorem cannot be applied to the CD in both Stein's and satellite conjunction problems. It is crucial that a confidence feature, not a consonant one, is the key to overcome the deficiencies in probabilistic inferences. Our findings reveal that the CD outperforms the other existing methods, including the consonant belief, in the context of Stein's and satellite conjunction problems.
Additionally, we demonstrate the ambiguity of coverage probability in an observed CI from the frequentist CI procedure, and show that the CD provides valuable information regarding this ambiguity.

Submitted 15 October, 2023; originally announced October 2023.

arXiv:2310.09955 [pdf, other]

On the Statistical Foundations of H-likelihood for Unobserved Random Variables

Authors: Hangbin Lee, Youngjo Lee

Abstract: The maximum likelihood estimation is widely used for statistical inferences. This paper aims to reformulate Lee and Nelder's (1996) h-likelihood, so that the maximum h-likelihood estimator resembles the maximum likelihood estimator of the classical likelihood. We establish the statistical foundations of the new h-likelihood. This extends classical likelihood theories to embrace broader class of statistical models with random parameters. Maximization of the h-likelihood yields asymptotically optimal estimators for both fixed and random parameters achieving the generalized Cramér-Rao lower bound, while providing computationally efficient fitting algorithms. Furthermore, we explore asymptotic theory when the consistency of either fixed parameter estimation or random parameter prediction is violated. We also study how to obtain maximum h-likelihood estimators when the h-likelihood is not explicitly available.

Submitted 5 December, 2023; v1 submitted 15 October, 2023; originally announced October 2023.

arXiv:2310.03176 [pdf]

Sensitivity analysis for causality in observational studies for regulatory science

Authors: Iván Díaz, Hana Lee, Emre Kıcıman, Mouna Akacha, Dean Follman, Debashis Ghosh

Abstract: Recognizing the importance of real-world data (RWD) for regulatory purposes, the United States (US) Congress passed the 21st Century Cures Act mandating the development of Food and Drug Administration (FDA) guidance on regulatory use of real-world evidence. The Forum on the Integration of Observational and Randomized Data (FIORD) conducted a meeting bringing together various stakeholder groups to build consensus around best practices for the use of RWD to support regulatory science. Our companion paper describes in detail the context and discussion carried out in the meeting, which includes a recommendation to use a causal roadmap for complete pre-specification of study designs using RWD. This article discusses one step of the roadmap: the specification of a procedure for sensitivity analysis, defined as a procedure for testing the robustness of substantive conclusions to violations of assumptions made in the causal roadmap. We include a worked-out example of a sensitivity analysis from a RWD study on the effectiveness of Nifurtimox in treating Chagas disease, as well as an overview of various methods available for sensitivity analysis in causal inference, emphasizing practical considerations on their use for regulatory purposes.

Submitted 17 October, 2023; v1 submitted 4 October, 2023; originally announced October 2023.

arXiv:2310.02423 [pdf, other]

Delta-AI: Local objectives for amortized inference in sparse graphical models

Authors: Jean-Pierre Falet, Hae Beom Lee, Nikolay Malkin, Chen Sun, Dragos Secrieru, Thomas Jiralerspong, Dinghuai Zhang, Guillaume Lajoie, Yoshua Bengio

Abstract: We present a new algorithm for amortized inference in sparse probabilistic graphical models (PGMs), which we call $Δ$-amortized inference ($Δ$-AI). Our approach is based on the observation that when the sampling of variables in a PGM is seen as a sequence of actions taken by an agent, sparsity of the PGM enables local credit assignment in the agent's policy learning objective. This yields a local constraint that can be turned into a local loss in the style of generative flow networks (GFlowNets) that enables off-policy training but avoids the need to instantiate all the random variables for each parameter update, thus speeding up training considerably. The $Δ$-AI objective matches the conditional distribution of a variable given its Markov blanket in a tractable learned sampler, which has the structure of a Bayesian network, with the same conditional distribution under the target PGM. As such, the trained sampler recovers marginals and conditional distributions of interest and enables inference of partial subsets of variables. We illustrate $Δ$-AI's effectiveness for sampling from synthetic PGMs and training latent variable models with sparse factor structure.

Submitted 13 March, 2024; v1 submitted 3 October, 2023; originally announced October 2023.

Comments: ICLR 2024; 19 pages, code: https://github.com/GFNOrg/Delta-AI/

arXiv:2308.02596 [pdf, other]

doi 10.1007/s40042-023-00921-8

Revisiting small-world network models: Exploring technical realizations and the equivalence of the Newman-Watts and Harary models

Authors: Seora Son, Eun Ji Choi, Sang Hoon Lee

Abstract: We address the relatively less known facts on the equivalence and technical realizations surrounding two network models showing the "small-world" property, namely the Newman-Watts and the Harary models. We provide the most accurate (in terms of faithfulness to the original literature) versions of these models to clarify the deviation from them existing in their variants adopted in one of the most popular network analysis packages. The difference in technical realizations of those models could be conceived as minor details, but we discover significantly notable changes caused by the possibly inadvertent modification. For the Harary model, the stochasticity in the original formulation allows a much wider range of the clustering coefficient and the average shortest path length. For the Newman-Watts model, due to the drastically different degree distributions, the clustering coefficient can also be affected, which is verified by our higher-order analytic derivation. During the process, we discover the equivalence of the Newman-Watts (better known in the network science or physics community) and the Harary (better known in the graph theory or mathematics community) models under a specific condition of restricted parity in variables, which would bridge the two relatively independently developed models in different fields.
Our result highlights the importance of each detailed step in constructing network models and the possibility of deeply related models, even if they might initially appear distinct in terms of the time period or the academic disciplines from which they emerged.

Submitted 12 December, 2023; v1 submitted 3 August, 2023; originally announced August 2023.

Comments: 11 pages, 5 figures, 1 table

Journal ref: J. Korean Phys. Soc. 83, 879 (2023)

arXiv:2307.08044 [pdf, other]

Towards Flexible Time-to-event Modeling: Optimizing Neural Networks via Rank Regression

Authors: Hyunjun Lee, Junhyun Lee, Taehwa Choi, Jaewoo Kang, Sangbum Choi

Abstract: Time-to-event analysis, also known as survival analysis, aims to predict the time of occurrence of an event, given a set of features. One of the major challenges in this area is dealing with censored data, which can make learning algorithms more complex. Traditional methods such as Cox's proportional hazards model and the accelerated failure time (AFT) model have been popular in this field, but they often require assumptions such as proportional hazards and linearity. In particular, the AFT models often require pre-specified parametric distributional assumptions. To improve predictive performance and alleviate strict assumptions, there have been many deep learning approaches for hazard-based models in recent years. However, representation learning for AFT has not been widely explored in the neural network literature, despite its simplicity and interpretability in comparison to hazard-focused methods. In this work, we introduce the Deep AFT Rank-regression model for Time-to-event prediction (DART). This model uses an objective function based on Gehan's rank statistic, which is efficient and reliable for representation learning. On top of eliminating the requirement to establish a baseline event time distribution, DART retains the advantages of directly predicting event time in standard AFT models. The proposed method is a semiparametric approach to AFT modeling that does not impose any distributional assumptions on the survival time distribution.
This also eliminates the need for additional hyperparameters or complex model architectures, unlike existing neural network-based AFT models. Through quantitative analysis on various benchmark datasets, we have shown that DART has significant potential for modeling high-throughput censored time-to-event data.

Submitted 22 July, 2023; v1 submitted 16 July, 2023; originally announced July 2023.

Comments: Accepted at ECAI 2023

arXiv:2307.07442 [pdf]

Sensitivity Analysis for Unmeasured Confounding in Medical Product Development and Evaluation Using Real World Evidence

Authors: Peng Ding, Yixin Fang, Doug Faries, Susan Gruber, Hana Lee, Joo-Yeon Lee, Pallavi Mishra-Kalyani, Mingyang Shan, Mark van der Laan, Shu Yang, Xiang Zhang

Abstract: The American Statistical Association Biopharmaceutical Section (ASA BIOP) working group on real-world evidence (RWE) has been making continuous, extended effort towards a goal of supporting and advancing regulatory science with respect to non-interventional, clinical studies intended to use real-world data for evidence generation for the purpose of medical product development and evaluation (i.e., RWE studies). In 2023, the working group published a manuscript delineating challenges and opportunities in constructing estimands for RWE studies following a framework in ICH E9(R1) guidance on estimand and sensitivity analysis. As a follow-up task, we describe the other issue in RWE studies, sensitivity analysis. Focusing on the issue of unmeasured confounding, we review availability and applicability of sensitivity analysis methods for different types of unmeasured confounding. We discuss consideration on the choice and use of sensitivity analysis for RWE studies. Updated version of this article will present how findings from sensitivity analysis could support regulatory decision-making using a real example.

Submitted 14 July, 2023; originally announced July 2023.

Comments: 17 pages, 2 figures

arXiv:2307.06581 [pdf, other]

Deep Neural Networks for Semiparametric Frailty Models via H-likelihood

Authors: Hangbin Lee, IL DO HA, Youngjo Lee

Abstract: For prediction of clustered time-to-event data, we propose a new deep neural network based gamma frailty model (DNN-FM). An advantage of the proposed model is that the joint maximization of the new h-likelihood provides maximum likelihood estimators for fixed parameters and best unbiased predictors for random frailties. Thus, the proposed DNN-FM is trained by using a negative profiled h-likelihood as a loss function, constructed by profiling out the non-parametric baseline hazard. Experimental studies show that the proposed method enhances the prediction performance of the existing methods. A real data analysis shows that the inclusion of subject-specific frailties helps to improve prediction of the DNN based Cox model (DNN-Cox).

Submitted 13 July, 2023; originally announced July 2023.

arXiv:2307.00190 [pdf]

Estimands in Real-World Evidence Studies

Authors: Jie Chen, Daniel Scharfstein, Hongwei Wang, Binbing Yu, Yang Song, Weili He, John Scott, Xiwu Lin, Hana Lee

Abstract: A Real-World Evidence (RWE) Scientific Working Group (SWG) of the American Statistical Association Biopharmaceutical Section (ASA BIOP) has been reviewing statistical considerations for the generation of RWE to support regulatory decision-making. As part of the effort, the working group is addressing estimands in RWE studies. Constructing the right estimand -- the target of estimation -- which reflects the research question and the study objective, is one of the key components in formulating a clinical study. ICH E9(R1) describes statistical principles for constructing estimands in clinical trials with a focus on five attributes -- population, treatment, endpoints, intercurrent events, and population-level summary. However, defining estimands for clinical studies using real-world data (RWD), i.e., RWE studies, requires additional considerations due to, for example, heterogeneity of study population, complexity of treatment regimes, different types and patterns of intercurrent events, and complexities in choosing study endpoints. This paper reviews the essential components of estimands and causal inference framework, discusses considerations in constructing estimands for RWE studies, highlights similarities and differences in traditional clinical trial and RWE study estimands, and provides a roadmap for choosing appropriate estimands for RWE studies.

Submitted 30 June, 2023; originally announced July 2023.

arXiv:2306.16573 [pdf, other]

Finite-Sample Symmetric Mean Estimation with Fisher Information Rate

Authors: Shivam Gupta, Jasper C. H. Lee, Eric Price

Abstract: The mean of an unknown variance-$σ^2$ distribution $f$ can be estimated from $n$ samples with variance $\frac{σ^2}{n}$ and nearly corresponding subgaussian rate. When $f$ is known up to translation, this can be improved asymptotically to $\frac{1}{n\mathcal I}$, where $\mathcal I$ is the Fisher information of the distribution. Such an improvement is not possible for general unknown $f$, but [Stone, 1975] showed that this asymptotic convergence $\textit{is}$ possible if $f$ is $\textit{symmetric}$ about its mean. Stone's bound is asymptotic, however: the $n$ required for convergence depends in an unspecified way on the distribution $f$ and failure probability $δ$. In this paper we give finite-sample guarantees for symmetric mean estimation in terms of Fisher information. For every $f, n, δ$ with $n > \log \frac{1}δ$, we get convergence close to a subgaussian with variance $\frac{1}{n \mathcal I_r}$, where $\mathcal I_r$ is the $r$-$\textit{smoothed}$ Fisher information with smoothing radius $r$ that decays polynomially in $n$. Such a bound essentially matches the finite-sample guarantees in the known-$f$ setting.

Submitted 28 June, 2023; originally announced June 2023.

Comments: COLT 2023

arXiv:2306.03291 [pdf, other]

Switching Autoregressive Low-rank Tensor Models

Authors: Hyun Dong Lee, Andrew Warrington, Joshua I. Glaser, Scott W. Linderman

Abstract: An important problem in time-series analysis is modeling systems with time-varying dynamics. Probabilistic models with joint continuous and discrete latent states offer interpretable, efficient, and experimentally useful descriptions of such data. Commonly used models include autoregressive hidden Markov models (ARHMMs) and switching linear dynamical systems (SLDSs), each with its own advantages and disadvantages. ARHMMs permit exact inference and easy parameter estimation, but are parameter intensive when modeling long dependencies, and hence are prone to overfitting. In contrast, SLDSs can capture long-range dependencies in a parameter efficient way through Markovian latent dynamics, but present an intractable likelihood and a challenging parameter estimation task. In this paper, we propose switching autoregressive low-rank tensor (SALT) models, which retain the advantages of both approaches while ameliorating the weaknesses. SALT parameterizes the tensor of an ARHMM with a low-rank factorization to control the number of parameters and allow longer range dependencies without overfitting. We prove theoretical and discuss practical connections between SALT, linear dynamical systems, and SLDSs. We empirically demonstrate quantitative advantages of SALT models on a range of simulated and real prediction tasks, including behavioral and neural datasets. Furthermore, the learned low-rank tensor provides novel insights into temporal dependencies within each discrete state.

Submitted 6 June, 2023; v1 submitted 5 June, 2023; originally announced June 2023.

arXiv:2306.02283 [pdf, other]

Matrix Completion from General Deterministic Sampling Patterns

Authors: Hanbyul Lee, Rahul Mazumder, Qifan Song, Jean Honorio

Abstract: Most of the existing works on provable guarantees for low-rank matrix completion algorithms rely on some unrealistic assumptions such that matrix entries are sampled randomly or the sampling pattern has a specific structure. In this work, we establish theoretical guarantee for the exact and approximate low-rank matrix completion problems which can be applied to any deterministic sampling schemes. For this, we introduce a graph having observed entries as its edge set, and investigate its graph properties involving the performance of the standard constrained nuclear norm minimization algorithm. We theoretically and experimentally show that the algorithm can be successful as the observation graph is well-connected and has similar node degrees. Our result can be viewed as an extension of the works by Bhojanapalli and Jain [2014] and Burnwal and Vidyasagar [2020], in which the node degrees of the observation graph were assumed to be the same. In particular, our theory significantly improves their results when the underlying matrix is symmetric.

Submitted 4 June, 2023; originally announced June 2023.

arXiv:2306.01993 [pdf, ps, other]

Provable benefits of score matching

Authors: Chirag Pabbaraju, Dhruv Rohatgi, Anish Sevekari, Holden Lee, Ankur Moitra, Andrej Risteski

Abstract: Score matching is an alternative to maximum likelihood (ML) for estimating a probability distribution parametrized up to a constant of proportionality. By fitting the ''score'' of the distribution, it sidesteps the need to compute this constant of proportionality (which is often intractable). While score matching and variants thereof are popular in practice, precise theoretical understanding of the benefits and tradeoffs with maximum likelihood -- both computational and statistical -- are not well understood. In this work, we give the first example of a natural exponential family of distributions such that the score matching loss is computationally efficient to optimize, and has a comparable statistical efficiency to ML, while the ML loss is intractable to optimize using a gradient-based method. The family consists of exponentials of polynomials of fixed degree, and our result can be viewed as a continuous analogue of recent developments in the discrete setting. Precisely, we show: (1) Designing a zeroth-order or first-order oracle for optimizing the maximum likelihood loss is NP-hard. (2) Maximum likelihood has a statistical efficiency polynomial in the ambient dimension and the radius of the parameters of the family. (3) Minimizing the score matching loss is both computationally and statistically efficient, with complexity polynomial in the ambient dimension.

Submitted 2 June, 2023; originally announced June 2023.

Comments: 25 Pages

arXiv:2306.00356 [pdf, other]

Regularizing Towards Soft Equivariance Under Mixed Symmetries

Authors: Hyunsu Kim, Hyungi Lee, Hongseok Yang, Juho Lee

Abstract: Datasets often have their intrinsic symmetries, and particular deep-learning models called equivariant or invariant models have been developed to exploit these symmetries. However, if some or all of these symmetries are only approximate, which frequently happens in practice, these models may be suboptimal due to the architectural restrictions imposed on them. We tackle this issue of approximate symmetries in a setup where symmetries are mixed, i.e., they are symmetries of not single but multiple different types and the degree of approximation varies across these types. Instead of proposing a new architectural restriction as in most of the previous approaches, we present a regularizer-based method for building a model for a dataset with mixed approximate symmetries. The key component of our method is what we call equivariance regularizer for a given type of symmetries, which measures how much a model is equivariant with respect to the symmetries of the type. Our method is trained with these regularizers, one per each symmetry type, and the strength of the regularizers is automatically tuned during training, leading to the discovery of the approximation levels of some candidate symmetry types without explicit supervision. Using synthetic function approximation and motion forecasting tasks, we demonstrate that our method achieves better accuracy than prior approaches while discovering the approximate symmetry levels correctly.

Submitted 1 June, 2023; originally announced June 2023.

Comments: Proceedings of the International Conference on Machine Learning (ICML), 2023

arXiv:2305.11798 [pdf, ps, other]

The probability flow ODE is provably fast

Authors: Sitan Chen, Sinho Chewi, Holden Lee, Yuanzhi Li, Jianfeng Lu, Adil Salim

Abstract: We provide the first polynomial-time convergence guarantees for the probability flow ODE implementation (together with a corrector step) of score-based generative modeling. Our analysis is carried out in the wake of recent results obtaining such guarantees for the SDE-based implementation (i.e., denoising diffusion probabilistic modeling or DDPM), but requires the development of novel techniques for studying deterministic dynamics without contractivity. Through the use of a specially chosen corrector step based on the underdamped Langevin diffusion, we obtain better dimension dependence than prior works on DDPM ($O(\sqrt{d})$ vs. $O(d)$, assuming smoothness of the data distribution), highlighting potential advantages of the ODE framework.

Submitted 19 May, 2023; originally announced May 2023.

Comments: 23 pages, 2 figures

arXiv:2305.06850 [pdf]

A Causal Roadmap for Generating High-Quality Real-World Evidence

Authors: Lauren E Dang, Susan Gruber, Hana Lee, Issa Dahabreh, Elizabeth A Stuart, Brian D Williamson, Richard Wyss, Iván Díaz, Debashis Ghosh, Emre Kıcıman, Demissie Alemayehu, Katherine L Hoffman, Carla Y Vossen, Raymond A Huml, Henrik Ravn, Kajsa Kvist, Richard Pratley, Mei-Chiung Shih, Gene Pennello, David Martin, Salina P Waddy, Charles E Barr, Mouna Akacha, John B Buse, Mark van der Laan, et al. (1 additional author not shown)

Abstract: Increasing emphasis on the use of real-world evidence (RWE) to support clinical policy and regulatory decision-making has led to a proliferation of guidance, advice, and frameworks from regulatory agencies, academia, professional societies, and industry. A broad spectrum of studies use real-world data (RWD) to produce RWE, ranging from randomized controlled trials with outcomes assessed using RWD to fully observational studies. Yet many RWE study proposals lack sufficient detail to evaluate adequacy, and many analyses of RWD suffer from implausible assumptions, other methodological flaws, or inappropriate interpretations. The Causal Roadmap is an explicit, itemized, iterative process that guides investigators to pre-specify analytic study designs; it addresses a wide range of guidance within a single framework. By requiring transparent evaluation of causal assumptions and facilitating objective comparisons of design and analysis choices based on pre-specified criteria, the Roadmap can help investigators to evaluate the quality of evidence that a given study is likely to produce, specify a study to generate high-quality RWE, and communicate effectively with regulatory agencies and other stakeholders. This paper aims to disseminate and extend the Causal Roadmap framework for use by clinical and translational researchers, with companion papers demonstrating application of the Causal Roadmap for specific use cases.

Submitted 11 May, 2023; originally announced May 2023.

Comments: 51 pages, 2 figures, 4 tables

arXiv:2305.00966 [pdf, other]

A Spectral Algorithm for List-Decodable Covariance Estimation in Relative Frobenius Norm

Authors: Ilias Diakonikolas, Daniel M. Kane, Jasper C. H. Lee, Ankit Pensia, Thanasis Pittas

Abstract: We study the problem of list-decodable Gaussian covariance estimation. Given a multiset $T$ of $n$ points in $\mathbb R^d$ such that an unknown $α<1/2$ fraction of points in $T$ are i.i.d. samples from an unknown Gaussian $\mathcal{N}(μ, Σ)$, the goal is to output a list of $O(1/α)$ hypotheses at least one of which is close to $Σ$ in relative Frobenius norm. Our main result is a $\mathrm{poly}(d,1/α)$ sample and time algorithm for this task that guarantees relative Frobenius norm error of $\mathrm{poly}(1/α)$. Importantly, our algorithm relies purely on spectral techniques. As a corollary, we obtain an efficient spectral algorithm for robust partial clustering of Gaussian mixture models (GMMs) -- a key ingredient in the recent work of [BDJ+22] on robustly learning arbitrary GMMs. Combined with the other components of [BDJ+22], our new method yields the first Sum-of-Squares-free algorithm for robustly learning GMMs. At the technical level, we develop a novel multi-filtering method for list-decodable covariance estimation that may be useful in other settings.

Submitted 1 May, 2023; originally announced May 2023.

arXiv:2304.08553 [pdf, other]

A New Representation of Uniform-Block Matrix and Applications

Authors: Yifan Yang, Hwiyoung Lee, Shuo Chen

Abstract: A covariance matrix with a special pattern (e.g., sparsity or block structure) is essential for conducting multivariate analysis on high-dimensional data. Recently, a block covariance or correlation pattern has been observed in various biological and biomedical studies, such as gene expression, proteomics, neuroimaging, exposome, and seed quality, among others. Specifically, this pattern partitions the population covariance matrix into uniform (i.e., equal variances and covariances) blocks. However, the unknown mathematical properties of matrices with this pattern limit the incorporation of this pre-determined covariance information into research. To address this gap, we propose a block Hadamard product representation that utilizes two lower-dimensional "coordinate" matrices and a pre-specific vector. This representation enables the explicit expressions of the square or power, determinant, inverse, eigendecomposition, canonical form, and the other matrix functions of the original larger-dimensional matrix on the basis of these "coordinate" matrices. By utilizing this representation, we construct null distributions of information test statistics for the population mean(s) in both single and multiple sample cases, which are extensions of Hotelling's $T^2$ and $T_0^2$, respectively.

Submitted 17 April, 2023; originally announced April 2023.

arXiv:2304.01303 [pdf, ps, other]

Improved Bound for Mixing Time of Parallel Tempering

Authors: Holden Lee, Zeyu Shen

Abstract: In the field of sampling algorithms, MCMC (Markov Chain Monte Carlo) methods are widely used when direct sampling is not possible. However, multimodality of target distributions often leads to slow convergence and mixing. One common solution is parallel tempering. Though highly effective in practice, theoretical guarantees on its performance are limited. In this paper, we present a new lower bound for parallel tempering on the spectral gap that has a polynomial dependence on all parameters except $\log L$, where $(L + 1)$ is the number of levels. This improves the best existing bound which depends exponentially on the number of modes. Moreover, we complement our result with a hypothetical upper bound on spectral gap that has an exponential dependence on $\log L$, which shows that, in some sense, our bound is tight.

Submitted 3 April, 2023; originally announced April 2023.

arXiv:2303.07053 [pdf, other]

Bandit-supported care planning for older people with complex health and care needs

Authors: Gi-Soo Kim, Young Suh Hong, Tae Hoon Lee, Myunghee Cho Paik, Hongsoo Kim

Abstract: Long-term care service for old people is in great demand in most of the aging societies. The number of nursing home residents is increasing while the number of care providers is limited. Due to the care worker shortage, care to vulnerable older residents cannot be fully tailored to the unique needs and preference of each individual. This may bring negative impacts on health outcomes and quality of life among institutionalized older people. To improve care quality through personalized care planning and delivery with limited care workforce, we propose a new care planning model assisted by artificial intelligence. We apply bandit algorithms which optimize the clinical decision for care planning by adapting to the sequential feedback from the past decisions. We evaluate the proposed model on empirical data acquired from the Systems for Person-centered Elder Care (SPEC) study, an ICT-enhanced care management program.

Submitted 13 March, 2023; originally announced March 2023.

arXiv:2302.02497 [pdf, other]

High-dimensional Location Estimation via Norm Concentration for Subgamma Vectors

Authors: Shivam Gupta, Jasper C. H. Lee, Eric Price

Abstract: In location estimation, we are given $n$ samples from a known distribution $f$ shifted by an unknown translation $λ$, and want to estimate $λ$ as precisely as possible. Asymptotically, the maximum likelihood estimate achieves the Cramér-Rao bound of error $\mathcal N(0, \frac{1}{n\mathcal I})$, where $\mathcal I$ is the Fisher information of $f$. However, the $n$ required for convergence depends on $f$, and may be arbitrarily large. We build on the theory using \emph{smoothed} estimators to bound the error for finite $n$ in terms of $\mathcal I_r$, the Fisher information of the $r$-smoothed distribution. As $n \to \infty$, $r \to 0$ at an explicit rate and this converges to the Cramér-Rao bound. We (1) improve the prior work for 1-dimensional $f$ to converge for constant failure probability in addition to high probability, and (2) extend the theory to high-dimensional distributions. In the process, we prove a new bound on the norm of a high-dimensional random variable whose 1-dimensional projections are subgamma, which may be of independent interest.

Submitted 5 February, 2023; originally announced February 2023.

arXiv:2302.01535 [pdf, other]

Support Recovery in Sparse PCA with Non-Random Missing Data

Authors: Hanbyul Lee, Qifan Song, Jean Honorio

Abstract: We analyze a practical algorithm for sparse PCA on incomplete and noisy data under a general non-random sampling scheme. The algorithm is based on a semidefinite relaxation of the $\ell_1$-regularized PCA problem. We provide theoretical justification that under certain conditions, we can recover the support of the sparse leading eigenvector with high probability by obtaining a unique solution. The conditions involve the spectral gap between the largest and second-largest eigenvalues of the true data matrix, the magnitude of the noise, and the structural properties of the observed entries. The concepts of algebraic connectivity and irregularity are used to describe the structural properties of the observed entries. We empirically justify our theorem with synthetic and real data analysis. We also show that our algorithm outperforms several other sparse PCA approaches especially when the observed entries have good structural properties. As a by-product of our analysis, we provide two theorems to handle a deterministic sampling scheme, which can be applied to other matrix-related problems.

Submitted 2 February, 2023; originally announced February 2023.

Comments: arXiv admin note: text overlap with arXiv:2205.15215

arXiv:2302.01002 [pdf, other]

Over-parameterised Shallow Neural Networks with Asymmetrical Node Scaling: Global Convergence Guarantees and Feature Learning

Authors: Francois Caron, Fadhel Ayed, Paul Jung, Hoil Lee, Juho Lee, Hongseok Yang

Abstract: We consider the optimisation of large and shallow neural networks via gradient flow, where the output of each hidden node is scaled by some positive parameter. We focus on the case where the node scalings are non-identical, differing from the classical Neural Tangent Kernel (NTK) parameterisation. We prove that, for large neural networks, with high probability, gradient flow converges to a global minimum AND can learn features, unlike in the NTK regime. We also provide experiments on synthetic and real-world datasets illustrating our theoretical results and showing the benefit of such scaling in terms of pruning and transfer learning.

Submitted 2 February, 2023; originally announced February 2023.

arXiv:2302.00951 [pdf, other]

A Bayesian analysis of current duration data with reporting issues: an application to estimating the distribution of time-between-sex from time-since-last-sex data as collected in cross-sectional surveys in low- and middle-income countries

Authors: Chi Hyun Lee, Herbert Susmann, Leontine Alkema

Abstract: Aggregate measures of family planning are used to monitor demand for and usage of contraceptive methods in populations globally, for example as part of the FP2030 initiative. Family planning measures for low- and middle-income countries are typically based on data collected through cross-sectional household surveys. Recently proposed measures account for sexual activity through assessment of the distribution of time-between-sex (TBS) in the population of interest. In this paper, we propose a statistical approach to estimate the distribution of TBS using data typically available in low- and middle-income countries, while addressing two major challenges. The first challenge is that timing of sex information is typically limited to women's time-since-last-sex (TSLS) data collected in the cross-sectional survey. In our proposed approach, we adopt the current duration method to estimate the distribution of TBS using the available TSLS data, from which the frequency of sex at the population level can be derived. Furthermore, the observed TSLS data are subject to reporting issues because they can be reported in different units and may be rounded off. To apply the current duration approach and account for these data reporting issues, we develop a flexible Bayesian model, and provide a detailed technical description of the proposed modeling approach.

Submitted 2 February, 2023; originally announced February 2023.

arXiv:2301.03057 [pdf, other]

doi 10.1093/biostatistics/kxac052

Characterizing quantile-varying covariate effects under the accelerated failure time model

Authors: Harrison T. Reeder, Kyu Ha Lee, Sebastien Haneuse

Abstract: An important task in survival analysis is choosing a structure for the relationship between covariates of interest and the time-to-event outcome. For example, the accelerated failure time (AFT) model structures each covariate effect as a constant multiplicative shift in the outcome distribution across all survival quantiles. Though parsimonious, this structure cannot detect or capture effects that differ across quantiles of the distribution, a limitation that is analogous to only permitting proportional hazards in the Cox model. To address this, we propose a general framework for quantile-varying multiplicative effects under the AFT model. Specifically, we embed flexible regression structures within the AFT model, and derive a novel formula for interpretable effects on the quantile scale. A regression standardization scheme based on the g-formula is proposed to enable estimation of both covariate-conditional and marginal effects for an exposure of interest. We implement a user-friendly Bayesian approach for estimation and quantification of uncertainty, while accounting for left truncation and complex censoring. We emphasize the intuitive interpretation of this model through numerical and graphical tools, and illustrate its performance by application to a study of Alzheimer's disease and dementia.

Submitted 8 January, 2023; originally announced January 2023.

Comments: This is the pre-peer-reviewed, "submitted" version of the manuscript published in final form in Biostatistics by Oxford University Press at the citation/doi given with this entry. This upload will be updated with the final peer-reviewed "accepted" version of the manuscript following a 24-month embargo period.

Journal ref: Biostatistics (Oxford, England), kxac052 (2023)

arXiv:2211.16333 [pdf, ps, other]

Outlier-Robust Sparse Mean Estimation for Heavy-Tailed Distributions

Authors: Ilias Diakonikolas, Daniel M. Kane, Jasper C. H. Lee, Ankit Pensia

Abstract: We study the fundamental task of outlier-robust mean estimation for heavy-tailed distributions in the presence of sparsity. Specifically, given a small number of corrupted samples from a high-dimensional heavy-tailed distribution whose mean $μ$ is guaranteed to be sparse, the goal is to efficiently compute a hypothesis that accurately approximates $μ$ with high probability. Prior work had obtained efficient algorithms for robust sparse mean estimation of light-tailed distributions. In this work, we give the first sample-efficient and polynomial-time robust sparse mean estimator for heavy-tailed distributions under mild moment assumptions. Our algorithm achieves the optimal asymptotic error using a number of samples scaling logarithmically with the ambient dimension. Importantly, the sample complexity of our method is optimal as a function of the failure probability $τ$, having an additive $\log(1/τ)$ dependence. Our algorithm leverages the stability-based approach from the algorithmic robust statistics literature, with crucial (and necessary) adaptations required in our setting. Our analysis may be of independent interest, involving the delicate design of a (non-spectral) decomposition for positive semi-definite matrices satisfying certain sparsity properties.

Submitted 29 November, 2022; originally announced November 2022.

Comments: To appear in NeurIPS 2022

arXiv:2211.13866 [pdf, ps, other]

Minimal Width for Universal Property of Deep RNN

Authors: Chang hoon Song, Geonho Hwang, Jun ho Lee, Myungjoo Kang

Abstract: A recurrent neural network (RNN) is a widely used deep-learning network for dealing with sequential data. Imitating a dynamical system, an infinite-width RNN can approximate any open dynamical system in a compact domain. In general, deep networks with bounded widths are more effective than wide networks in practice; however, the universal approximation theorem for deep narrow structures has yet to be extensively studied. In this study, we prove the universality of deep narrow RNNs and show that the upper bound of the minimum width for universality can be independent of the length of the data. Specifically, we show that a deep RNN with ReLU activation can approximate any continuous function or $L^p$ function with the widths $d_x+d_y+2$ and $\max\{d_x+1,d_y\}$, respectively, where the target function maps a finite sequence of vectors in $\mathbb{R}^{d_x}$ to a finite sequence of vectors in $\mathbb{R}^{d_y}$. We also compute the additional width required if the activation function is $\tanh$ or more. In addition, we prove the universality of other recurrent networks, such as bidirectional RNNs. Bridging a multi-layer perceptron and an RNN, our theory and proof technique can be an initial step toward further research on deep RNNs.

Submitted 28 March, 2023; v1 submitted 24 November, 2022; originally announced November 2022.

Showing 1–50 of 248 results for author: Lee, H