Statistics
- [1] arXiv:2407.00139 [pdf, html, other]
Title: A Calibrated Sensitivity Analysis for Weighted Causal Decompositions
Subjects: Methodology (stat.ME); Applications (stat.AP)
Disparities in health or well-being experienced by minority groups can be difficult to study using the traditional exposure-outcome paradigm in causal inference, since potential outcomes in variables such as race or sexual minority status are challenging to interpret. Causal decomposition analysis addresses this gap by positing causal effects on disparities under interventions to other, intervenable exposures that may play a mediating role in the disparity. While invoking weaker assumptions than causal mediation approaches, decomposition analyses are often conducted in observational settings and require uncheckable assumptions that eliminate unmeasured confounders. Leveraging the marginal sensitivity model, we develop a sensitivity analysis for weighted causal decomposition estimators and use the percentile bootstrap to construct valid confidence intervals for causal effects on disparities. We also propose a two-parameter amplification that enhances interpretability and facilitates an intuitive understanding of the plausibility of unmeasured confounders and their effects. We illustrate our framework on a study examining the effect of parental acceptance on disparities in suicidal ideation among sexual minority youth. We find that the effect is small and sensitive to unmeasured confounding, suggesting that further screening studies are needed to identify mitigating interventions in this vulnerable population.
- [2] arXiv:2407.00240 [pdf, html, other]
Title: Exact mean and covariance formulas after diagonal transformations of a multivariate normal
Comments: 21 pages
Subjects: Statistics Theory (math.ST)
Consider $\boldsymbol X \sim \mathcal{N}(\boldsymbol 0, \boldsymbol \Sigma)$ and $\boldsymbol Y = (f_1(X_1), f_2(X_2),\dots, f_d(X_d))$. We call this a diagonal transformation of a multivariate normal. In this paper we compute exactly the mean vector and covariance matrix of the random vector $\boldsymbol Y.$ This is done two different ways: One approach uses a series expansion for the function $f_i$ and the other a transform method. We compute several examples, show how the covariance entries can be estimated, and compare the theoretical results with numerical ones.
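As a rough illustration of what such exact formulas look like, the sketch below takes the special case $f_i(x) = e^x$ in two dimensions, where the mean and covariance of $\boldsymbol Y$ are the standard lognormal ones, and compares them with a Monte Carlo estimate. The choice of $f_i$, the covariance entries, and all function names are assumptions made for this example; the paper's series and transform methods are not reproduced here.

```python
import math
import random


def lognormal_moments(s11, s22, s12):
    """Closed-form moments of Y = (exp(X1), exp(X2)) for X ~ N(0, Sigma).

    Returns the mean vector and the off-diagonal covariance Cov(Y1, Y2).
    """
    mean = (math.exp(s11 / 2), math.exp(s22 / 2))
    cov12 = math.exp((s11 + s22) / 2) * (math.exp(s12) - 1)
    return mean, cov12


def monte_carlo_moments(s11, s22, s12, n=100_000, seed=0):
    """Simulation estimate of the same quantities (hypothetical helper)."""
    rng = random.Random(seed)
    # Cholesky factor of the 2x2 covariance matrix, written out by hand.
    l11 = math.sqrt(s11)
    l21 = s12 / l11
    l22 = math.sqrt(s22 - l21 ** 2)
    sum1 = sum2 = sum12 = 0.0
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        y1 = math.exp(l11 * z1)
        y2 = math.exp(l21 * z1 + l22 * z2)
        sum1 += y1
        sum2 += y2
        sum12 += y1 * y2
    m1, m2 = sum1 / n, sum2 / n
    return (m1, m2), sum12 / n - m1 * m2


if __name__ == "__main__":
    print("exact:      ", lognormal_moments(0.25, 0.25, 0.1))
    print("monte carlo:", monte_carlo_moments(0.25, 0.25, 0.1))
```

The two printed results should agree up to simulation error, which shrinks as `n` grows.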
- [3] arXiv:2407.00292 [pdf, html, other]
Title: Interpret the estimand framework from a causal inference perspective
Subjects: Other Statistics (stat.OT); Applications (stat.AP)
The estimand framework proposed by ICH in 2017 has brought fundamental changes in the pharmaceutical industry. It clearly describes how a treatment effect in a clinical question should be precisely defined and estimated, through attributes including treatments, endpoints and intercurrent events. However, ideas around the estimand framework are commonly expressed only in text, and different interpretations of the framework may exist. This article aims to interpret the estimand framework through its underlying theory, the causal inference framework based on potential outcomes. The statistical origin and formula of an estimand are given through the causal inference framework, with all attributes translated into statistical terms. We describe how the five strategies proposed by ICH to analyze intercurrent events are incorporated in the statistical formula of an estimand, and a new strategy to analyze intercurrent events is also suggested. The roles of target populations and analysis sets in the estimand framework are compared and discussed based on the statistical formula of an estimand. This article recommends continuing study of causal inference theories behind the estimand framework and improving the estimand framework with greater methodological comprehensibility and availability.
- [4] arXiv:2407.00364 [pdf, html, other]
Title: Medical Knowledge Integration into Reinforcement Learning Algorithms for Dynamic Treatment Regimes
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
The goal of precision medicine is to provide individualized treatment at each stage of chronic diseases, a concept formalized by Dynamic Treatment Regimes (DTR). These regimes adapt treatment strategies based on decision rules learned from clinical data to enhance therapeutic effectiveness. Reinforcement Learning (RL) algorithms make it possible to determine these decision rules conditioned on individual patient data and medical history. Integrating medical expertise into these models makes it possible to increase confidence in treatment recommendations and to facilitate the adoption of this approach by healthcare professionals and patients. In this work, we examine the mathematical foundations of RL, contextualize its application in the field of DTR, and present an overview of methods to improve its effectiveness by integrating medical expertise.
- [5] arXiv:2407.00381 [pdf, html, other]
Title: Climate change analysis from LRD manifold functional regression
Subjects: Methodology (stat.ME)
A functional nonlinear regression approach, incorporating time information in the covariates, is proposed for the analysis of temporally strongly correlated sequences of manifold map data. Specifically, the functional regression parameters are supported on a connected and compact two-point homogeneous space. The Generalized Least-Squares (GLS) parameter estimator is computed in the linearized model, whose error term displays manifold scale varying Long Range Dependence (LRD). The performance of the theoretical and plug-in nonlinear regression predictors is illustrated by simulations on the sphere, in terms of the empirical mean of the computed spherical functional absolute errors. In the case where the second-order structure of the functional error term in the linearized model is unknown, its estimation is performed by minimum contrast in the functional spectral domain. The linear case is illustrated in the Supplementary Material, revealing the effect of the slow decay velocity in time of the trace norms of the covariance operator family of the regression LRD error term. The purely spatial statistical analysis of atmospheric pressure at high cloud bottom, and downward solar radiation flux in Alegria et al. (2021) is extended to the spatiotemporal context, illustrating the numerical results from a generated synthetic data set.
- [6] arXiv:2407.00561 [pdf, html, other]
Title: Advancing Information Integration through Empirical Likelihood: Selective Reviews and a New Idea
Subjects: Methodology (stat.ME); Applications (stat.AP)
Information integration plays a pivotal role in biomedical studies by facilitating the combination and analysis of independent datasets from multiple studies, thereby uncovering valuable insights that might otherwise remain obscured due to the limited sample size in individual studies. However, sharing raw data from independent studies presents significant challenges, primarily due to the need to safeguard sensitive participant information and the cumbersome paperwork involved in data sharing. In this article, we first provide a selective review of recent methodological developments in information integration via empirical likelihood, wherein only summary information is required, rather than the raw data. Following this, we introduce a new insight and a potentially promising framework that could broaden the application of information integration across a wider spectrum. Furthermore, this new framework offers computational convenience compared to classic empirical likelihood-based methods. We provide numerical evaluations to assess its performance and discuss various extensions in the end.
- [7] arXiv:2407.00564 [pdf, html, other]
Title: Variational Nonparametric Inference in Functional Stochastic Block Model
Subjects: Methodology (stat.ME)
We propose a functional stochastic block model whose vertices involve functional data information. This new model extends the classic stochastic block model with vector-valued nodal information, and finds applications in real-world networks whose nodal information could be functional curves. Examples include international trade data in which a network vertex (country) is associated with the annual or quarterly GDP over a certain time period, and MyFitnessPal data in which a network vertex (MyFitnessPal user) is associated with daily calorie information measured over a certain time period. Two statistical tasks will be jointly executed. First, we will detect community structures of the network vertices assisted by the functional nodal information. Second, we propose a computationally efficient variational test to examine the significance of the functional nodal information. We show that the community detection algorithms achieve weak and strong consistency, and the variational test is asymptotically chi-square with diverging degrees of freedom. As a byproduct, we propose pointwise confidence intervals for the slope function of the functional nodal information. Our methods are examined through both simulated and real datasets.
- [8] arXiv:2407.00644 [pdf, html, other]
Title: Clusterpath Gaussian Graphical Modeling
Comments: 43 pages, 11 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Graphical models serve as effective tools for visualizing conditional dependencies between variables. However, as the number of variables grows, interpretation becomes increasingly difficult, and estimation uncertainty increases due to the large number of parameters relative to the number of observations. To address these challenges, we introduce the Clusterpath estimator of the Gaussian Graphical Model (CGGM) that encourages variable clustering in the graphical model in a data-driven way. Through the use of a clusterpath penalty, we group variables together, which in turn results in a block-structured precision matrix whose block structure remains preserved in the covariance matrix. We present a computationally efficient implementation of the CGGM estimator by using a cyclic block coordinate descent algorithm. In simulations, we show that CGGM not only matches, but oftentimes outperforms other state-of-the-art methods for variable clustering in graphical models. We also demonstrate CGGM's practical advantages and versatility on a diverse collection of empirical applications.
- [9] arXiv:2407.00649 [pdf, html, other]
Title: Particle Semi-Implicit Variational Inference
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Semi-implicit variational inference (SIVI) enriches the expressiveness of variational families by utilizing a kernel and a mixing distribution to hierarchically define the variational distribution. Existing SIVI methods parameterize the mixing distribution using implicit distributions, leading to intractable variational densities. As a result, directly maximizing the evidence lower bound (ELBO) is not possible, so they resort to optimizing bounds on the ELBO, employing costly inner-loop Markov chain Monte Carlo runs, or solving minimax objectives. In this paper, we propose a novel method for SIVI called Particle Variational Inference (PVI) which employs empirical measures to approximate the optimal mixing distributions characterized as the minimizer of a natural free energy functional via a particle approximation of a Euclidean-Wasserstein gradient flow. This approach means that, unlike prior works, PVI can directly optimize the ELBO; furthermore, it makes no parametric assumption about the mixing distribution. Our empirical results demonstrate that PVI performs favourably against other SIVI methods across various tasks. Moreover, we provide a theoretical analysis of the behaviour of the gradient flow of a related free energy functional: establishing the existence and uniqueness of solutions as well as propagation of chaos results.
- [10] arXiv:2407.00650 [pdf, html, other]
Title: Proper Scoring Rules for Multivariate Probabilistic Forecasts based on Aggregation and Transformation
Comments: for associated code, see this https URL
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Applications (stat.AP)
Proper scoring rules are an essential tool to assess the predictive performance of probabilistic forecasts. However, propriety alone does not ensure an informative characterization of predictive performance and it is recommended to compare forecasts using multiple scoring rules. With that in mind, interpretable scoring rules providing complementary information are necessary. We formalize a framework based on aggregation and transformation to build interpretable multivariate proper scoring rules. Aggregation-and-transformation-based scoring rules are able to target specific features of the probabilistic forecasts, which improves the characterization of the predictive performance. This framework is illustrated through examples taken from the literature and studied using numerical experiments showcasing its benefits. In particular, it is shown that it can help bridge the gap between proper scoring rules and spatial verification tools.
- [11] arXiv:2407.00655 [pdf, html, other]
Title: Markov Switching Multiple-equation Tensor Regressions
Subjects: Methodology (stat.ME)
We propose a new flexible tensor model for multiple-equation regression that accounts for latent regime changes. The model allows for dynamic coefficients and multi-dimensional covariates that vary across equations. We assume the coefficients are driven by a common hidden Markov process that addresses structural breaks to enhance the model flexibility and preserve parsimony. We introduce a new Soft PARAFAC hierarchical prior to achieve dimensionality reduction while preserving the structural information of the covariate tensor. The proposed prior includes a new multi-way shrinking effect to address over-parametrization issues. We develop theoretical results to guide the choice of hyperparameters. An efficient MCMC algorithm based on random scan Gibbs and back-fitting strategy is developed to achieve better computational scalability of the posterior sampling. The validity of the MCMC algorithm is demonstrated theoretically, and its computational efficiency is studied using numerical experiments in different parameter settings. The effectiveness of the model framework is illustrated using two original real data analyses. The proposed model exhibits superior performance when compared to the current benchmark, Lasso regression.
- [12] arXiv:2407.00709 [pdf, html, other]
Title: Comparative Effectiveness Research with Average Hazard for Censored Time-to-Event Outcomes: A Numerical Study
Subjects: Applications (stat.AP)
The average hazard (AH), recently introduced by Uno and Horiguchi, represents a novel summary metric of event time distributions, conceptualized as the general censoring-free average person-time incidence rate on a given time window, $[0,\tau].$ This metric is calculated as the ratio of the cumulative incidence probability at $\tau$ to the restricted mean survival time at $\tau$ and can be estimated through non-parametric methods. The AH's difference and ratio present viable alternatives to the traditional Cox's hazard ratio for quantifying the treatment effect on time-to-event outcomes in comparative clinical studies. While the methodology for evaluating the difference and ratio of AH in randomized clinical trials has been previously proposed, the application of the AH-based approach in general comparative effectiveness research (CER), where interventions are not randomly allocated, remains underdiscussed. This paper aims to introduce several approaches for applying the AH in general CER, thereby extending its utility beyond randomized trial settings to observational studies where treatment assignment is non-random.
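The ratio definition above can be checked numerically. The sketch below assumes a fully known survival function rather than censored data, and uses an exponential distribution, for which the average hazard on any window should equal the constant rate. The function names and the trapezoidal integration are choices made for this illustration; they are not the non-parametric estimator discussed in the paper.

```python
import math


def restricted_mean_survival_time(survival, tau, steps=10_000):
    """Trapezoidal approximation of the integral of S(t) over [0, tau]."""
    h = tau / steps
    total = 0.5 * (survival(0.0) + survival(tau))
    for k in range(1, steps):
        total += survival(k * h)
    return total * h


def average_hazard(survival, tau, steps=10_000):
    """AH(tau) = F(tau) / RMST(tau), where F(tau) = 1 - S(tau)."""
    cumulative_incidence = 1.0 - survival(tau)
    return cumulative_incidence / restricted_mean_survival_time(survival, tau, steps)


def exponential_survival(rate):
    """Survival function S(t) = exp(-rate * t) of an exponential event time."""
    return lambda t: math.exp(-rate * t)


if __name__ == "__main__":
    # For a constant hazard of 0.3 the average hazard should be close to 0.3.
    print(average_hazard(exponential_survival(0.3), tau=5.0))
```

Passing a different survival function, such as a Weibull one, would show how the average hazard depends on the window $[0,\tau]$ when the hazard is not constant.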
- [13] arXiv:2407.00712 [pdf, html, other]
Title: Geometric and Harmonic Aging Intensity function and a Reliability Perspective
Comments: 22 pages, 3 figures
Subjects: Statistics Theory (math.ST); Probability (math.PR)
In this paper, we introduce some new notions of aging based on geometric and harmonic means of the failure rate and the aging intensity function. We define generalized versions of aging functions called the specific interval-average geometric hazard rate and the specific interval-average harmonic hazard rate. We focus on some characterization results and the inter-relationships among the resulting non-parametric classes of distributions. The monotonic nature of the aging classes so defined is exhibited by some well-known probability distributions. Probabilistic orders based on these functions are taken up for further study. The work is illustrated through case studies and a simulated data set, with applications in reliability/survival analysis.
- [14] arXiv:2407.00716 [pdf, html, other]
Title: On a General Theoretical Framework of Reliability
Subjects: Methodology (stat.ME)
Reliability is an essential measure of how closely observed scores represent latent scores (reflecting constructs), assuming some latent variable measurement model. We present a general theoretical framework of reliability, placing emphasis on measuring association between latent and observed scores. This framework was inspired by McDonald's (2011) regression framework, which highlighted the coefficient of determination as a measure of reliability. We extend McDonald's (2011) framework beyond coefficients of determination and introduce four desiderata for reliability measures (estimability, normalization, symmetry, and invariance). We also present theoretical examples to illustrate distinct measures of reliability and report on a numerical study that demonstrates the behavior of different reliability measures. We conclude with a discussion on the use of reliability coefficients and outline future avenues of research.
- [15] arXiv:2407.00730 [pdf, html, other]
Title: D-CDLF: Decomposition of Common and Distinctive Latent Factors for Multi-view High-dimensional Data
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
A typical approach to the joint analysis of multiple high-dimensional data views is to decompose each view's data matrix into three parts: a low-rank common-source matrix generated by common latent factors of all data views, a low-rank distinctive-source matrix generated by distinctive latent factors of the corresponding data view, and an additive noise matrix. Existing decomposition methods often focus on the uncorrelatedness between the common latent factors and distinctive latent factors, but inadequately address the equally necessary uncorrelatedness between distinctive latent factors from different data views. We propose a novel decomposition method, called Decomposition of Common and Distinctive Latent Factors (D-CDLF), to effectively achieve both types of uncorrelatedness for two-view data. We also discuss the estimation of the D-CDLF under high-dimensional settings.
- [16] arXiv:2407.00791 [pdf, html, other]
Title: inlabru: software for fitting latent Gaussian models with non-linear predictors
Subjects: Methodology (stat.ME); Computation (stat.CO)
The integrated nested Laplace approximation (INLA) method has become a popular approach for computationally efficient approximate Bayesian computation. In particular, by leveraging sparsity in random effect precision matrices, INLA is commonly used in spatial and spatio-temporal applications. However, the speed of INLA comes at the cost of restricting the user to the family of latent Gaussian models and the likelihoods currently implemented in {INLA}, the main software implementation of the INLA methodology.
{inlabru} is a software package that extends the types of models that can be fitted using INLA by allowing the latent predictor to be non-linear in its parameters, moving beyond the additive linear predictor framework to allow more complex functional relationships. For inference it uses an approximate iterative method based on the first-order Taylor expansion of the non-linear predictor, fitting the model using INLA for each linearised model configuration.
{inlabru} automates much of the workflow required to fit models using {R-INLA}, simplifying the process for users to specify, fit and predict from models. There is additional support for fitting joint likelihood models by building each likelihood individually. {inlabru} also supports the direct use of spatial data structures, such as those implemented in the {sf} and {terra} packages.
In this paper we outline the statistical theory, model structure and basic syntax required for users to understand and develop their own models using {inlabru}. We evaluate the approximate inference method using a Bayesian method checking approach. We provide three examples modelling simulated spatial data that demonstrate the benefits of the additional flexibility provided by {inlabru}.
- [17] arXiv:2407.00797 [pdf, html, other]
Title: A placement-value based approach to concave ROC analysis
Comments: 18 pages, 6 figures, 2 tables
Subjects: Methodology (stat.ME)
The receiver operating characteristic (ROC) curve is an important graphic tool for evaluating a test in a wide range of disciplines. While useful, an ROC curve can cross the chance line, either by having an S-shape or a hook at the extreme specificity. These non-concave ROC curves are sub-optimal according to decision theory, as there are points that are superior to those corresponding to the portions below the chance line with either the same sensitivity or specificity. We extend the literature by proposing a novel placement value-based approach to ensure concave curvature of the ROC curve, and utilize the Bayesian paradigm to make estimations under both a parametric and a semiparametric framework. We conduct extensive simulation studies to assess the performance of the proposed methodology under various scenarios, and apply it to a pancreatic cancer dataset.
- [18] arXiv:2407.00846 [pdf, html, other]
Title: Estimating the cognitive effects of statins from observational data using the survival-incorporated median: a summary measure for clinical outcomes in the presence of death
Authors: Qingyan Xiang, Paola Sebastiani, Thomas Perls, Stacy L. Andersen, Svetlana Ukraintseva, Mikael Thinggaard, Judith J. Lok
Comments: 56 pages
Subjects: Methodology (stat.ME); Applications (stat.AP)
The issue of "truncation by death" commonly arises in clinical research: subjects may die before their follow-up assessment, resulting in undefined clinical outcomes. This article addresses truncation by death by analyzing the Long Life Family Study (LLFS), a multicenter observational study involving over 4000 older adults with familial longevity. We are interested in the cognitive effects of statins in LLFS participants, as the impact of statins on cognition remains unclear despite their widespread use. In this application, rather than treating death as a mechanism through which clinical outcomes are missing, we advocate treating death as part of the outcome measure. We focus on the survival-incorporated median, the median of a composite outcome combining death and cognitive scores, to summarize the effect of statins. We propose an estimator for the survival-incorporated median from observational data, applicable in both point-treatment settings and time-varying treatment settings. Simulations demonstrate the survival-incorporated median as a simple and useful summary measure. We apply this method to estimate the effect of statins on the change in cognitive function (measured by the Digit Symbol Substitution Test), incorporating death. Our results indicate no significant difference in cognitive decline between participants with a similar age distribution on and off statins from baseline. Through this application, we aim to not only contribute to this clinical question but also offer insights into analyzing clinical outcomes in the presence of death.
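A toy version of the composite-outcome idea, ignoring confounding adjustment, censoring and time-varying treatment entirely: the sketch below assumes that death is ranked below every observable cognitive score and then takes an ordinary sample median. The ranking convention, the `None` encoding for death, and the function name are assumptions made for illustration; the estimator the paper proposes for observational data is not reproduced here.

```python
import math
import statistics


def survival_incorporated_median(scores):
    """Median of a composite outcome that combines death and a score.

    `scores` holds one entry per subject: a numeric cognitive score, or
    None if the subject died before assessment. Death is mapped to -inf
    so that it ranks below any score (an assumed convention).
    """
    composite = [-math.inf if s is None else float(s) for s in scores]
    # median_low returns an observed value, which avoids averaging with -inf.
    return statistics.median_low(composite)


if __name__ == "__main__":
    # Two of five subjects died; the composite median is the third-ranked value.
    print(survival_incorporated_median([None, 40, 55, None, 62]))
```

A result of `-inf` means that at least half of the subjects died, so the median of the composite outcome falls in the death category rather than on the score scale.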
- [19] arXiv:2407.00859 [pdf, html, other]
Title: Statistical inference on partially shape-constrained function-on-scalar linear regression models
Comments: 30 pages, 7 figures
Subjects: Methodology (stat.ME)
We consider functional linear regression models where functional outcomes are associated with scalar predictors by coefficient functions with shape constraints, such as monotonicity and convexity, that apply to sub-domains of interest. To validate the partial shape constraints, we propose testing a composite hypothesis of linear functional constraints on regression coefficients. Our approach employs kernel- and spline-based methods within a unified inferential framework, evaluating the statistical significance of the hypothesis by measuring an $L^2$-distance between constrained and unconstrained model fits. In the theoretical study of large-sample analysis under mild conditions, we show that both methods achieve the standard rate of convergence observed in the nonparametric estimation literature. Through numerical experiments of finite-sample analysis, we demonstrate that the type I error rate keeps the significance level as specified across various scenarios and that the power increases with sample size, confirming the consistency of the test procedure under both estimation methods. Our theoretical and numerical results provide researchers the flexibility to choose a method based on computational preference. The practicality of partial shape-constrained inference is illustrated by two data applications: one involving clinical trials of NeuroBloc in type A-resistant cervical dystonia and the other with the National Institute of Mental Health Schizophrenia Study.
- [20] arXiv:2407.00882 [pdf, html, other]
Title: Subgroup Identification with Latent Factor Structure
Subjects: Methodology (stat.ME)
Subgroup analysis has attracted growing attention due to its ability to identify meaningful subgroups from a heterogeneous population and thereby improve predictive power. However, in many scenarios such as social science and biology, the covariates are possibly highly correlated due to the existence of common factors, which brings great challenges for group identification and is neglected in the existing literature. In this paper, we aim to fill this gap in the "diverging dimension" regime and propose a center-augmented subgroup identification method under the Factor Augmented (sparse) Linear Model framework, which bridges dimension reduction and sparse regression. The proposed method is flexible to the possibly high cross-sectional dependence among covariates and inherits the computational advantage with complexity $O(nK)$, in contrast to the $O(n^2)$ complexity of the conventional pairwise fusion penalty method in the literature, where $n$ is the sample size and $K$ is the number of subgroups. We also investigate the asymptotic properties of its oracle estimators under conditions on the minimal distance between group centroids. To implement the proposed approach, we introduce a Difference of Convex functions based Alternating Direction Method of Multipliers (DC-ADMM) algorithm and demonstrate its convergence to a local minimizer in finite steps. We illustrate the superiority of the proposed method through extensive numerical experiments and a real macroeconomic data example. An R package, SILFS, implementing the method is also available on CRAN.
- [21] arXiv:2407.00953 [pdf, html, other]
Title: Estimation for the damping factor of the driving process of an SPDE in two space dimensions
Subjects: Statistics Theory (math.ST)
We study parametric estimation for a second order linear parabolic stochastic partial differential equation (SPDE) in two space dimensions driven by a $Q$-Wiener process based on high frequency spatio-temporal data. We give an estimator of the damping parameter of the $Q$-Wiener process of the SPDE based on quadratic variations with temporal and spatial increments. We also provide simulation results of the proposed estimator.
- [22] arXiv:2407.01015 [pdf, html, other]
Title: Bayesian Entropy Neural Networks for Physics-Aware Prediction
Comments: 15 pages
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
This paper addresses the need for deep learning models to integrate well-defined constraints into their outputs, driven by their application in surrogate models, learning with limited data and partial information, and scenarios requiring flexible model behavior to incorporate non-data sample information. We introduce Bayesian Entropy Neural Networks (BENN), a framework grounded in Maximum Entropy (MaxEnt) principles, designed to impose constraints on Bayesian Neural Network (BNN) predictions. BENN is capable of constraining not only the predicted values but also their derivatives and variances, ensuring a more robust and reliable model output. To achieve simultaneous uncertainty quantification and constraint satisfaction, we employ the method of multipliers approach. This allows for the concurrent estimation of neural network parameters and the Lagrangian multipliers associated with the constraints. Our experiments, spanning diverse applications such as beam deflection modeling and microstructure generation, demonstrate the effectiveness of BENN. The results highlight significant improvements over traditional BNNs and showcase competitive performance relative to contemporary constrained deep learning methods.
- [23] arXiv:2407.01036 [pdf, html, other]
Title: Ranking by Lifts: A Cost-Benefit Approach to Large-Scale A/B Tests
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
A/B testers conducting large-scale tests prioritize lifts and want to be able to control false rejections of the null. This work develops a decision-theoretic framework for maximizing profits subject to false discovery rate (FDR) control. We build an empirical Bayes solution for the problem via the greedy knapsack approach. We derive an oracle rule based on ranking the ratio of expected lifts and the cost of wrong rejections using the local false discovery rate (lfdr) statistic. Our oracle decision rule is valid and optimal for large-scale tests. Further, we establish asymptotic validity for the data-driven procedure and demonstrate finite-sample validity in experimental studies. We also demonstrate the merit of the proposed method over other FDR control methods. Finally, we discuss an application to actual Optimizely experiments.
- [24] arXiv:2407.01040 [pdf, html, other]
Title: QBIC of SEM for diffusion processes from discrete observations
Comments: 26 pages, 4 figures. arXiv admin note: text overlap with arXiv:2402.08959
Subjects: Statistics Theory (math.ST)
We deal with a model selection problem for structural equation modeling (SEM) with latent variables for diffusion processes. Based on the asymptotic expansion of the marginal quasi-log likelihood, we propose two types of quasi-Bayesian information criteria of the SEM. It is shown that the information criteria have model selection consistency. Furthermore, we examine the finite-sample performance of the proposed information criteria by numerical experiments.
- [25] arXiv:2407.01055 [pdf, html, other]
Title: Exact statistical analysis for response-adaptive clinical trials: a general and computationally tractable approach
Comments: 35 pages, 6 figures, 11 tables
Subjects: Methodology (stat.ME)
Response-adaptive (RA) designs of clinical trials allow targeting a given objective by skewing the allocation of participants to treatments based on observed outcomes. RA designs face greater regulatory scrutiny due to potential type I error inflation, which limits their uptake in practice. Existing approaches to type I error control either only work for specific designs, have a risk of Monte Carlo/approximation error, are conservative, or are computationally intractable. We develop a general and computationally tractable approach for exact analysis in two-arm RA designs with binary outcomes. We use the approach to construct exact tests applicable to designs that use either randomized or deterministic RA procedures, allowing for complexities such as delayed outcomes, early stopping or allocation of participants in blocks. Our efficient forward recursion implementation allows for testing of two-arm trials with 1,000 participants on a standard computer. Through an illustrative computational study of trials using randomized dynamic programming we show that, contrary to what is known for equal allocation, a conditional exact test has, almost uniformly, higher power than the unconditional test. Two real-world trials with the above-mentioned complexities are re-analyzed to demonstrate the value of our approach in controlling type I error and/or improving the statistical power.
- [26] arXiv:2407.01079 [pdf, html, other]
-
Title: On Statistical Rates and Provably Efficient Criteria of Latent Diffusion Transformers (DiTs)
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
We investigate the statistical and computational limits of latent \textbf{Di}ffusion \textbf{T}ransformers (\textbf{DiT}s) under the low-dimensional linear latent space assumption. Statistically, we study the universal approximation and sample complexity of the DiTs score function, as well as the distribution recovery property of the initial data. Specifically, under mild data assumptions, we derive an approximation error bound for the score network of latent DiTs, which is sub-linear in the latent space dimension. Additionally, we derive the corresponding sample complexity bound and show that the data distribution generated from the estimated score function converges toward a proximate area of the original one. Computationally, we characterize the hardness of both forward inference and backward computation of latent DiTs, assuming the Strong Exponential Time Hypothesis (SETH). For forward inference, we identify efficient criteria for all possible latent DiTs inference algorithms and showcase our theory by pushing the efficiency toward almost-linear time inference. For backward computation, we leverage the low-rank structure within the gradient computation of DiTs training for possible algorithmic speedup. Specifically, we show that such speedup achieves almost-linear time latent DiTs training by casting the DiTs gradient as a series of chained low-rank approximations with bounded error. Under the low-dimensional assumption, we show that the convergence rate and the computational efficiency are both dominated by the dimension of the subspace, suggesting that latent DiTs have the potential to bypass the challenges associated with the high dimensionality of initial data.
- [27] arXiv:2407.01172 [pdf, html, other]
-
Title: Enlarging of the sample to address multicollinearity
Comments: 11 pages, 2 tables, working paper
Subjects: Applications (stat.AP)
The paper analyzes how enlarging the sample affects the mitigation of collinearity, concluding that it may mitigate the consequences of collinearity for statistical analysis but not necessarily the numerical instability. The problem addressed is of importance in the teaching of social sciences, since enlarging the sample is one of the solutions proposed almost unanimously for multicollinearity. For a better understanding and illustration of the contribution of this paper, two empirical examples are presented and highly technical developments are avoided.
- [28] arXiv:2407.01186 [pdf, html, other]
-
Title: Data fusion for efficiency gain in ATE estimation: A practical review with simulations
Subjects: Methodology (stat.ME)
The integration of real-world data (RWD) and randomized controlled trials (RCT) is increasingly important for advancing causal inference in scientific research. This combination holds great promise for enhancing the efficiency of causal effect estimation, offering benefits such as reduced trial participant numbers and expedited drug access for patients. Despite the availability of numerous data fusion methods, selecting the most appropriate one for a specific research question remains challenging. This paper systematically reviews and compares these methods regarding their assumptions, limitations, and implementation complexities. Through simulations reflecting real-world scenarios, we identify a prevalent risk-reward trade-off across different methods. We investigate and interpret this trade-off, providing key insights into the strengths and weaknesses of various methods, thereby helping researchers navigate through the application of data fusion for improved causal inference.
- [29] arXiv:2407.01483 [pdf, html, other]
-
Title: A General Purpose Approximation to the Ferguson-Klass Algorithm for Sampling from Lévy Processes Without Gaussian Components
Subjects: Computation (stat.CO); Applications (stat.AP)
We propose a general-purpose approximation to the Ferguson-Klass algorithm for generating samples from Lévy processes without Gaussian components. We show that the proposed method is more than 1000 times faster than the standard Ferguson-Klass algorithm without a significant loss of precision. This method can open an avenue for computationally efficient and scalable Bayesian nonparametric models which go beyond conjugacy assumptions, as demonstrated in the examples section.
- [30] arXiv:2407.01495 [pdf, html, other]
-
Title: Multifidelity Cross-validation
Comments: arXiv admin note: text overlap with arXiv:2203.01436
Subjects: Computation (stat.CO); Machine Learning (stat.ML)
Emulating the mapping between quantities of interest and their control parameters using surrogate models finds widespread application in engineering design, including in numerical optimization and uncertainty quantification. Gaussian process models can serve as a probabilistic surrogate model of unknown functions, thereby making them highly suitable for engineering design and decision-making in the presence of uncertainty. In this work, we are interested in emulating quantities of interest observed from models of a system at multiple fidelities, which trade accuracy for computational efficiency. Using multifidelity Gaussian process models to efficiently fuse models at multiple fidelities, we propose a novel method to actively learn the surrogate model via leave-one-out cross-validation (LOO-CV). Our proposed multifidelity cross-validation (\texttt{MFCV}) approach develops an adaptive approach to reduce the LOO-CV error at the target (highest) fidelity, by learning the correlations between the LOO-CV at all fidelities. \texttt{MFCV} develops a two-step lookahead policy to select optimal input-fidelity pairs, both in sequence and in batches, both for continuous and discrete fidelity spaces. We demonstrate the utility of our method on several synthetic test problems as well as on the thermal stress analysis of a gas turbine blade.
New submissions for Tuesday, 2 July 2024 (showing 30 of 30 entries)
- [31] arXiv:2407.00028 (cross-list from q-bio.NC) [pdf, html, other]
-
Title: Harnessing XGBoost for Robust Biomarker Selection of Obsessive-Compulsive Disorder (OCD) from Adolescent Brain Cognitive Development (ABCD) data
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Applications (stat.AP)
This study evaluates the performance of various supervised machine learning models in analyzing highly correlated neural signaling data from the Adolescent Brain Cognitive Development (ABCD) Study, with a focus on predicting obsessive-compulsive disorder scales. We simulated a dataset to mimic the correlation structures commonly found in imaging data and evaluated logistic regression, elastic networks, random forests, and XGBoost on their ability to handle multicollinearity and accurately identify predictive features. Our study aims to guide the selection of appropriate machine learning methods for processing neuroimaging data, highlighting models that best capture underlying signals in high feature correlations and prioritize clinically relevant features associated with Obsessive-Compulsive Disorder (OCD).
- [32] arXiv:2407.00099 (cross-list from q-bio.NC) [pdf, html, other]
-
Title: Optimal Transport for Latent Integration with An Application to Heterogeneous Neuronal Activity Data
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Applications (stat.AP)
Detecting dynamic patterns of task-specific responses shared across heterogeneous datasets is an essential and challenging problem in many scientific applications in medical science and neuroscience. In our motivating example of rodent electrophysiological data, identifying the dynamical patterns in neuronal activity associated with ongoing cognitive demands and behavior is key to uncovering the neural mechanisms of memory. One of the greatest challenges in investigating a cross-subject biological process is that the systematic heterogeneity across individuals could significantly undermine the power of existing machine learning methods to identify the underlying biological dynamics. In addition, many technically challenging neurobiological experiments are conducted on only a handful of subjects where rich longitudinal data are available for each subject. The low sample sizes of such experiments could further reduce the power to detect common dynamic patterns among subjects. In this paper, we propose a novel heterogeneous data integration framework based on optimal transport to extract shared patterns in complex biological processes. The key advantages of the proposed method are that it can increase discriminating power in identifying common patterns by reducing heterogeneity unrelated to the signal by aligning the extracted latent spatiotemporal information across subjects. Our approach is effective even with a small number of subjects, and does not require auxiliary matching information for the alignment. In particular, our method can align longitudinal data across heterogeneous subjects in a common latent space to capture the dynamics of shared patterns while utilizing temporal dependency within subjects.
- [33] arXiv:2407.00143 (cross-list from cs.LG) [pdf, other]
-
Title: InfoNCE: Identifying the Gap Between Theory and Practice
Authors: Evgenia Rusak, Patrik Reizinger, Attila Juhos, Oliver Bringmann, Roland S. Zimmermann, Wieland Brendel
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Previous theoretical work on contrastive learning (CL) with InfoNCE showed that, under certain assumptions, the learned representations uncover the ground-truth latent factors. We argue these theories overlook crucial aspects of how CL is deployed in practice. Specifically, they assume that within a positive pair, all latent factors either vary to a similar extent, or that some do not vary at all. However, in practice, positive pairs are often generated using augmentations such as strong cropping to just a few pixels. Hence, a more realistic assumption is that all latent factors change, with a continuum of variability across these factors. We introduce AnInfoNCE, a generalization of InfoNCE that can provably uncover the latent factors in this anisotropic setting, broadly generalizing previous identifiability results in CL. We validate our identifiability results in controlled experiments and show that AnInfoNCE increases the recovery of previously collapsed information in CIFAR10 and ImageNet, albeit at the cost of downstream accuracy. Additionally, we explore and discuss further mismatches between theoretical assumptions and practical implementations, including extensions to hard negative mining and loss ensembles.
- [34] arXiv:2407.00175 (cross-list from q-bio.QM) [pdf, other]
-
Title: Permutation invariant multi-output Gaussian Processes for drug combination prediction in cancer
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
Dose-response prediction in cancer is an active application field in machine learning. Using large libraries of \textit{in-vitro} drug sensitivity screens, the goal is to develop accurate predictive models that can be used to guide experimental design or inform treatment decisions. Building on previous work that makes use of permutation invariant multi-output Gaussian Processes in the context of dose-response prediction for drug combinations, we develop a variational approximation to these models. The variational approximation enables a more scalable model that provides uncertainty quantification and naturally handles missing data. Furthermore, we propose using a deep generative model to encode the chemical space in a continuous manner, enabling prediction for new drugs and new combinations. We demonstrate the performance of our model in a simple setting using a high-throughput dataset and show that the model is able to efficiently borrow information across outputs.
- [35] arXiv:2407.00224 (cross-list from cs.CV) [pdf, html, other]
-
Title: Multimodal Prototyping for cancer survival prediction
Authors: Andrew H. Song, Richard J. Chen, Guillaume Jaume, Anurag J. Vaidya, Alexander S. Baras, Faisal Mahmood
Comments: ICML 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
Multimodal survival methods combining gigapixel histology whole-slide images (WSIs) and transcriptomic profiles are particularly promising for patient prognostication and stratification. Current approaches involve tokenizing the WSIs into smaller patches (>10,000 patches) and transcriptomics into gene groups, which are then integrated using a Transformer for predicting outcomes. However, this process generates many tokens, which leads to high memory requirements for computing attention and complicates post-hoc interpretability analyses. Instead, we hypothesize that we can: (1) effectively summarize the morphological content of a WSI by condensing its constituting tokens using morphological prototypes, achieving more than 300x compression; and (2) accurately characterize cellular functions by encoding the transcriptomic profile with biological pathway prototypes, all in an unsupervised fashion. The resulting multimodal tokens are then processed by a fusion network, either with a Transformer or an optimal transport cross-alignment, which now operates with a small and fixed number of tokens without approximations. Extensive evaluation on six cancer types shows that our framework outperforms state-of-the-art methods with much less computation while unlocking new interpretability analyses.
- [36] arXiv:2407.00256 (cross-list from cs.AI) [pdf, html, other]
-
Title: One Prompt is not Enough: Automated Construction of a Mixture-of-Expert Prompts
Comments: ICML 2024. code available at this https URL
Journal-ref: Proceedings of the 41st International Conference on Machine Learning (ICML), Vienna, Austria, 2024
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Large Language Models (LLMs) exhibit strong generalization capabilities to novel tasks when prompted with language instructions and in-context demos. Since this ability sensitively depends on the quality of prompts, various methods have been explored to automate the instruction design. While these methods demonstrated promising results, they also restricted the searched prompt to one instruction. Such simplification significantly limits their capacity, as a single demo-free instruction might not be able to cover the entire complex problem space of the targeted task. To alleviate this issue, we adopt the Mixture-of-Expert paradigm and divide the problem space into a set of sub-regions; each sub-region is governed by a specialized expert, equipped with both an instruction and a set of demos. A two-phase process is developed to construct the specialized expert for each region: (1) demo assignment: Inspired by the theoretical connection between in-context learning and kernel regression, we group demos into experts based on their semantic similarity; (2) instruction assignment: A region-based joint search of an instruction per expert complements the demos assigned to it, yielding a synergistic effect. The resulting method, codenamed Mixture-of-Prompts (MoP), achieves an average win rate of 81% against prior art across several major benchmarks.
- [37] arXiv:2407.00258 (cross-list from physics.soc-ph) [pdf, html, other]
-
Title: Graph Simplification Solutions to the Street Intersection Miscount Problem
Subjects: Physics and Society (physics.soc-ph); Discrete Mathematics (cs.DM); Systems and Control (eess.SY); Computation (stat.CO)
Street intersection counts and densities are ubiquitous measures in transport geography and planning. However, typical street network data and typical street network analysis tools can substantially overcount them. This paper explains why this happens and introduces solutions to this problem. It presents the OSMnx package's algorithms to automatically simplify graph models of urban street networks -- via edge simplification and node consolidation -- resulting in faster, parsimonious models and more accurate network measures like intersection counts/densities, street segment lengths, and node degrees. Then it validates these algorithms and conducts a worldwide empirical assessment of count bias to quantify the motivating problem's prevalence. A full accounting of this bias and better methods to attenuate misrepresentations of intersections are necessary for data-driven, evidence-informed transport planning.
- [38] arXiv:2407.00271 (cross-list from math.DS) [pdf, html, other]
-
Title: Minimum Reduced-Order Models via Causal Inference
Subjects: Dynamical Systems (math.DS); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
Enhancing the sparsity of data-driven reduced-order models (ROMs) has gained increasing attention in recent years. In this work, we analyze an efficient approach to identifying skillful ROMs with a sparse structure using an information-theoretic indicator called causation entropy. The causation entropy quantifies in a statistical way the additional contribution of each term to the underlying dynamics beyond the information already captured by all the other terms in the ansatz. By doing so, the causation entropy assesses the importance of each term to the dynamics before a parameter estimation procedure is performed. Thus, the approach can be utilized to eliminate terms with little dynamic impact, leading to a parsimonious structure that retains the essential physics. To circumvent the difficulty of estimating high-dimensional probability density functions (PDFs) involved in the causation entropy computation, we leverage Gaussian approximations for such PDFs, which are demonstrated to be sufficient even in the presence of highly non-Gaussian dynamics. The effectiveness of the approach is illustrated by the Kuramoto-Sivashinsky equation by building sparse causation-based ROMs for various purposes, such as recovering long-term statistics and inferring unobserved dynamics via data assimilation with partial observations.
- [39] arXiv:2407.00307 (cross-list from math.OC) [pdf, html, other]
-
Title: Deterministic and Stochastic Frank-Wolfe Recursion on Probability Spaces
Subjects: Optimization and Control (math.OC); Computation (stat.CO)
Motivated by applications in emergency response and experimental design, we consider smooth stochastic optimization problems over probability measures supported on compact subsets of the Euclidean space. With the influence function as the variational object, we construct a deterministic Frank-Wolfe (dFW) recursion for probability spaces, made especially possible by a lemma that identifies a ``closed-form'' solution to the infinite-dimensional Frank-Wolfe sub-problem. Each iterate in dFW is expressed as a convex combination of the incumbent iterate and a Dirac measure concentrating on the minimum of the influence function at the incumbent iterate. To address common application contexts that have access only to Monte Carlo observations of the objective and influence function, we construct a stochastic Frank-Wolfe (sFW) variation that generates a random sequence of probability measures constructed using minima of increasingly accurate estimates of the influence function. We demonstrate that sFW's optimality gap sequence exhibits $O(k^{-1})$ iteration complexity almost surely and in expectation for smooth convex objectives, and $O(k^{-1/2})$ (in Frank-Wolfe gap) for smooth non-convex objectives. Furthermore, we show that an easy-to-implement fixed-step, fixed-sample version of (sFW) exhibits exponential convergence to $\varepsilon$-optimality. We end with a central limit theorem on the observed objective values at the sequence of generated random measures. To further intuition, we include several illustrative examples with exact influence function calculations.
- [40] arXiv:2407.00317 (cross-list from cs.IR) [pdf, html, other]
-
Title: Towards Statistically Significant Taxonomy Aware Co-location Pattern Detection
Comments: Accepted in The 16th Conference on Spatial Information Theory (COSIT) 2024
Subjects: Information Retrieval (cs.IR); Applications (stat.AP)
Given a collection of Boolean spatial feature types, their instances, a neighborhood relation (e.g., proximity), and a hierarchical taxonomy of the feature types, the goal is to find the subsets of feature types or their parents whose spatial interaction is statistically significant. This problem is for taxonomy-reliant applications such as ecology (e.g., finding new symbiotic relationships across the food chain), spatial pathology (e.g., immunotherapy for cancer), retail, etc. The problem is computationally challenging due to the exponential number of candidate co-location patterns generated by the taxonomy. Most approaches for co-location pattern detection overlook the hierarchical relationships among spatial features, and the statistical significance of the detected patterns is not always considered, leading to potential false discoveries. This paper introduces two methods for incorporating taxonomies and assessing the statistical significance of co-location patterns. The baseline approach iteratively checks the significance of co-locations between leaf nodes or their ancestors in the taxonomy. Using the Benjamini-Hochberg procedure, an advanced approach is proposed to control the false discovery rate. This approach effectively reduces the risk of false discoveries while maintaining the power to detect true co-location patterns. Experimental evaluation and case study results show the effectiveness of the approach.
- [41] arXiv:2407.00397 (cross-list from cs.LG) [pdf, html, other]
-
Title: Markovian Gaussian Process: A Universal State-Space Representation for Stationary Temporal Gaussian Process
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Gaussian Processes (GPs) and Linear Dynamical Systems (LDSs) are essential time series and dynamic system modeling tools. GPs can handle complex, nonlinear dynamics but are computationally demanding, while LDSs offer efficient computation but lack the expressive power of GPs. To combine their benefits, we introduce a universal method that allows an LDS to mirror stationary temporal GPs. This state-space representation, known as the Markovian Gaussian Process (Markovian GP), leverages the flexibility of kernel functions while maintaining efficient linear computation. Unlike existing GP-LDS conversion methods, which require separability for most multi-output kernels, our approach works universally for single- and multi-output stationary temporal kernels. We evaluate our method by computing covariance, performing regression tasks, and applying it to a neuroscience application, demonstrating that our method provides an accurate state-space representation for stationary temporal GPs.
- [42] arXiv:2407.00417 (cross-list from cs.CR) [pdf, html, other]
-
Title: Obtaining $(\epsilon,\delta)$-differential privacy guarantees when using a Poisson mechanism to synthesize contingency tables
Subjects: Cryptography and Security (cs.CR); Methodology (stat.ME)
We show that differential privacy type guarantees can be obtained when using a Poisson synthesis mechanism to protect counts in contingency tables. Specifically, we show how to obtain $(\epsilon, \delta)$-probabilistic differential privacy guarantees via the Poisson distribution's cumulative distribution function. We demonstrate this empirically with the synthesis of an administrative-type confidential database.
- [43] arXiv:2407.00471 (cross-list from math.NA) [pdf, html, other]
-
Title: A note on the relationship between PDE-based precision operators and Matérn covariances
Subjects: Numerical Analysis (math.NA); Probability (math.PR); Computation (stat.CO)
The purpose of this technical note is to summarize the relationship between the marginal variance and correlation length of a Gaussian random field with Matérn covariance and the coefficients of the corresponding partial-differential-equation (PDE)-based precision operator.
- [44] arXiv:2407.00490 (cross-list from cs.LG) [pdf, html, other]
-
Title: Toward Global Convergence of Gradient EM for Over-Parameterized Gaussian Mixture Models
Comments: 25 pages
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
We study the gradient Expectation-Maximization (EM) algorithm for Gaussian Mixture Models (GMM) in the over-parameterized setting, where a general GMM with $n>1$ components learns from data that are generated by a single ground truth Gaussian distribution. While results for the special case of 2-Gaussian mixtures are well-known, a general global convergence analysis for arbitrary $n$ remains unresolved and faces several new technical barriers since the convergence becomes sub-linear and non-monotonic. To address these challenges, we construct a novel likelihood-based convergence analysis framework and rigorously prove that gradient EM converges globally with a sublinear rate $O(1/\sqrt{t})$. This is the first global convergence result for Gaussian mixtures with more than $2$ components. The sublinear convergence rate is due to the algorithmic nature of learning over-parameterized GMM with gradient EM. We also identify a new emerging technical challenge for learning general over-parameterized GMM: the existence of bad local regions that can trap gradient EM for an exponential number of steps.
- [45] arXiv:2407.00492 (cross-list from cs.LG) [pdf, html, other]
-
Title: Fast Gibbs sampling for the local and global trend Bayesian exponential smoothing model
Subjects: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
In Smyl et al. [Local and global trend Bayesian exponential smoothing models. International Journal of Forecasting, 2024.], a generalised exponential smoothing model was proposed that is able to capture strong trends and volatility in time series. This method achieved state-of-the-art performance in many forecasting tasks, but its fitting procedure, which is based on the NUTS sampler, is very computationally expensive. In this work, we propose several modifications to the original model, as well as a bespoke Gibbs sampler for posterior exploration; these changes improve sampling time by an order of magnitude, thus rendering the model much more practically relevant. The new model, and sampler, are evaluated on the M3 dataset and are shown to be competitive, or superior, in terms of accuracy to the original method, while being substantially faster to run.
- [46] arXiv:2407.00529 (cross-list from cs.LG) [pdf, html, other]
-
Title: Detecting and Identifying Selection Structure in Sequential Data
Comments: ICML 2024
Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS); Statistics Theory (math.ST); Machine Learning (stat.ML)
We argue that the selective inclusion of data points based on latent objectives is common in practical situations, such as music sequences. Since this selection process often distorts statistical analysis, previous work primarily views it as a bias to be corrected and proposes various methods to mitigate its effect. However, while controlling this bias is crucial, selection also offers an opportunity to provide a deeper insight into the hidden generation process, as it is a fundamental mechanism underlying what we observe. In particular, overlooking selection in sequential data can lead to an incomplete or overcomplicated inductive bias in modeling, such as assuming a universal autoregressive structure for all dependencies. Therefore, rather than merely viewing it as a bias, we explore the causal structure of selection in sequential data to delve deeper into the complete causal process. Specifically, we show that selection structure is identifiable without any parametric assumptions or interventional experiments. Moreover, even in cases where selection variables coexist with latent confounders, we still establish the nonparametric identifiability under appropriate structural conditions. Meanwhile, we also propose a provably correct algorithm to detect and identify selection structures as well as other types of dependencies. The framework has been validated empirically on both synthetic data and real-world music.
- [47] arXiv:2407.00584 (cross-list from cs.LG) [pdf, html, other]
-
Title: Hyperparameter Optimization for Randomized Algorithms: A Case Study for Random Features
Subjects: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
Randomized algorithms exploit stochasticity to reduce computational complexity. One important example is random feature regression (RFR) that accelerates Gaussian process regression (GPR). RFR approximates an unknown function with a random neural network whose hidden weights and biases are sampled from a probability distribution. Only the final output layer is fit to data. In randomized algorithms like RFR, the hyperparameters that characterize the sampling distribution greatly impact performance, yet are not directly accessible from samples. This makes optimization of hyperparameters via standard (gradient-based) optimization tools inapplicable. Inspired by Bayesian ideas from GPR, this paper introduces a random objective function that is tailored for hyperparameter tuning of vector-valued random features. The objective is minimized with ensemble Kalman inversion (EKI). EKI is a gradient-free particle-based optimizer that is scalable to high-dimensions and robust to randomness in objective functions. A numerical study showcases the new black-box methodology to learn hyperparameter distributions in several problems that are sensitive to the hyperparameter selection: two global sensitivity analyses, integrating a chaotic dynamical system, and solving a Bayesian inverse problem from atmospheric dynamics. The success of the proposed EKI-based algorithm for RFR suggests its potential for automated optimization of hyperparameters arising in other randomized algorithms.
- [48] arXiv:2407.00706 (cross-list from cs.LG) [pdf, html, other]
-
Title: Sum-of-norms regularized Nonnegative Matrix Factorization
Comments: 22 pages, 12 figures
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
When applying nonnegative matrix factorization (NMF), the rank parameter is generally unknown. This rank, called the nonnegative rank, is usually estimated heuristically since computing its exact value is NP-hard. In this work, we propose an approximation method to estimate the rank on-the-fly while solving NMF. We use sum-of-norms (SON), a group-lasso structure that encourages pairwise similarity, to reduce the rank of a factor matrix where the rank is overestimated at the beginning. On various datasets, SON-NMF is able to reveal the correct nonnegative rank of the data without any prior knowledge or tuning.
SON-NMF is a nonconvex, nonsmooth, non-separable, non-proximable problem, so solving it is nontrivial. First, as rank estimation in NMF is NP-hard, the proposed approach does not enjoy a lower computational complexity. Using a graph-theoretic argument, we prove that the complexity of SON-NMF is almost irreducible. Second, the per-iteration cost of any algorithm solving SON-NMF is possibly high, which motivated us to propose a first-order BCD algorithm that approximately solves SON-NMF with a low per-iteration cost by using the proximal average operator. Lastly, we propose a simple greedy method for post-processing.
SON-NMF exhibits favourable features for applications. Besides the ability to automatically estimate the rank from data, SON-NMF can deal with rank-deficient data matrices and can detect weak components with small energy. Furthermore, in the application of hyperspectral imaging, SON-NMF handles the issue of spectral variability naturally.
- [49] arXiv:2407.00710 (cross-list from cs.LG) [pdf, html, other]
-
Title: Weighted Missing Linear Discriminant Analysis: An Explainable Approach for Classification with Missing DataSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
As Artificial Intelligence (AI) models are gradually being adopted in real-life applications, the explainability of the model used is critical, especially in high-stakes areas such as medicine and finance. Among the commonly used models, Linear Discriminant Analysis (LDA) is a widely used classification tool that is also explainable thanks to its ability to model class distributions and maximize class separation through linear feature combinations. Nevertheless, real-world data is frequently incomplete, presenting significant challenges for classification tasks and model explanations. In this paper, we propose a novel approach to LDA under missing data, termed \textbf{\textit{Weighted missing Linear Discriminant Analysis (WLDA)}}, which classifies observations in data that contains missing values directly, without imputation, by estimating the parameters on the incomplete data and using a weight matrix to penalize missing entries during classification. Furthermore, we analyze the theoretical properties and examine the explainability of the proposed technique in a comprehensive manner. Experimental results demonstrate that WLDA outperforms conventional methods by a significant margin, particularly in scenarios where missing values are present in both training and test sets.
- [50] arXiv:2407.00745 (cross-list from cs.LG) [pdf, other]
-
Title: Posterior Sampling with Denoising Oracles via Tilted TransportSubjects: Machine Learning (cs.LG); Probability (math.PR); Computation (stat.CO); Machine Learning (stat.ML)
Score-based diffusion models have significantly advanced high-dimensional data generation across various domains, by learning a denoising oracle (or score) from datasets. From a Bayesian perspective, they offer a realistic modeling of data priors and facilitate solving inverse problems through posterior sampling. Although many heuristic methods have been developed recently for this purpose, they lack the quantitative guarantees needed in many scientific applications.
In this work, we introduce the \textit{tilted transport} technique, which leverages the quadratic structure of the log-likelihood in linear inverse problems in combination with the prior denoising oracle to transform the original posterior sampling problem into a new `boosted' posterior that is provably easier to sample from. We quantify the conditions under which this boosted posterior is strongly log-concave, highlighting the dependencies on the condition number of the measurement matrix and the signal-to-noise ratio. The resulting posterior sampling scheme is shown to reach the computational threshold predicted for sampling Ising models [Kunisky'23] with a direct analysis, and is further validated on high-dimensional Gaussian mixture models and scalar field $\varphi^4$ models.
- [51] arXiv:2407.00746 (cross-list from math.NA) [pdf, html, other]
-
Title: Structured Sketching for Linear SystemsSubjects: Numerical Analysis (math.NA); Mathematical Software (cs.MS); Computation (stat.CO)
For linear systems $Ax=b$ we develop iterative algorithms based on a sketch-and-project approach. By using judicious choices for the sketch, such as the history of residuals, we develop weighting strategies that enable short recursive formulas. The proposed algorithms have a low memory footprint and iteration complexity compared to regular sketch-and-project methods. In a set of numerical experiments the new methods compare well to GMRES, SYMMLQ and state-of-the-art randomized solvers.
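For readers unfamiliar with the sketch-and-project approach mentioned in this abstract, the sketch below shows its simplest well-known instance, randomized Kaczmarz, where each sketch is a single row of $A$ and the iterate is projected onto that row's hyperplane. This is the plain baseline only, not the structured residual-history sketches or short recursions the paper develops; the function name and the test problem are assumptions made for illustration.

```python
import numpy as np

def randomized_kaczmarz(A, b, n_iters, rng):
    # Sketch-and-project with single-row sketches: at each step, pick row i
    # with probability proportional to ||a_i||^2 and project the current
    # iterate onto the hyperplane {x : a_i . x = b_i}.
    m, n = A.shape
    row_norms_sq = np.sum(A ** 2, axis=1)
    probs = row_norms_sq / row_norms_sq.sum()
    x = np.zeros(n)
    for _ in range(n_iters):
        i = rng.choice(m, p=probs)
        x += (b[i] - A[i] @ x) / row_norms_sq[i] * A[i]
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 5))
x_true = rng.normal(size=5)
b = A @ x_true  # consistent system, so the iteration converges to x_true

x_hat = randomized_kaczmarz(A, b, n_iters=2000, rng=rng)
error = float(np.linalg.norm(x_hat - x_true))
print("error:", error)
```

Each iteration touches one row of $A$ and stores only the current iterate, which is the kind of low memory footprint the abstract refers to.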
- [52] arXiv:2407.00765 (cross-list from cs.LG) [pdf, html, other]
-
Title: Structured and Balanced Multi-component and Multi-layer Neural NetworksComments: Our codes and implementation details are available at this https URLSubjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Numerical Analysis (math.NA); Machine Learning (stat.ML)
In this work, we propose a balanced multi-component and multi-layer neural network (MMNN) structure to approximate functions with complex features with both accuracy and efficiency in terms of degrees of freedom and computation cost. The main idea is motivated by a multi-component and multi-layer decomposition, in which each component can be approximated effectively by a single-layer network, following a "divide-and-conquer" type of strategy to deal with a complex function. Although MMNNs are an easy modification of fully connected neural networks (FCNNs) or multi-layer perceptrons (MLPs) through the introduction of balanced multi-component structures in the network, they achieve a significant reduction of training parameters, a much more efficient training process, and a much improved accuracy compared to FCNNs or MLPs. Extensive numerical experiments are presented to illustrate the effectiveness of MMNNs in approximating highly oscillatory functions and their automatic adaptivity in capturing localized features.
- [53] arXiv:2407.00927 (cross-list from cs.LG) [pdf, html, other]
-
Title: Learnability of Parameter-Bounded Bayes NetsComments: 15 pages, 2 figuresSubjects: Machine Learning (cs.LG); Computational Complexity (cs.CC); Machine Learning (stat.ML)
Bayes nets are extensively used in practice to efficiently represent joint probability distributions over a set of random variables and capture dependency relations. In a seminal paper, Chickering et al. (JMLR 2004) showed that given a distribution $P$, that is defined as the marginal distribution of a Bayes net, it is $\mathsf{NP}$-hard to decide whether there is a parameter-bounded Bayes net that represents $P$. They called this problem LEARN. In this work, we extend the $\mathsf{NP}$-hardness result of LEARN and prove the $\mathsf{NP}$-hardness of a promise search variant of LEARN, whereby the Bayes net in question is guaranteed to exist and one is asked to find such a Bayes net. We complement our hardness result with a positive result about the sample complexity that is sufficient to recover a parameter-bounded Bayes net that is close (in TV distance) to a given distribution $P$, that is represented by some parameter-bounded Bayes net, generalizing a degree-bounded sample complexity result of Brustle et al. (EC 2020).
- [54] arXiv:2407.00950 (cross-list from cs.LG) [pdf, html, other]
-
Title: Causal Bandits: The Pareto Optimal Frontier of Adaptivity, a Reduction to Linear Bandits, and Limitations around Unknown MarginalsComments: Accepted to ICML 2024Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In this work, we investigate the problem of adapting to the presence or absence of causal structure in multi-armed bandit problems. In addition to the usual reward signal, we assume the learner has access to additional variables, observed in each round after acting. When these variables $d$-separate the action from the reward, existing work in causal bandits demonstrates that one can achieve strictly better (minimax) rates of regret (Lu et al., 2020). Our goal is to adapt to this favorable "conditionally benign" structure, if it is present in the environment, while simultaneously recovering worst-case minimax regret, if it is not. Notably, the learner has no prior knowledge of whether the favorable structure holds. In this paper, we establish the Pareto optimal frontier of adaptive rates. We prove upper and matching lower bounds on the possible trade-offs in the performance of learning in conditionally benign and arbitrary environments, resolving an open question raised by Bilodeau et al. (2022). Furthermore, we are the first to obtain instance-dependent bounds for causal bandits, by reducing the problem to the linear bandit setting. Finally, we examine the common assumption that the marginal distributions of the post-action contexts are known and show that a nontrivial estimate is necessary for better-than-worst-case minimax rates.
- [55] arXiv:2407.00957 (cross-list from cs.NE) [pdf, html, other]
-
Title: Expressivity of Neural Networks with Random Weights and Learned BiasesEzekiel Williams, Avery Hee-Woon Ryoo, Thomas Jiralerspong, Alexandre Payeur, Matthew G. Perich, Luca Mazzucatto, Guillaume LajoieSubjects: Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
Landmark universal function approximation results for neural networks with trained weights and biases provided impetus for the ubiquitous use of neural networks as learning models in Artificial Intelligence (AI) and neuroscience. Recent work has pushed the bounds of universal approximation by showing that arbitrary functions can similarly be learned by tuning smaller subsets of parameters, for example the output weights, within randomly initialized networks. Motivated by the fact that biases can be interpreted as biologically plausible mechanisms for adjusting unit outputs in neural networks, such as tonic inputs or activation thresholds, we investigate the expressivity of neural networks with random weights where only biases are optimized. We provide theoretical and numerical evidence demonstrating that feedforward neural networks with fixed random weights can be trained to perform multiple tasks by learning biases only. We further show that an equivalent result holds for recurrent neural networks predicting dynamical system trajectories. Our results are relevant to neuroscience, where they demonstrate the potential for behaviourally relevant changes in dynamics without modifying synaptic weights, as well as for AI, where they shed light on multi-task methods such as bias fine-tuning and unit masking.
- [56] arXiv:2407.01111 (cross-list from cs.LG) [pdf, html, other]
-
Title: Proximity Matters: Local Proximity Preserved Balancing for Treatment Effect EstimationComments: Code is available at https://anonymous.4open.science/status/ncr-B697Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Heterogeneous treatment effect (HTE) estimation from observational data poses significant challenges due to treatment selection bias. Existing methods address this bias by minimizing distribution discrepancies between treatment groups in latent space, focusing on global alignment. However, the fruitful aspect of local proximity, where similar units exhibit similar outcomes, is often overlooked. In this study, we propose Proximity-aware Counterfactual Regression (PCR) to exploit proximity for representation balancing within the HTE estimation context. Specifically, we introduce a local proximity preservation regularizer based on optimal transport to depict the local proximity in discrepancy calculation. Furthermore, to overcome the curse of dimensionality that renders the estimation of discrepancy ineffective, exacerbated by limited data availability for HTE estimation, we develop an informative subspace projector, which trades off minimal distance precision for improved sample complexity. Extensive experiments demonstrate that PCR accurately matches units across different treatment groups, effectively mitigates treatment selection bias, and significantly outperforms competitors. Code is available at https://anonymous.4open.science/status/ncr-B697.
- [57] arXiv:2407.01115 (cross-list from cs.LG) [pdf, html, other]
-
Title: Enabling Mixed Effects Neural Networks for Diverse, Clustered Data Using Monte Carlo MethodsAndrej Tschalzev, Paul Nitschke, Lukas Kirchdorfer, Stefan Lüdtke, Christian Bartelt, Heiner StuckenschmidtSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Neural networks often assume independence among input data samples, disregarding correlations arising from inherent clustering patterns in real-world datasets (e.g., due to different sites or repeated measurements). Recently, mixed effects neural networks (MENNs) which separate cluster-specific 'random effects' from cluster-invariant 'fixed effects' have been proposed to improve generalization and interpretability for clustered data. However, existing methods only allow for approximate quantification of cluster effects and are limited to regression and binary targets with only one clustering feature. We present MC-GMENN, a novel approach employing Monte Carlo methods to train Generalized Mixed Effects Neural Networks. We empirically demonstrate that MC-GMENN outperforms existing mixed effects deep learning models in terms of generalization performance, time complexity, and quantification of inter-cluster variance. Additionally, MC-GMENN is applicable to a wide range of datasets, including multi-class classification tasks with multiple high-cardinality categorical features. For these datasets, we show that MC-GMENN outperforms conventional encoding and embedding methods, simultaneously offering a principled methodology for interpreting the effects of clustering patterns.
- [58] arXiv:2407.01171 (cross-list from cs.LG) [pdf, html, other]
-
Title: Neural Conditional Probability for InferenceVladimir R. Kostic, Karim Lounici, Gregoire Pacreau, Pietro Novelli, Giacomo Turri, Massimiliano PontilSubjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
We introduce NCP (Neural Conditional Probability), a novel operator-theoretic approach for learning conditional distributions with a particular focus on inference tasks. NCP can be used to build conditional confidence regions and extract important statistics like conditional quantiles, mean, and covariance. It offers streamlined learning through a single unconditional training phase, facilitating efficient inference without the need for retraining even when conditioning changes. By tapping into the powerful approximation capabilities of neural networks, our method efficiently handles a wide variety of complex probability distributions, effectively dealing with nonlinear relationships between input and output variables. Theoretical guarantees ensure both optimization consistency and statistical accuracy of the NCP method. Our experiments show that our approach matches or beats leading methods using a simple Multi-Layer Perceptron (MLP) with two hidden layers and GELU activations. This demonstrates that a minimalistic architecture with a theoretically grounded loss function can achieve competitive results without sacrificing performance, even in the face of more complex architectures.
- [59] arXiv:2407.01316 (cross-list from cs.LG) [pdf, html, other]
-
Title: Evaluating Model Performance Under Worst-case SubpopulationsComments: Earlier version appeared in the proceedings of Advances in Neural Information Processing Systems 34 (NeurIPS 2021): this https URLSubjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Machine Learning (stat.ML)
The performance of ML models degrades when the training population is different from that seen under operation. Towards assessing distributional robustness, we study the worst-case performance of a model over all subpopulations of a given size, defined with respect to core attributes Z. This notion of robustness can consider arbitrary (continuous) attributes Z, and automatically accounts for complex intersectionality in disadvantaged groups. We develop a scalable yet principled two-stage estimation procedure that can evaluate the robustness of state-of-the-art models. We prove that our procedure enjoys several finite-sample convergence guarantees, including dimension-free convergence. Instead of overly conservative notions based on Rademacher complexities, our evaluation error depends on the dimension of Z only through the out-of-sample error in estimating the performance conditional on Z. On real datasets, we demonstrate that our method certifies the robustness of a model and prevents deployment of unreliable models.
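To make the notion of worst-case subpopulation performance concrete, the sketch below assumes the conditional loss given Z has already been estimated for each example (the first stage of a two-stage procedure) and then averages the largest `alpha` fraction of those values. This is a simplified plug-in illustration of the idea, not the paper's estimator or its guarantees; the function name and the toy numbers are hypothetical.

```python
import numpy as np

def worst_case_subpopulation_loss(conditional_losses, alpha):
    # Average of the worst ceil(alpha * n) conditional losses: an empirical
    # stand-in for the worst mean loss over subpopulations that make up at
    # least a fraction alpha of the data.
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    values = np.sort(np.asarray(conditional_losses, dtype=float))[::-1]
    k = int(np.ceil(alpha * len(values)))
    return float(values[:k].mean())

# Hypothetical estimated conditional losses for four examples.
estimated = np.array([1.0, 2.0, 3.0, 4.0])
worst_half = worst_case_subpopulation_loss(estimated, alpha=0.5)
overall = worst_case_subpopulation_loss(estimated, alpha=1.0)
print(worst_half, overall)
```

With `alpha=1.0` the quantity reduces to the ordinary average loss, and it grows as `alpha` shrinks, reflecting increasingly adversarial subpopulations.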
- [60] arXiv:2407.01371 (cross-list from cs.LG) [pdf, html, other]
-
Title: Binary Losses for Density Ratio EstimationSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Estimating the ratio of two probability densities from finitely many observations of the densities is a central problem in machine learning and statistics. A large class of methods constructs estimators from binary classifiers which distinguish observations from the two densities. However, the error of these constructions depends on the choice of the binary loss function, raising the question of which loss function to choose based on desired error properties. In this work, we start from prescribed error measures in a class of Bregman divergences and characterize all loss functions that lead to density ratio estimators with a small error. Our characterization provides a simple recipe for constructing loss functions with certain properties, such as loss functions that prioritize an accurate estimation of large values. This contrasts with classical loss functions, such as the logistic loss or boosting loss, which prioritize accurate estimation of small values. We provide numerical illustrations with kernel methods and test their performance in applications of parameter selection for deep domain adaptation.
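The classifier-based construction mentioned in this abstract can be sketched with the classical logistic loss: train a classifier to separate samples from the two densities, then read the density ratio off the fitted logit. The example below is a minimal sketch under assumed conditions (two unit-variance Gaussians with equal sample sizes, a linear logit, Newton iterations); it illustrates the classical baseline, not the new loss functions the paper characterizes.

```python
import numpy as np

def fit_logistic(X, y, n_steps=25):
    # Logistic regression by Newton's method (minimizes the logistic loss).
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (p - y)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        w -= np.linalg.solve(hess, grad)
    return w

rng = np.random.default_rng(0)
n = 5000
x_p = rng.normal(1.0, 1.0, size=n)  # samples from the numerator density p
x_q = rng.normal(0.0, 1.0, size=n)  # samples from the denominator density q

x = np.concatenate([x_p, x_q])
X = np.column_stack([np.ones_like(x), x])          # intercept + feature
y = np.concatenate([np.ones(n), np.zeros(n)])      # label 1 = drawn from p

w = fit_logistic(X, y)

def ratio_estimate(x_new):
    # With equal sample sizes, the fitted logit estimates log(p(x) / q(x)).
    return np.exp(w[0] + w[1] * x_new)

# For these Gaussians the true log-ratio is x - 0.5.
print("intercept, slope:", w)
```

The choice of loss decides where the estimate is most accurate, which is the question the abstract raises.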
- [61] arXiv:2407.01526 (cross-list from cs.LG) [pdf, other]
-
Title: Scalable Nested Optimization for Deep LearningComments: View more research details at this https URLSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC); Machine Learning (stat.ML)
Gradient-based optimization has been critical to the success of machine learning, updating a single set of parameters to minimize a single loss. A growing number of applications rely on a generalization of this, where we have a bilevel or nested optimization in which subsets of parameters update on different objectives nested inside each other. We focus on motivating examples of hyperparameter optimization and generative adversarial networks. However, naively applying classical methods often fails when we look at solving these nested problems on a large scale. In this thesis, we build tools for nested optimization that scale to deep learning setups.
Cross submissions for Tuesday, 2 July 2024 (showing 31 of 31 entries)
- [62] arXiv:2010.00729 (replaced) [pdf, html, other]
-
Title: Individual-centered partial information in social networksSubjects: Methodology (stat.ME)
In statistical network analysis, we often assume either the full network is available or multiple subgraphs can be sampled to estimate various global properties of the network. However, in a real social network, people frequently make decisions based on their local view of the network alone. Here, we consider a partial information framework that characterizes the local network centered at a given individual by path length $L$ and gives rise to a partial adjacency matrix. Under $L=2$, we focus on the problem of (global) community detection using the popular stochastic block model (SBM) and its degree-corrected variant (DCSBM). We derive theoretical properties of the eigenvalues and eigenvectors from the signal term of the partial adjacency matrix and propose new spectral-based community detection algorithms that achieve consistency under appropriate conditions. Our analysis also allows us to propose a new centrality measure that assesses the importance of an individual's partial information in determining global community structure. Using simulated and real networks, we demonstrate the performance of our algorithms and compare our centrality measure with other popular alternatives to show it captures unique nodal information. Our results illustrate that the partial information framework enables us to compare the viewpoints of different individuals regarding the global structure.
- [63] arXiv:2106.09499 (replaced) [pdf, html, other]
-
Title: Maximum Entropy Spectral Analysis: an application to gravitational waves data analysisComments: 15 pages, 11 figuresSubjects: Methodology (stat.ME); Instrumentation and Methods for Astrophysics (astro-ph.IM); Data Analysis, Statistics and Probability (physics.data-an)
The Maximum Entropy Spectral Analysis (MESA) method, developed by Burg, offers a powerful tool for spectral estimation of a time-series. It relies on Jaynes' maximum entropy principle, allowing the spectrum of a stochastic process to be inferred using the coefficients of an autoregressive process AR($p$) of order $p$. A closed-form recursive solution provides estimates for both the autoregressive coefficients and the order $p$ of the process. We provide a ready-to-use implementation of this algorithm in a Python package called \texttt{memspectrum}, characterized through power spectral density (PSD) analysis on synthetic data with known PSD and comparisons of different criteria for stopping the recursion. Additionally, we compare the performance of our implementation with the ubiquitous Welch algorithm, using synthetic data generated from the GW150914 strain spectrum released by the LIGO-Virgo-Kagra collaboration. Our findings indicate that Burg's method provides PSD estimates with systematically lower variance and bias. This is particularly manifest in the case of a small (O($5000$)) number of data points, making Burg's method most suitable to work in this regime. Since this is close to the typical length of analysed gravitational waves data, improving the estimate of the PSD in this regime leads to more reliable posterior profiles for the system under study. We conclude our investigation by utilising MESA, and its particularly easy parametrisation where the only free parameter is the order $p$ of the AR process, to marginalise over the interferometers noise PSD in conjunction with inferring the parameters of GW150914.
- [64] arXiv:2211.07866 (replaced) [pdf, html, other]
-
Title: Efficient Estimation for Longitudinal Networks via Adaptive MergingComments: 30 pages and 4 figures; appendix including technical proof will be uploaded laterSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
A longitudinal network consists of a sequence of temporal edges among multiple nodes, where the temporal edges are observed in real time. Such networks have become ubiquitous with the rise of online social platforms and e-commerce, but remain largely under-investigated in the literature. In this paper, we propose an efficient estimation framework for longitudinal networks, leveraging the strengths of adaptive network merging, tensor decomposition and point processes. It merges neighboring sparse networks so as to enlarge the number of observed edges and reduce estimation variance, whereas the estimation bias introduced by network merging is controlled by exploiting local temporal structures for adaptive network neighborhood. A projected gradient descent algorithm is proposed to facilitate estimation, where the upper bound of the estimation error in each iteration is established. A thorough analysis is conducted to quantify the asymptotic behavior of the proposed method, which shows that it can significantly reduce the estimation error and also provides guidelines for network merging under various scenarios. We further demonstrate the advantage of the proposed method through extensive numerical experiments on synthetic datasets and a militarized interstate dispute dataset.
- [65] arXiv:2211.10776 (replaced) [pdf, html, other]
-
Title: Bayesian Modal Regression based on Mixture DistributionsComments: 44 pages, 16 figuresSubjects: Methodology (stat.ME); Applications (stat.AP)
Compared to mean regression and quantile regression, the literature on modal regression is very sparse. A unifying framework for Bayesian modal regression is proposed, based on a family of unimodal distributions indexed by the mode, along with other parameters that allow for flexible shapes and tail behaviors. Sufficient conditions for posterior propriety under an improper prior on the mode parameter are derived. Following prior elicitation, regression analyses of simulated data and datasets from several real-life applications are conducted. Besides drawing inference for covariate effects that are easy to interpret, prediction and model selection under the proposed Bayesian modal regression framework are also considered. Evidence from these analyses suggests that the proposed inference procedures are very robust to outliers, enabling one to discover interesting covariate effects missed by mean or median regression, and to construct much tighter prediction intervals than those from mean or median regression. Computer programs for implementing the proposed Bayesian modal regression are available at this https URL.
- [66] arXiv:2212.01699 (replaced) [pdf, html, other]
-
Title: Parametric Modal Regression with Error in CovariatesComments: 17 pages, 3 figuresSubjects: Methodology (stat.ME)
An inference procedure is proposed to provide consistent estimators of parameters in a modal regression model with a covariate prone to measurement error. A score-based diagnostic tool exploiting parametric bootstrap is developed to assess adequacy of parametric assumptions imposed on the regression model. The proposed estimation method and diagnostic tool are applied to synthetic data generated from simulation experiments and data from real-world applications to demonstrate their implementation and performance. These empirical examples illustrate the importance of adequately accounting for measurement error in the error-prone covariate when inferring the association between a response and covariates based on a modal regression model that is especially suitable for skewed and heavy-tailed response data.
- [67] arXiv:2212.01832 (replaced) [pdf, html, other]
-
Title: The flexible Gumbel distribution: A new model for inference about the modeComments: 15 pages, 3 figuresSubjects: Methodology (stat.ME)
A new unimodal distribution family indexed by the mode and three other parameters is derived from a mixture of a Gumbel distribution for the maximum and a Gumbel distribution for the minimum. Properties of the proposed distribution are explored, including model identifiability and flexibility in capturing heavy-tailed data that exhibit different directions of skewness over a wide range. Both frequentist and Bayesian methods are developed to infer parameters in the new distribution. Simulation studies are conducted to demonstrate satisfactory performance of both methods. By fitting the proposed model to simulated data and data from an application in hydrology, it is shown that the proposed flexible distribution is especially suitable for data containing extreme values in either direction, with the mode being a location parameter of interest. Using the proposed unimodal distribution, one can easily formulate a regression model concerning the mode of a response given covariates. We apply this model to data from an application in criminology to reveal interesting data features that are obscured by outliers. Computer programs for implementing all considered inference methods in the study are available at this https URL.
- [68] arXiv:2212.04746 (replaced) [pdf, html, other]
-
Title: Model-based clustering of categorical data based on the Hamming distanceSubjects: Methodology (stat.ME)
A model-based approach is developed for clustering categorical data with no natural ordering. The proposed method exploits the Hamming distance to define a family of probability mass functions to model the data. The elements of this family are then considered as kernels of a finite mixture model with an unknown number of components.
Conjugate Bayesian inference has been derived for the parameters of the Hamming distribution model. The mixture is framed in a Bayesian nonparametric setting, and a transdimensional blocked Gibbs sampler is developed to provide full Bayesian inference on the number of clusters, their structure, and the group-specific parameters, facilitating the computation with respect to customary reversible jump algorithms. The proposed model encompasses a parsimonious latent class model as a special case when the number of components is fixed. Model performances are assessed via a simulation study and reference datasets, showing improvements in clustering recovery over existing approaches.
- [69] arXiv:2301.04625 (replaced) [pdf, html, other]
-
Title: Enhanced Response Envelope via Envelope RegularizationSubjects: Methodology (stat.ME)
The response envelope model provides substantial efficiency gains over the standard multivariate linear regression by identifying the material part of the response to the model and by excluding the immaterial part. In this paper, we propose the enhanced response envelope by incorporating a novel envelope regularization term based on a nonconvex manifold formulation. It is shown that the enhanced response envelope can yield better prediction risk than the original envelope estimator. The enhanced response envelope naturally handles high-dimensional data for which the original response envelope is not serviceable without necessary remedies. In an asymptotic high-dimensional regime where the ratio of the number of predictors over the number of samples converges to a non-zero constant, we characterize the risk function and reveal an interesting double descent phenomenon for the envelope model. A simulation study confirms our main theoretical findings. Simulations and real data applications demonstrate that the enhanced response envelope does have significantly improved prediction performance over the original envelope method, especially when the number of predictors is close to or moderately larger than the number of samples. Proofs and additional simulation results are shown in the supplementary file to this paper.
- [70] arXiv:2301.13088 (replaced) [pdf, other]
-
Title: Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces II: non-compact symmetric spacesSubjects: Methodology (stat.ME); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Gaussian processes are arguably the most important class of spatiotemporal models within machine learning. They encode prior information about the modeled function and can be used for exact or approximate Bayesian learning. In many applications, particularly in physical sciences and engineering, but also in areas such as geostatistics and neuroscience, invariance to symmetries is one of the most fundamental forms of prior information one can consider. The invariance of a Gaussian process' covariance to such symmetries gives rise to the most natural generalization of the concept of stationarity to such spaces. In this work, we develop constructive and practical techniques for building stationary Gaussian processes on a very large class of non-Euclidean spaces arising in the context of symmetries. Our techniques make it possible to (i) calculate covariance kernels and (ii) sample from prior and posterior Gaussian processes defined on such spaces, both in a practical manner. This work is split into two parts, each involving different technical considerations: part I studies compact spaces, while part II studies non-compact spaces possessing certain structure. Our contributions make the non-Euclidean Gaussian process models we study compatible with well-understood computational techniques available in standard Gaussian process software packages, thereby making them accessible to practitioners.
- [71] arXiv:2303.09644 (replaced) [pdf, html, other]
-
Title: Linear parametric model checks for functional time seriesSubjects: Statistics Theory (math.ST)
The presented methodology for testing the goodness-of-fit of an Autoregressive Hilbertian model (ARH(1) model) provides an infinite-dimensional formulation of the approach proposed in Koul and Stute (1999), based on an empirical process marked by residuals. Applying a central and functional central limit result for Hilbert-valued martingale difference sequences, the asymptotic behavior of the formulated H-valued empirical process, also indexed by H, is obtained under the null hypothesis. The limiting process is an H-valued generalized (i.e., indexed by H) Wiener process, leading to an asymptotically distribution-free test. Consistency of the test is also proved. The case of a misspecified autocorrelation operator of the ARH(1) process is addressed. The asymptotic equivalence in probability, uniformly in the norm of H, of the empirical processes formulated under known and unknown autocorrelation operators is obtained. Beyond the Euclidean setting, this approach allows goodness-of-fit testing to be implemented in the context of manifold and spherical functional autoregressive processes.
- [72] arXiv:2303.10016 (replaced) [pdf, html, other]
-
Title: Improving instrumental variable estimators with post-stratification
Subjects: Methodology (stat.ME)
Experiments studying get-out-the-vote (GOTV) efforts estimate the causal effect of various mobilization efforts on voter turnout. However, there is often substantial noncompliance in these studies. A usual approach is to use an instrumental variable (IV) analysis to estimate impacts for compliers, here being those actually contacted by the investigators. Unfortunately, popular IV estimators can be unstable in studies with a small fraction of compliers. We explore post-stratifying the data (e.g., taking a weighted average of IV estimates within each stratum) using variables that predict complier status (and, potentially, the outcome) to mitigate this. We present the benefits of post-stratification in terms of bias, variance, and improved standard error estimates, and provide a finite-sample asymptotic variance formula. We also compare the performance of different IV approaches and discuss the advantages of our design-based post-stratification approach over incorporating compliance-predictive covariates into the two-stage least squares estimator. In the end, we show that covariates predictive of compliance can increase precision, but only if one is willing to make a bias-variance trade-off by down-weighting or dropping strata with few compliers. By contrast, standard approaches such as two-stage least squares fail to use such information. We finally examine the benefits of our approach in two GOTV applications.
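The "weighted average of IV estimates within each stratum" can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of weights (each stratum's estimated number of compliers), and the `min_compliance` cutoff for dropping strata are all assumptions made for illustration, and the toy data are invented.

```python
from collections import defaultdict


def _mean(values):
    return sum(values) / len(values)


def itt_effects(rows):
    """Return (ITT effect on outcome, ITT effect on treatment receipt).

    Each row is (z, d, y): random assignment, treatment received, outcome.
    """
    assigned = [r for r in rows if r[0] == 1]
    control = [r for r in rows if r[0] == 0]
    itt_y = _mean([y for _, _, y in assigned]) - _mean([y for _, _, y in control])
    itt_d = _mean([d for _, d, _ in assigned]) - _mean([d for _, d, _ in control])
    return itt_y, itt_d


def post_stratified_iv(strata, z, d, y, min_compliance=0.0):
    """Weighted average of within-stratum Wald IV estimates.

    Assumption: weights are the estimated complier counts n_s * itt_d_s.
    Strata whose estimated compliance rate is <= min_compliance are dropped.
    """
    groups = defaultdict(list)
    for s, zi, di, yi in zip(strata, z, d, y):
        groups[s].append((zi, di, yi))

    weighted_sum = 0.0
    total_weight = 0.0
    for rows in groups.values():
        itt_y, itt_d = itt_effects(rows)
        if itt_d <= min_compliance:
            continue  # drop strata with too few estimated compliers
        weight = len(rows) * itt_d  # hypothetical weighting scheme
        weighted_sum += weight * (itt_y / itt_d)  # Wald estimate in this stratum
        total_weight += weight
    if total_weight == 0.0:
        raise ValueError("no stratum passed the compliance cutoff")
    return weighted_sum / total_weight


# Invented toy data: two strata of four units each.
strata = ["a"] * 4 + ["b"] * 4
z = [1, 1, 0, 0, 1, 1, 0, 0]
d = [1, 1, 0, 0, 1, 0, 0, 0]
y = [3, 3, 1, 1, 5, 1, 1, 1]
estimate = post_stratified_iv(strata, z, d, y)
```

Raising `min_compliance` mimics the bias-variance trade-off mentioned in the abstract: low-compliance strata, whose Wald estimates are the noisiest, stop contributing to the average.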
- [73] arXiv:2306.00541 (replaced) [pdf, html, other]
-
Title: Decomposing Global Feature Effects Based on Feature Interactions
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Global feature effect methods, such as partial dependence plots, provide an intelligible visualization of the expected marginal feature effect. However, such global feature effect methods can be misleading, as they do not represent local feature effects of single observations well when feature interactions are present. We formally introduce generalized additive decomposition of global effects (GADGET), which is a new framework based on recursive partitioning to find interpretable regions in the feature space such that the interaction-related heterogeneity of local feature effects is minimized. We provide a mathematical foundation of the framework and show that it is applicable to the most popular methods to visualize marginal feature effects, namely partial dependence, accumulated local effects, and Shapley additive explanations (SHAP) dependence. Furthermore, we introduce and validate a new permutation-based interaction test to detect significant feature interactions that is applicable to any feature effect method that fits into our proposed framework. We empirically evaluate the theoretical characteristics of the proposed methods based on various feature effect methods in different experimental settings. Moreover, we apply our introduced methodology to three real-world examples to showcase their usefulness.
- [74] arXiv:2306.01211 (replaced) [pdf, html, other]
-
Title: Priming bias versus post-treatment bias in experimental designs
Comments: 32 pages (main text), 22 pages (supplementary materials), 5 figures
Subjects: Methodology (stat.ME); Applications (stat.AP)
Conditioning on variables affected by treatment can induce post-treatment bias when estimating causal effects. Although this suggests that researchers should measure potential moderators before administering the treatment in an experiment, doing so may also bias causal effect estimation if the covariate measurement primes respondents to react differently to the treatment. This paper formally analyzes this trade-off between post-treatment and priming biases in three experimental designs that vary when moderators are measured: pre-treatment, post-treatment, or a randomized choice between the two. We derive nonparametric bounds for interactions between the treatment and the moderator under each design and show how to use substantive assumptions to narrow these bounds. These bounds allow researchers to assess the sensitivity of their empirical findings to either source of bias. We then apply the proposed methodology to a survey experiment on electoral messaging.
- [75] arXiv:2307.02331 (replaced) [pdf, html, other]
-
Title: Differential recall bias in estimating treatment effects in observational studies
Comments: 26 pages, 2 figures, 3 tables. Supplementary materials are available. The R files are available at this https URL
Subjects: Methodology (stat.ME)
Observational studies are frequently used to estimate the effect of an exposure or treatment on an outcome. To obtain an unbiased estimate of the treatment effect, it is crucial to measure the exposure accurately. A common type of exposure misclassification is recall bias, which occurs in retrospective cohort studies when study subjects may inaccurately recall their past exposure. Particularly challenging is differential recall bias in the context of self-reported binary exposures, where the bias may be directional rather than random, and its extent varies according to the outcomes experienced. This paper makes several contributions: (1) it establishes bounds for the average treatment effect (ATE) even when a validation study is not available; (2) it proposes multiple estimation methods across various strategies predicated on different assumptions; and (3) it suggests a sensitivity analysis technique to assess the robustness of the causal conclusion, incorporating insights from prior research. The effectiveness of these methods is demonstrated through simulation studies that explore various model misspecification scenarios. These approaches are then applied to investigate the effect of childhood physical abuse on mental health in adulthood.
- [76] arXiv:2308.01198 (replaced) [pdf, other]
-
Title: Analyzing the Reporting Error of Public Transport Trips in the Danish National Travel Survey Using Smart Card Data
Comments: 38 pages, 18 figures, 12 tables
Subjects: Applications (stat.AP); Econometrics (econ.EM); Other Statistics (stat.OT)
Household travel surveys have been used for decades to collect data on individuals' and households' travel behavior. However, self-reported surveys are subject to recall bias, as respondents might struggle to recall and report their activities accurately. This study examines the time reporting error of public transit users in a nationwide household travel survey by matching, at the individual level, five consecutive years of data from two sources, namely the Danish National Travel Survey (TU) and the Danish Smart Card system (Rejsekort). Survey respondents are matched with travel cards from the Rejsekort data solely based on the respondents' declared spatiotemporal travel behavior. Approximately 70% of the respondents were successfully matched with Rejsekort travel cards. The findings reveal a median time reporting error of 11.34 minutes, with an interquartile range of 28.14 minutes. Furthermore, a statistical analysis was performed to explore the relationships between the survey respondents' reporting error and their socio-economic and demographic characteristics. The results indicate that females and respondents with a fixed schedule are in general more accurate than males and respondents with a flexible schedule in reporting their times of travel. Moreover, trips reported during weekdays or via the internet displayed higher accuracies compared to trips reported during weekends and holidays or via telephone interviews. This disaggregated analysis provides valuable insights that could help in improving the design and analysis of travel surveys, as well as accounting for reporting errors/biases in travel survey-based applications. Furthermore, it offers valuable insights underlying the psychology of travel recall by survey respondents.
- [77] arXiv:2308.09790 (replaced) [pdf, html, other]
-
Title: A Two-Part Machine Learning Approach to Characterizing Network Interference in A/B Testing
Comments: 47 pages
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
The reliability of controlled experiments, commonly referred to as "A/B tests," is often compromised by network interference, where the outcomes of individual units are influenced by interactions with others. Significant challenges in this domain include the lack of accounting for complex social network structures and the difficulty in suitably characterizing network interference. To address these challenges, we propose a machine learning-based method. We introduce "causal network motifs" and utilize transparent machine learning models to characterize network interference patterns underlying an A/B test on networks. Our method's performance has been demonstrated through simulations on both a synthetic experiment and a large-scale test on Instagram. Our experiments show that our approach outperforms conventional methods such as design-based cluster randomization and conventional analysis-based neighborhood exposure mapping. Our approach provides a comprehensive and automated solution to address network interference for A/B testing practitioners. This aids in informing strategic business decisions in areas such as marketing effectiveness and product customization.
- [78] arXiv:2309.04742 (replaced) [pdf, html, other]
-
Title: Affine Invariant Ensemble Transform Methods to Improve Predictive Uncertainty in Neural Networks
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
We consider the problem of performing Bayesian inference for logistic regression using appropriate extensions of the ensemble Kalman filter. We propose two interacting particle systems that sample from an approximate posterior and prove quantitative convergence rates of these interacting particle systems to their mean-field limit as the number of particles tends to infinity. Furthermore, we apply these techniques and examine their effectiveness as methods of Bayesian approximation for quantifying predictive uncertainty in neural networks.
- [79] arXiv:2310.01575 (replaced) [pdf, html, other]
-
Title: Derivation of outcome-dependent dietary patterns for low-income women obtained from survey data using a Supervised Weighted Overfitted Latent Class Analysis
Comments: 16 pages, 8 tables, 7 figures
Subjects: Methodology (stat.ME); Applications (stat.AP)
Poor diet quality is a key modifiable risk factor for hypertension and disproportionately impacts low-income women. Analyzing diet-driven hypertensive outcomes in this demographic is challenging due to the complexity of dietary data and selection bias when the data come from surveys, a main data source for understanding diet-disease relationships in understudied populations. Supervised Bayesian model-based clustering methods summarize dietary data into latent patterns that holistically capture relationships among foods and a known health outcome but do not sufficiently account for complex survey design. This leads to biased estimation and inference and lack of generalizability of the patterns. To address this, we propose a supervised weighted overfitted latent class analysis (SWOLCA) based on a Bayesian pseudo-likelihood approach that integrates sampling weights into an exposure-outcome model for discrete data. Our model adjusts for stratification, clustering, and informative sampling, and handles modifying effects via interaction terms within a Markov chain Monte Carlo Gibbs sampling algorithm. Simulation studies confirm that the SWOLCA model exhibits good performance in terms of bias, precision, and coverage. Using data from the National Health and Nutrition Examination Survey (2015-2018), we demonstrate the utility of our model by characterizing dietary patterns associated with hypertensive outcomes among low-income women in the United States.
- [80] arXiv:2310.10393 (replaced) [pdf, html, other]
-
Title: Statistical and Causal Robustness for Causal Null Hypothesis Tests
Subjects: Methodology (stat.ME)
Prior work applying semiparametric theory to causal inference has primarily focused on deriving estimators that exhibit statistical robustness under a prespecified causal model that permits identification of a desired causal parameter. However, a fundamental challenge is correct specification of such a model, which usually involves making untestable assumptions. Evidence factors is an approach to combining hypothesis tests of a common causal null hypothesis under two or more candidate causal models. Under certain conditions, this yields a test that is valid if at least one of the underlying models is correct, which is a form of causal robustness. We propose a method of combining semiparametric theory with evidence factors. We develop a causal null hypothesis test based on joint asymptotic normality of K asymptotically linear semiparametric estimators, where each estimator is based on a distinct identifying functional derived from each of K candidate causal models. We show that this test provides both statistical and causal robustness in the sense that it is valid if at least one of the K proposed causal models is correct, while also allowing for slower than parametric rates of convergence in estimating nuisance functions. We demonstrate the effectiveness of our method via simulations and applications to the Framingham Heart Study and Wisconsin Longitudinal Study.
- [81] arXiv:2310.12806 (replaced) [pdf, html, other]
-
Title: DCSI -- An improved measure of cluster separability based on separation and connectedness
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Whether class labels in a given data set correspond to meaningful clusters is crucial for the evaluation of clustering algorithms using real-world data sets. This property can be quantified by separability measures. The central aspects of separability for density-based clustering are between-class separation and within-class connectedness, and neither classification-based complexity measures nor cluster validity indices (CVIs) adequately incorporate them. A newly developed measure (density cluster separability index, DCSI) aims to quantify these two characteristics and can also be used as a CVI. Extensive experiments on synthetic data indicate that DCSI correlates strongly with the performance of DBSCAN measured via the adjusted Rand index (ARI) but lacks robustness when it comes to multi-class data sets with overlapping classes that are ill-suited for density-based hard clustering. Detailed evaluation on frequently used real-world data sets shows that DCSI can correctly identify touching or overlapping classes that do not correspond to meaningful density-based clusters.
- [82] arXiv:2311.01053 (replaced) [pdf, html, other]
-
Title: A Regression-Based Approach to the CO2 Airborne Fraction: Enhancing Statistical Precision and Tackling Zero Emissions
Subjects: Applications (stat.AP); Statistics Theory (math.ST)
The global fraction of anthropogenically emitted carbon dioxide (CO$_2$) that stays in the atmosphere, the CO$_2$ airborne fraction, has been fluctuating around a constant value over the period 1959 to 2022. The consensus estimate of the airborne fraction is around $44\%$; the remaining $56\%$ is absorbed by the oceanic and terrestrial biospheres. In this study, we show that the conventional estimator of the airborne fraction, based on a ratio of changes in atmospheric CO$_2$ concentrations and CO$_2$ emissions, suffers from a number of statistical deficiencies, such as non-existence of moments and a non-Gaussian limiting distribution. We propose an alternative regression-based estimator of the airborne fraction that does not suffer from these deficiencies. We show that the regression-based estimator has a Gaussian limiting distribution and reduces estimation uncertainty substantially. Our empirical analysis leads to an estimate of the airborne fraction over 1959--2022 of $47.0\%$ ($\pm 1.1\%$; $1 \sigma$), implying a higher, and better constrained, estimate than the current consensus. Using climate model output, we show that a regression-based approach provides sensible estimates of the airborne fraction, also in future scenarios where emissions are at or near zero.
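The contrast between a ratio-based and a regression-based estimator can be sketched as follows. The abstract does not give the exact specification of either estimator, so both functions are assumptions: the first averages year-by-year ratios, and the second is a least-squares slope through the origin. The two data points are invented and chosen only to show what happens when emissions are near zero.

```python
def ratio_estimator(delta_c, emissions):
    """Mean of the per-period ratios delta_c[t] / emissions[t] (assumed form)."""
    ratios = [dc / e for dc, e in zip(delta_c, emissions)]
    return sum(ratios) / len(ratios)


def regression_estimator(delta_c, emissions):
    """Least-squares slope of delta_c on emissions, no intercept (assumed form)."""
    numerator = sum(dc * e for dc, e in zip(delta_c, emissions))
    denominator = sum(e * e for e in emissions)
    return numerator / denominator


# Invented data: one ordinary period and one with near-zero emissions, where a
# small disturbance in the concentration change dominates the per-period ratio.
emissions = [1.0, 1e-6]
delta_c = [0.45, 1e-3]

ratio_value = ratio_estimator(delta_c, emissions)
slope_value = regression_estimator(delta_c, emissions)
```

In this toy case the near-zero-emission period inflates the averaged ratio by orders of magnitude, while the slope stays close to 0.45 because that period receives almost no weight in the least-squares fit.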
- [83] arXiv:2401.13665 (replaced) [pdf, other]
-
Title: Entrywise Inference for Missing Panel Data: A Simple and Instance-Optimal Approach
Subjects: Statistics Theory (math.ST); Econometrics (econ.EM); Methodology (stat.ME); Machine Learning (stat.ML)
Longitudinal or panel data can be represented as a matrix with rows indexed by units and columns indexed by time. We consider inferential questions associated with the missing data version of panel data induced by staggered adoption. We propose a computationally efficient procedure for estimation, involving only simple matrix algebra and singular value decomposition, and prove non-asymptotic and high-probability bounds on its error in estimating each missing entry. By controlling proximity to a suitably scaled Gaussian variable, we develop and analyze a data-driven procedure for constructing entrywise confidence intervals with pre-specified coverage. Despite its simplicity, our procedure turns out to be instance-optimal: we prove that the width of our confidence intervals matches a non-asymptotic instance-wise lower bound derived via a Bayesian Cramér-Rao argument. We illustrate the sharpness of our theoretical characterization on a variety of numerical examples. Our analysis is based on a general inferential toolbox for SVD-based algorithms applied to the matrix denoising model, which might be of independent interest.
- [84] arXiv:2402.02306 (replaced) [pdf, html, other]
-
Title: A flexible Bayesian g-formula for causal survival analyses with time-dependent confounding
Subjects: Methodology (stat.ME); Computation (stat.CO); Machine Learning (stat.ML)
In longitudinal observational studies with a time-to-event outcome, a common objective in causal analysis is to estimate the causal survival curve under hypothetical intervention scenarios within the study cohort. The g-formula is a particularly useful tool for this analysis. To enhance the traditional parametric g-formula approach, we developed a more adaptable Bayesian g-formula estimator, which incorporates the Bayesian additive regression trees (BART) in the modeling of the time-evolving generative components, aiming to mitigate bias due to model misspecification. Specifically, we introduce a more general class of g-formulas for discrete survival data that can incorporate the longitudinal balancing scores, which serve as an effective method for dimension reduction and are vital when dealing with an expanding array of time-varying confounders. The minimum sufficient formulation of these longitudinal balancing scores is linked to the nature of treatment regimes, whether static or dynamic. For each type of treatment regime, we provide posterior sampling algorithms grounded in the BART framework. We have conducted simulation studies to illustrate the empirical performance of the proposed method and further demonstrate its practical utility using data from the Yale New Haven Health System's (YNHHS) electronic health records.
- [85] arXiv:2402.03220 (replaced) [pdf, html, other]
-
Title: The Benefits of Reusing Batches for Gradient Descent in Two-Layer Networks: Breaking the Curse of Information and Leap Exponents
Comments: Accepted at the International Conference on Machine Learning (ICML), 2024
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We investigate the training dynamics of two-layer neural networks when learning multi-index target functions. We focus on multi-pass gradient descent (GD) that reuses the batches multiple times and show that it significantly changes the conclusion about which functions are learnable compared to single-pass gradient descent. In particular, multi-pass GD with finite stepsize is found to overcome the limitations of gradient flow and single-pass GD given by the information exponent (Ben Arous et al., 2021) and leap exponent (Abbe et al., 2023) of the target function. We show that upon re-using batches, the network achieves in just two time steps an overlap with the target subspace even for functions not satisfying the staircase property (Abbe et al., 2021). We characterize the (broad) class of functions efficiently learned in finite time. The proof of our results is based on the analysis of the Dynamical Mean-Field Theory (DMFT). We further provide a closed-form description of the dynamical process of the low-dimensional projections of the weights, and numerical experiments illustrating the theory.
- [86] arXiv:2402.05330 (replaced) [pdf, html, other]
-
Title: Classification under Nuisance Parameters and Generalized Label Shift in Likelihood-Free Inference
Comments: 26 pages, 19 figures, code available at this https URL
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
An open scientific challenge is how to classify events with reliable measures of uncertainty, when we have a mechanistic model of the data-generating process but the distribution over both labels and latent nuisance parameters is different between train and target data. We refer to this type of distributional shift as generalized label shift (GLS). Direct classification using observed data $\mathbf{X}$ as covariates leads to biased predictions and invalid uncertainty estimates of labels $Y$. We overcome these biases by proposing a new method for robust uncertainty quantification that casts classification as a hypothesis testing problem under nuisance parameters. The key idea is to estimate the classifier's receiver operating characteristic (ROC) across the entire nuisance parameter space, which allows us to devise cutoffs that are invariant under GLS. Our method effectively endows a pre-trained classifier with domain adaptation capabilities and returns valid prediction sets while maintaining high power. We demonstrate its performance on two challenging scientific problems in biology and astroparticle physics with data from realistic mechanistic models.
- [87] arXiv:2402.07025 (replaced) [pdf, html, other]
-
Title: Generalization Error of Graph Neural Networks in the Mean-field Regime
Comments: Accepted in ICML 2024
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
This work provides a theoretical framework for assessing the generalization error of graph neural networks in the over-parameterized regime, where the number of parameters surpasses the quantity of data points. We explore two widely utilized types of graph neural networks: graph convolutional neural networks and message passing graph neural networks. Prior to this study, existing bounds on the generalization error in the over-parameterized regime were uninformative, limiting our understanding of over-parameterized network performance. Our novel approach involves deriving upper bounds within the mean-field regime for evaluating the generalization error of these graph neural networks. We establish upper bounds with a convergence rate of $O(1/n)$, where $n$ is the number of graph samples. These upper bounds offer a theoretical assurance of the networks' performance on unseen data in the challenging over-parameterized regime and overall contribute to our understanding of their performance.
- [88] arXiv:2403.09928 (replaced) [pdf, html, other]
-
Title: Identification and estimation of mediational effects of longitudinal modified treatment policies
Brian Gilbert, Katherine L. Hoffman, Nicholas Williams, Kara E. Rudolph, Edward J. Schenck, Iván Díaz
Comments: add references, minor textual changes
Subjects: Methodology (stat.ME)
We demonstrate a comprehensive semiparametric approach to causal mediation analysis, addressing the complexities inherent in settings with longitudinal and continuous treatments, confounders, and mediators. Our methodology utilizes a nonparametric structural equation model and a cross-fitted sequential regression technique based on doubly robust pseudo-outcomes, yielding an efficient, asymptotically normal estimator without relying on restrictive parametric modeling assumptions. We are motivated by a recent scientific controversy regarding the effects of invasive mechanical ventilation (IMV) on the survival of COVID-19 patients, considering acute kidney injury (AKI) as a mediating factor. We highlight the possibility of "inconsistent mediation," in which the direct and indirect effects of the exposure operate in opposite directions. We discuss the significance of mediation analysis for scientific understanding and its potential utility in treatment decisions.
- [89] arXiv:2403.18782 (replaced) [pdf, html, other]
-
Title: Beyond boundaries: Gary Lorden's groundbreaking contributions to sequential analysis
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
Gary Lorden provided several fundamental and novel insights into sequential hypothesis testing and changepoint detection. In this article, we provide an overview of Lorden's contributions in the context of existing results in those areas, and some extensions made possible by Lorden's work. We also mention some of Lorden's significant consulting work, including as an expert witness and for NASA, the entertainment industry, and Major League Baseball.
- [90] arXiv:2404.10834 (replaced) [pdf, html, other]
-
Title: VARX Granger Analysis: Modeling, Inference, and Applications
Subjects: Methodology (stat.ME)
Complex systems, such as brains, markets, and societies, exhibit internal dynamics influenced by external factors. Disentangling delayed external effects from internal dynamics within these systems is often challenging. We propose using a Vector Autoregressive model with eXogenous input (VARX) to capture delayed interactions between internal and external variables. While this model aligns with Granger's statistical formalism for testing "causal relations", the connection between the two is not widely understood. Here, we bridge this gap by providing fundamental equations, user-friendly code, and demonstrations using simulated and real-world data from neuroscience, physiology, sociology, and economics. Our examples illustrate how the model avoids spurious correlation by factoring out external influences from internal dynamics, leading to more parsimonious explanations of the systems. We also provide methods for enhancing model efficiency, such as L2 regularization for limited data and basis functions to cope with extended delays. Additionally, we analyze model performance under various scenarios where model assumptions are violated. MATLAB, Python, and R code are provided for easy adoption: this https URL
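For orientation, a VARX model is commonly written in the following textbook form. The symbols, the lag orders $p$ and $q$, and the choice to start the exogenous lags at $k = 0$ are assumptions here and may differ from the paper's own notation.

```latex
\mathbf{y}_t \;=\; \sum_{k=1}^{p} \mathbf{A}_k \, \mathbf{y}_{t-k}
\;+\; \sum_{k=0}^{q} \mathbf{B}_k \, \mathbf{x}_{t-k}
\;+\; \mathbf{e}_t
```

Here $\mathbf{y}_t$ collects the internal variables, $\mathbf{x}_t$ the external inputs, and $\mathbf{e}_t$ the innovation. In a Granger-style test, one typically compares this full model with a reduced model in which a chosen block of the $\mathbf{A}_k$ or $\mathbf{B}_k$ coefficients is fixed to zero, and asks whether the full model predicts significantly better.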
- [91] arXiv:2404.11678 (replaced) [pdf, html, other]
-
Title: Corrected Correlation Estimates for Meta-Analysis
Comments: 31 pages, 9 figures
Subjects: Methodology (stat.ME); Optimization and Control (math.OC); Applications (stat.AP)
Meta-analysis allows rigorous aggregation of estimates and uncertainty across multiple studies. When a given study reports multiple estimates, such as log odds ratios (ORs) or log relative risks (RRs) across exposure groups, accounting for within-study correlations improves accuracy and efficiency of meta-analytic results. Canonical approaches of Greenland-Longnecker and Hamling estimate pseudo cases and non-cases for exposure groups to obtain within-study correlations. However, currently available implementations for both methods fail on simple examples.
We review both GL and Hamling methods through the lens of optimization. For ORs, we provide modifications of each approach that ensure convergence for any feasible inputs. For GL, this is achieved through a new connection to entropic minimization. For Hamling, a modification leads to a provably solvable equivalent set of equations given a specific initialization. For each, we provide implementations guaranteed to work for any feasible input.
For RRs, we show the new GL approach is always guaranteed to succeed, but any Hamling approach may fail: we give counter-examples where no solutions exist. We derive a sufficient condition on reported RRs that guarantees success when reported variances are all equal.
- [92] arXiv:2404.16775 (replaced) [pdf, html, other]
-
Title: Estimating Metocean Environments Associated with Extreme Structural Response to Demonstrate the Dangers of Environmental Contour Methods
Subjects: Methodology (stat.ME); Applications (stat.AP)
Extreme value analysis (EVA) uses data to estimate long-term extreme environmental conditions for variables such as significant wave height and period, for the design of marine structures. Together with models for the short-term evolution of the ocean environment and for wave-structure interaction, EVA provides a basis for full probabilistic design analysis. Alternatively, environmental contours provide an approximate approach to estimating structural integrity, without requiring structural knowledge. These contour methods also exploit statistical models, including EVA, but avoid the need for structural modelling by making what are believed to be conservative assumptions about the shape of the structural failure boundary in the environment space. These assumptions, however, may not always be appropriate, or may lead to unnecessary wasted resources from over design. We demonstrate a methodology for efficient fully probabilistic analysis of structural failure. From this, we estimate the joint conditional probability density of the environment (CDE), given the occurrence of an extreme structural response. We use CDE as a diagnostic to highlight the deficiencies of environmental contour methods for design; none of the IFORM environmental contours considered characterise CDE well for three example structures.
- [93] arXiv:2405.02783 (replaced) [pdf, html, other]
-
Title: Linear Noise Approximation Assisted Bayesian Inference on Mechanistic Model of Partially Observed Stochastic Reaction Network
Comments: 11 pages, 2 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
To support mechanism online learning and facilitate digital twin development for biomanufacturing processes, this paper develops an efficient Bayesian inference approach for a partially observed enzymatic stochastic reaction network (SRN), a fundamental building block of multi-scale bioprocess mechanistic models. To tackle the critical challenges brought by the nonlinear stochastic differential equation (SDE)-based mechanistic model with partially observed states and measurement errors, an interpretable Bayesian updating linear noise approximation (LNA) metamodel, incorporating the structure information of the mechanistic model, is proposed to approximate the likelihood of observations. Then, an efficient posterior sampling approach is developed by utilizing the gradients of the derived likelihood to speed up the convergence of Markov Chain Monte Carlo (MCMC). The empirical study demonstrates that the proposed approach shows promising performance.
- [94] arXiv:2405.03083 (replaced) [pdf, html, other]
-
Title: Causal K-Means Clustering
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
Causal effects are often characterized with population summaries. These might provide an incomplete picture when there are heterogeneous treatment effects across subgroups. Since the subgroup structure is typically unknown, it is more challenging to identify and evaluate subgroup effects than population effects. We propose a new solution to this problem: Causal k-Means Clustering, which harnesses the widely-used k-means clustering algorithm to uncover the unknown subgroup structure. Our problem differs significantly from the conventional clustering setup since the variables to be clustered are unknown counterfactual functions. We present a plug-in estimator which is simple and readily implementable using off-the-shelf algorithms, and study its rate of convergence. We also develop a new bias-corrected estimator based on nonparametric efficiency theory and double machine learning, and show that this estimator achieves fast root-n rates and asymptotic normality in large nonparametric models. Our proposed methods are especially useful for modern outcome-wide studies with multiple treatment levels. Further, our framework is extensible to clustering with generic pseudo-outcomes, such as partially observed outcomes or otherwise unknown functions. Finally, we explore finite sample properties via simulation, and illustrate the proposed methods in a study of treatment programs for adolescent substance abuse.
- [95] arXiv:2405.03180 (replaced) [pdf, html, other]
-
Title: Braced Fourier Continuation and Regression for Anomaly DetectionComments: 16 pages, 9 figures, associated Github link: this https URL -6/30/2024 update corrected and reworded erroneous figure references, minor typosSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
In this work, the concept of Braced Fourier Continuation and Regression (BFCR) is introduced. BFCR is a novel and computationally efficient means of finding nonlinear regressions or trend lines in arbitrary one-dimensional data sets. The Braced Fourier Continuation (BFC) and BFCR algorithms are first outlined, followed by a discussion of the properties of BFCR as well as demonstrations of how BFCR trend lines may be used effectively for anomaly detection both within and at the edges of arbitrary one-dimensional data sets. Finally, potential issues which may arise while using BFCR for anomaly detection as well as possible mitigation techniques are outlined and discussed. All source code and example data sets are either referenced or available via GitHub, and all associated code is written entirely in Python.
- [96] arXiv:2405.04715 (replaced) [pdf, html, other]
-
Title: Causality Pursuit from Heterogeneous Environments via Neural Adversarial Invariance LearningComments: 48 pages, 7 figures with appendixSubjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Pursuing causality from data is a fundamental problem in scientific discovery, treatment intervention, and transfer learning. This paper introduces a novel algorithmic method for addressing nonparametric invariance and causality learning in regression models across multiple environments, where the joint distribution of response variables and covariates varies, but the conditional expectations of outcome given an unknown set of quasi-causal variables are invariant. The challenge of finding such an unknown set of quasi-causal or invariant variables is compounded by the presence of endogenous variables that have heterogeneous effects across different environments; including even one of them in the regression would make the estimation inconsistent. The proposed Focused Adversarial Invariant Regularization (FAIR) framework utilizes an innovative minimax optimization approach that breaks down the barriers, driving regression models toward prediction-invariant solutions through adversarial testing. Leveraging the representation power of neural networks, FAIR neural networks (FAIR-NN) are introduced for causality pursuit. It is shown that FAIR-NN can find the invariant variables and quasi-causal variables under a minimal identification condition and that the resulting procedure is adaptive to low-dimensional composition structures in a non-asymptotic analysis. Under a structural causal model, variables identified by FAIR-NN represent pragmatic causality and provably align with exact causal mechanisms under conditions of sufficient heterogeneity. Computationally, FAIR-NN employs a novel Gumbel approximation with decreased temperature and stochastic gradient descent ascent algorithm. The procedures are convincingly demonstrated using simulated and real-data examples.
- [97] arXiv:2406.14717 (replaced) [pdf, html, other]
-
Title: Analysis of Linked Files: A Missing Data PerspectiveComments: Accepted manuscript, to be published in Statistical ScienceSubjects: Methodology (stat.ME); Applications (stat.AP)
In many applications, researchers seek to identify overlapping entities across multiple data files. Record linkage algorithms facilitate this task, in the absence of unique identifiers. As these algorithms rely on semi-identifying information, they may miss records that represent the same entity, or incorrectly link records that do not represent the same entity. Analysis of linked files commonly ignores such linkage errors, resulting in biased, or overly precise estimates of the associations of interest. We view record linkage as a missing data problem, and delineate the linkage mechanisms that underpin analysis methods with linked files. Following the missing data literature, we group these methods under three categories: likelihood and Bayesian methods, imputation methods, and weighting methods. We summarize the assumptions and limitations of the methods, and evaluate their performance in a wide range of simulation scenarios.
- [98] arXiv:2406.18905 (replaced) [pdf, html, other]
-
Title: Bayesian inference: More than Bayes's theoremComments: 35 pages, 11 figures; accepted for publication in Frontiers in Astronomy and Space Sciences (special issue for iid2022: Statistical Methods for Event Data - Illuminating the Dynamic Universe); fixed minor typoSubjects: Methodology (stat.ME); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Bayesian inference gets its name from *Bayes's theorem*, expressing posterior probabilities for hypotheses about a data generating process as the (normalized) product of prior probabilities and a likelihood function. But Bayesian inference uses all of probability theory, not just Bayes's theorem. Many hypotheses of scientific interest are *composite hypotheses*, with the strength of evidence for the hypothesis dependent on knowledge about auxiliary factors, such as the values of nuisance parameters (e.g., uncertain background rates or calibration factors). Many important capabilities of Bayesian methods arise from use of the law of total probability, which instructs analysts to compute probabilities for composite hypotheses by *marginalization* over auxiliary factors. This tutorial targets relative newcomers to Bayesian inference, aiming to complement tutorials that focus on Bayes's theorem and how priors modulate likelihoods. The emphasis here is on marginalization over parameter spaces -- both how it is the foundation for important capabilities, and how it may motivate caution when parameter spaces are large. Topics covered include the difference between likelihood and probability, understanding the impact of priors beyond merely shifting the maximum likelihood estimate, and the role of marginalization in accounting for uncertainty in nuisance parameters, systematic error, and model misspecification.
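The marginalization step described in this abstract can be written compactly. The following is a generic statement of Bayes's theorem combined with the law of total probability, not a formula taken from the paper; the symbols $H$ (hypothesis), $D$ (data), and $\theta$ (nuisance parameters) are notation assumed here for illustration.

```latex
p(H \mid D) \;\propto\; p(H)\, p(D \mid H),
\qquad
p(D \mid H) \;=\; \int p(D \mid \theta, H)\, p(\theta \mid H)\, \mathrm{d}\theta .
```

The second identity is the marginal likelihood of a composite hypothesis: the likelihood is averaged over the nuisance parameters, weighted by their prior, rather than evaluated at a single best-fit value.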
- [99] arXiv:2406.19157 (replaced) [pdf, html, other]
-
Title: How to build your latent Markov model -- the role of time and spaceComments: 41 pages, 7 figuresSubjects: Methodology (stat.ME)
Statistical models that involve latent Markovian state processes have become immensely popular tools for analysing time series and other sequential data. However, the plethora of model formulations, the inconsistent use of terminology, and the various inferential approaches and software packages can be overwhelming to practitioners, especially when they are new to this area. With this review-like paper, we thus aim to provide guidance for both statisticians and practitioners working with latent Markov models by offering a unifying view on what otherwise are often considered separate model classes, from hidden Markov models over state-space models to Markov-modulated Poisson processes. In particular, we provide a roadmap for identifying a suitable latent Markov model formulation given the data to be analysed. Furthermore, we emphasise that it is key to applied work with any of these model classes to understand how recursive techniques exploiting the models' dependence structure can be used for inference. The R package LaMa adapts this unified view and provides an easy-to-use framework for very fast (C++ based) evaluation of the likelihood of any of the models discussed in this paper, allowing users to tailor a latent Markov model to their data using a Lego-type approach.
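As an illustration of the recursive likelihood evaluation this abstract refers to, here is a minimal sketch of the scaled forward algorithm for a discrete-state, discrete-emission hidden Markov model. It is written in plain Python and does not use or reproduce the LaMa package; the function name `hmm_log_likelihood` and its argument layout are assumptions made for this example.

```python
import math


def hmm_log_likelihood(init, trans, emis, obs):
    """Log-likelihood of an observation sequence under a discrete HMM.

    init[i]     : probability of starting in state i
    trans[i][j] : probability of moving from state i to state j
    emis[i][k]  : probability of emitting symbol k from state i
    obs         : list of observed symbol indices (non-empty)
    """
    n_states = len(init)
    # Forward variables for the first observation.
    alpha = [init[i] * emis[i][obs[0]] for i in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transition matrix, then weight by emission.
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n_states)) * emis[j][obs[t]]
                for j in range(n_states)
            ]
        # Rescale at every step to avoid numerical underflow on long series.
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik
```

The cost is linear in the length of the series and quadratic in the number of states, which is what makes direct numerical maximization of the likelihood practical for this model class.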
- [100] arXiv:2006.16202 (replaced) [pdf, html, other]
-
Title: Partitioned Least SquaresComments: To appear in Springer Machine Learning Journal (this https URL)Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In this paper we propose a variant of the linear least squares model allowing practitioners to partition the input features into groups of variables that they require to contribute similarly to the final result. The output allows practitioners to assess the importance of each group and of each variable in the group. We formally show that the new formulation is not convex and provide two alternative methods to deal with the problem: one non-exact method based on an alternating least squares approach; and one exact method based on a reformulation of the problem using an exponential number of sub-problems whose minimum is guaranteed to be the optimal solution. We formally show the correctness of the exact method and also compare the two solutions showing that the exact solution provides better results in a fraction of the time required by the alternating least squares solution (assuming that the number of partitions is small). For the sake of completeness, we also provide an alternative branch and bound algorithm that can be used in place of the exact method when the number of partitions is too large, and a proof of NP-completeness of the optimization problem introduced in this paper.
- [101] arXiv:2201.04811 (replaced) [pdf, html, other]
-
Title: Binary response model with many weak instrumentsSubjects: Econometrics (econ.EM); Applications (stat.AP)
This paper considers an endogenous binary response model with many weak instruments. We employ a control function approach and a regularization scheme to obtain better estimation results for the endogenous binary response model in the presence of many weak instruments. Two consistent and asymptotically normally distributed estimators are provided, each of which is called a regularized conditional maximum likelihood estimator (RCMLE) and a regularized nonlinear least squares estimator (RNLSE). Monte Carlo simulations show that the proposed estimators outperform the existing ones when there are many weak instruments. We use the proposed estimation method to examine the effect of family income on college completion.
- [102] arXiv:2208.13065 (replaced) [pdf, html, other]
-
Title: Towards Improving Unit Commitment Economics: An Add-On Tailor for Renewable Energy and Reserve PredictionsComments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessibleSubjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY); Applications (stat.AP)
Generally, day-ahead unit commitment (UC) is conducted in a predict-then-optimize process: it starts by predicting the renewable energy source (RES) availability and system reserve requirements; given the predictions, the UC model is then optimized to determine the economic operation plans. In fact, predictions within the process are raw. In other words, if the predictions are further tailored to assist UC in making the economic operation plans against realizations of the RES and reserve requirements, UC economics will benefit significantly. To this end, this paper presents a cost-oriented tailor of RES-and-reserve predictions for UC, deployed as an add-on to the predict-then-optimize process. The RES-and-reserve tailor is trained by solving a bi-level mixed-integer programming model: the upper level trains the tailor based on its induced operating cost; the lower level, given tailored predictions, mimics the system operation process and feeds the induced operating cost back to the upper level; finally, the upper level evaluates the training quality according to the fed-back cost. Through this training, the tailor learns to customize the raw predictions into cost-oriented predictions. Moreover, the tailor can be embedded into the existing predict-then-optimize process as an add-on, improving the UC economics. Lastly, the presented method is compared to traditional, binary-relaxation, neural network-based, stochastic, and robust methods.
- [103] arXiv:2209.13694 (replaced) [pdf, html, other]
-
Title: Safe Linear Bandits over Unknown PolytopesComments: v3: Presented at COLT 2024Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
The safe linear bandit problem (SLB) is an online approach to linear programming with unknown objective and unknown roundwise constraints, under stochastic bandit feedback of rewards and safety risks of actions. We study the tradeoffs between efficacy and smooth safety costs of SLBs over polytopes, and the role of aggressive doubly-optimistic play in avoiding the strong assumptions made by extant pessimistic-optimistic approaches.
We first elucidate an inherent hardness in SLBs due to the lack of knowledge of constraints: there exist `easy' instances, for which suboptimal extreme points have large `gaps', but on which SLB methods must still incur $\Omega(\sqrt{T})$ regret or safety violations, due to an inability to resolve unknown optima to arbitrary precision. We then analyse a natural doubly-optimistic strategy for the safe linear bandit problem, DOSS, which uses optimistic estimates of both reward and safety risks to select actions, and show that despite the lack of knowledge of constraints or feasible points, DOSS simultaneously obtains tight instance-dependent $O(\log^2 T)$ bounds on efficacy regret, and $\tilde O(\sqrt{T})$ bounds on safety violations. Further, when safety is demanded to a finite precision, violations improve to $O(\log^2 T).$ These results rely on a novel dual analysis of linear bandits: we argue that DOSS proceeds by activating noisy versions of at least $d$ constraints in each round, which allows us to separately analyse rounds where a `poor' set of constraints is activated, and rounds where `good' sets of constraints are activated. The costs in the former are controlled to $O(\log^2 T)$ by developing new dual notions of gaps, based on global sensitivity analyses of linear programs, that quantify the suboptimality of each such set of constraints. The latter costs are controlled to $O(1)$ by explicitly analysing the solutions of optimistic play.
- [104] arXiv:2210.13193 (replaced) [pdf, other]
-
Title: Langevin dynamics based algorithm e-TH$\varepsilon$O POULA for stochastic optimization problems with discontinuous stochastic gradientSubjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR); Machine Learning (stat.ML)
We introduce a new Langevin dynamics based algorithm, called e-TH$\varepsilon$O POULA, to solve optimization problems with discontinuous stochastic gradients which naturally appear in real-world applications such as quantile estimation, vector quantization, CVaR minimization, and regularized optimization problems involving ReLU neural networks. We demonstrate both theoretically and numerically the applicability of the e-TH$\varepsilon$O POULA algorithm. More precisely, under the conditions that the stochastic gradient is locally Lipschitz in average and satisfies a certain convexity at infinity condition, we establish non-asymptotic error bounds for e-TH$\varepsilon$O POULA in Wasserstein distances and provide a non-asymptotic estimate for the expected excess risk, which can be controlled to be arbitrarily small. Three key applications in finance and insurance are provided, namely, multi-period portfolio optimization, transfer learning in multi-period portfolio optimization, and insurance claim prediction, which involve neural networks with (Leaky)-ReLU activation functions. Numerical experiments conducted using real-world datasets illustrate the superior empirical performance of e-TH$\varepsilon$O POULA compared to SGLD, TUSLA, ADAM, and AMSGrad in terms of model accuracy.
- [105] arXiv:2211.07484 (replaced) [pdf, html, other]
-
Title: Contextual Bandits with Packing and Covering Constraints: A Modular Lagrangian Approach via RegressionComments: A preliminary version of this paper, authored by A. Slivkins, K.A. Sankararaman and D.J. Foster, has been published at COLT 2023. The present version features an important improvement, due to Xingyu Zhou. Specifically, the $\sqrt{T}$-regret result in Theorem 3.6(a) holds under a much weaker assumption, and is now positioned as the main guaranteeSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We consider contextual bandits with linear constraints (CBwLC), a variant of contextual bandits in which the algorithm consumes multiple resources subject to linear constraints on total consumption. This problem generalizes contextual bandits with knapsacks (CBwK), allowing for packing and covering constraints, as well as positive and negative resource consumption. We provide the first algorithm for CBwLC (or CBwK) that is based on regression oracles. The algorithm is simple, computationally efficient, and statistically optimal under mild assumptions. Further, we provide the first vanishing-regret guarantees for CBwLC (or CBwK) that extend beyond the stochastic environment. We side-step strong impossibility results from prior work by identifying a weaker (and, arguably, fairer) benchmark to compare against. Our algorithm builds on LagrangeBwK (Immorlica et al., FOCS 2019), a Lagrangian-based technique for CBwK, and SquareCB (Foster and Rakhlin, ICML 2020), a regression-based technique for contextual bandits. Our analysis leverages the inherent modularity of both techniques.
- [106] arXiv:2305.02449 (replaced) [pdf, other]
-
Title: Bayesian Safety Validation for Failure Probability Estimation of Black-Box SystemsJournal-ref: AIAA Journal of Aerospace Information Systems (JAIS) 21.7 (2024): 533-546Subjects: Machine Learning (cs.LG); Applications (stat.AP)
Estimating the probability of failure is an important step in the certification of safety-critical systems. Efficient estimation methods are often needed due to the challenges posed by high-dimensional input spaces, risky test scenarios, and computationally expensive simulators. This work frames the problem of black-box safety validation as a Bayesian optimization problem and introduces a method that iteratively fits a probabilistic surrogate model to efficiently predict failures. The algorithm is designed to search for failures, compute the most-likely failure, and estimate the failure probability over an operating domain using importance sampling. We introduce three acquisition functions that aim to reduce uncertainty by covering the design space, optimize the analytically derived failure boundaries, and sample the predicted failure regions. Results show this Bayesian safety validation approach provides a more accurate estimate of failure probability with orders of magnitude fewer samples and performs well across various safety validation metrics. We demonstrate this approach on three test problems, a stochastic decision making system, and a neural network-based runway detection system. This work is open sourced (this https URL) and currently being used to supplement the FAA certification process of the machine learning components for an autonomous cargo aircraft.
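To make the importance-sampling step mentioned in this abstract concrete, here is a minimal sketch on a toy problem. It does not reproduce the paper's surrogate-based method or its acquisition functions. The "system" is assumed to be a standard normal input that fails when it exceeds a threshold, and the proposal distribution (a unit-variance normal centered at the threshold) is chosen purely for illustration; both function names are hypothetical.

```python
import math
import random


def failure_probability_is(threshold, n_samples, seed=0):
    """Estimate P(X > threshold) for X ~ N(0, 1) by importance sampling.

    Samples are drawn from the proposal q = N(threshold, 1), which places
    about half of its mass in the failure region, and each failure is
    reweighted by the density ratio p(x) / q(x).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(threshold, 1.0)
        if x > threshold:
            # log p(x) - log q(x) = -x^2/2 + (x - c)^2/2 = -c*x + c^2/2
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n_samples


def exact_tail(threshold):
    """Closed-form P(X > threshold) for a standard normal, for comparison."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))
```

With a threshold of 3, plain Monte Carlo sees a failure roughly once per 740 draws, whereas the shifted proposal sees one on about every second draw. This is the basic reason importance sampling needs far fewer samples for rare failure events.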
- [107] arXiv:2307.11465 (replaced) [pdf, html, other]
-
Title: A Deep Learning Approach for Overall Survival Prediction in Lung Cancer with Missing ValuesComments: 24 pages, 4 figuresSubjects: Machine Learning (cs.LG); Applications (stat.AP)
In the field of lung cancer research, particularly in the analysis of overall survival (OS), artificial intelligence (AI) serves crucial roles with specific aims. Given the prevalent issue of missing data in the medical domain, our primary objective is to develop an AI model capable of dynamically handling this missing data. Additionally, we aim to leverage all accessible data, effectively analyzing both uncensored patients who have experienced the event of interest and censored patients who have not, by embedding a specialized technique within our AI model, not commonly utilized in other AI tasks. Through the realization of these objectives, our model aims to provide precise OS predictions for non-small cell lung cancer (NSCLC) patients, thus overcoming these significant challenges. We present a novel approach to survival analysis with missing values in the context of NSCLC, which exploits the strengths of the transformer architecture to account only for available features without requiring any imputation strategy. More specifically, this model tailors the transformer architecture to tabular data by adapting its feature embedding and masked self-attention to mask missing data and fully exploit the available ones. By making use of ad-hoc designed losses for OS, it is able to account for both censored and uncensored patients, as well as changes in risks over time. We compared our method with state-of-the-art models for survival analysis coupled with different imputation strategies. We evaluated the results obtained over a period of 6 years using different time granularities obtaining a Ct-index, a time-dependent variant of the C-index, of 71.97, 77.58 and 80.72 for time units of 1 month, 1 year and 2 years, respectively, outperforming all state-of-the-art methods regardless of the imputation method used.
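For readers unfamiliar with the metric family reported in this abstract, here is a minimal sketch of the standard (Harrell's) concordance index with right censoring. It is not the time-dependent Ct-index used in the paper, and the function name and argument layout are assumptions made for this example.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data.

    times[i]       : observed time for subject i (event or censoring time)
    events[i]      : truthy if the event was observed, falsy if censored
    risk_scores[i] : model output, higher meaning earlier expected event

    A pair (i, j) is comparable when subject i had an observed event
    strictly before subject j's observed time. Tied risk scores count 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. Censored subjects still contribute as the later member of a pair, which is how the metric uses both censored and uncensored patients.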
- [108] arXiv:2309.10642 (replaced) [pdf, html, other]
-
Title: Correcting Selection Bias in Standardized Test Scores ComparisonsSubjects: Econometrics (econ.EM); Applications (stat.AP)
This paper addresses the issue of sample selection bias when comparing countries using International assessments like PISA (Program for International Student Assessment). Despite its widespread use, PISA rankings may be biased due to different attrition patterns in different countries, leading to inaccurate comparisons. This study proposes a methodology to correct for sample selection bias using a quantile selection model. Applying the method to PISA 2018 data, I find that correcting for selection bias significantly changes the rankings (based on the mean) of countries' educational performances. My results highlight the importance of accounting for sample selection bias in international educational comparisons.
- [109] arXiv:2310.11439 (replaced) [pdf, html, other]
-
Title: From Alexnet to Transformers: Measuring the Non-linearity of Deep Neural Networks with Affine Optimal TransportQuentin Bouniot, Ievgen Redko, Anton Mallasto, Charlotte Laclau, Karol Arndt, Oliver Struckmeier, Markus Heinonen, Ville Kyrki, Samuel KaskiComments: Code available at this https URLSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
In the last decade, we have witnessed the introduction of several novel deep neural network (DNN) architectures exhibiting ever-increasing performance across diverse tasks. Explaining the upward trend of their performance, however, remains difficult as different DNN architectures of comparable depth and width -- common factors associated with their expressive power -- may exhibit a drastically different performance even when trained on the same dataset. In this paper, we introduce the concept of the non-linearity signature of DNN, the first theoretically sound solution for approximately measuring the non-linearity of deep neural networks. Built upon a score derived from closed-form optimal transport mappings, this signature provides a better understanding of the inner workings of a wide range of DNN architectures and learning paradigms, with a particular emphasis on the computer vision task. We provide extensive experimental results that highlight the practical usefulness of the proposed non-linearity signature and its potential for long-reaching implications. The code for our work is available at this https URL
- [110] arXiv:2311.13580 (replaced) [pdf, html, other]
-
Title: $\sigma$-PCA: a building block for neural learning of identifiable linear transformationsComments: Update with published versionSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Linear principal component analysis (PCA) learns (semi-)orthogonal transformations by orienting the axes to maximize variance. Consequently, it can only identify orthogonal axes whose variances are clearly distinct, but it cannot identify the subsets of axes whose variances are roughly equal. It cannot eliminate the subspace rotational indeterminacy: it fails to disentangle components with equal variances (eigenvalues), resulting, in each eigen subspace, in randomly rotated axes. In this paper, we propose $\sigma$-PCA, a method that (1) formulates a unified model for linear and nonlinear PCA, the latter being a special case of linear independent component analysis (ICA), and (2) introduces a missing piece into nonlinear PCA that allows it to eliminate, from the canonical linear PCA solution, the subspace rotational indeterminacy -- without whitening the inputs. Whitening, a preprocessing step which converts the inputs into unit-variance inputs, has generally been a prerequisite step for linear ICA methods, which meant that conventional nonlinear PCA could not necessarily preserve the orthogonality of the overall transformation, could not directly reduce dimensionality, and could not intrinsically order by variances. We offer insights on the relationship between linear PCA, nonlinear PCA, and linear ICA -- three methods with autoencoder formulations for learning special linear transformations from data, transformations that are (semi-)orthogonal for PCA, and arbitrary unit-variance for ICA. As part of our formulation, nonlinear PCA can be seen as a method that maximizes both variance and statistical independence, lying in the middle between linear PCA and linear ICA, serving as a building block for learning linear transformations that are identifiable.
- [111] arXiv:2312.02027 (replaced) [pdf, other]
-
Title: Stochastic Optimal Control MatchingSubjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR); Machine Learning (stat.ML)
Stochastic optimal control, which has the goal of driving the behavior of noisy systems, is broadly applicable in science, engineering and artificial intelligence. Our work introduces Stochastic Optimal Control Matching (SOCM), a novel Iterative Diffusion Optimization (IDO) technique for stochastic optimal control that stems from the same philosophy as the conditional score matching loss for diffusion models. That is, the control is learned via a least squares problem by trying to fit a matching vector field. The training loss, which is closely connected to the cross-entropy loss, is optimized with respect to both the control function and a family of reparameterization matrices which appear in the matching vector field. The optimization with respect to the reparameterization matrices aims at minimizing the variance of the matching vector field. Experimentally, our algorithm achieves lower error than all the existing IDO techniques for stochastic optimal control for three out of four control problems, in some cases by an order of magnitude. The key idea underlying SOCM is the path-wise reparameterization trick, a novel technique that may be of independent interest. Code at this https URL
- [112] arXiv:2312.07928 (replaced) [pdf, other]
-
Title: Bayesian inversion of GPR waveforms for sub-surface material characterization: an uncertainty-aware retrieval of soil moisture and overlaying biomass propertiesComments: Total 34 pages, 17 Figures. This paper under review in a journal but has not been published yetSubjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Applications (stat.AP)
Accurate estimation of sub-surface properties such as moisture content and depth of soil and vegetation layers is crucial for applications spanning sub-surface condition monitoring, precision agriculture, and effective wildfire risk assessment. Soil in nature is often covered by overlaying vegetation and surface organic material, making its characterization challenging. In addition, the estimation of the properties of the overlaying layer is crucial for applications like wildfire risk assessment. This study thus proposes a Bayesian model-updating-based approach for ground penetrating radar (GPR) waveform inversion to predict moisture contents and depths of soil and overlaying material layer. Due to its high correlation with moisture contents, the dielectric permittivity of both layers was predicted with the proposed method, along with other parameters, including depth and electrical conductivity of layers. The proposed Bayesian model updating approach yields probabilistic estimates of these parameters that can provide information about the confidence and uncertainty related to the estimates. The methodology was evaluated for a diverse range of experimental data collected through laboratory and field investigations. Laboratory investigations included variations in soil moisture values, depth of the overlaying surface layer, and coarseness of its material. The field investigation included measurement of field soil moisture for sixteen days. The results demonstrated predictions consistent with time-domain reflectometry (TDR) measurements and conventional gravimetric tests. The depth of the surface layer could also be predicted with reasonable accuracy. The proposed method provides a promising approach for uncertainty-aware sub-surface parameter estimation that can enable decision-making for risk assessment across a wide range of applications.
- [113] arXiv:2402.16710 (replaced) [pdf, html, other]
-
Title: Cost Aware Best Arm IdentificationSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In this paper, we study a best arm identification problem with dual objects. In addition to the classic reward, each arm is associated with a cost distribution and the goal is to identify the largest reward arm using the minimum expected cost. We call it *Cost Aware Best Arm Identification* (CABAI), which captures the separation of testing and implementation phases in product development pipelines and models the objective shift between phases, i.e., cost for testing and reward for implementation. We first derive a theoretical lower bound for CABAI and propose an algorithm called $\mathsf{CTAS}$ to match it asymptotically. To reduce the computation of $\mathsf{CTAS}$, we further propose a simple algorithm called *Chernoff Overlap* (CO), based on a square-root rule, which we prove is optimal in simplified two-armed models and generalizes well in numerical experiments. Our results show that (i) ignoring the heterogeneous action cost results in sub-optimality in practice, and (ii) simple algorithms can deliver near-optimal performance over a wide range of problems.
- [114] arXiv:2403.01046 (replaced) [pdf, other]
-
Title: A Library of Mirrors: Deep Neural Nets in Low Dimensions are Convex Lasso Models with Reflection FeaturesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC); Machine Learning (stat.ML)
We prove that training neural networks on 1-D data is equivalent to solving a convex Lasso problem with a fixed, explicitly defined dictionary matrix of features. The specific dictionary depends on the activation and depth. We consider 2 and 3-layer networks with piecewise linear activations, and rectangular and tree networks with sign activation and arbitrary depth. Interestingly, in absolute value and symmetrized ReLU networks, a third layer creates features that represent reflections of training data about themselves. The Lasso representation sheds light on globally optimal networks and the solution landscape.
- [115] arXiv:2403.12975 (replaced) [pdf, other]
-
Title: Training morphological neural networks with gradient descent: some theoretical insightsSamy Blusseau (CMM)Journal-ref: IAPR Third International Conference on Discrete Geometry and Mathematical Morphology, Andrea Frosini; Elena Barcucci; Elisa Pergola; Michela Ascolese; Niccoló Di Marco; Simone Rinaldi; Sara Brunetti; Giulia Palma; Veronica Gierrini; Leonardo Bindi, Apr 2024, Firenze, Italy. pp.229-241Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Morphological neural networks, or layers, can be a powerful tool to boost progress in mathematical morphology, either on theoretical aspects such as the representation of complete lattice operators, or in the development of image processing pipelines. However, these architectures turn out to be difficult to train when they contain more than a few morphological layers, at least within popular machine learning frameworks that use gradient-descent-based optimization algorithms. In this paper we investigate the potential and limitations of differentiation-based approaches and back-propagation applied to morphological networks, in light of the non-smooth optimization concept of the Bouligand derivative. We provide insights and first theoretical guidelines, in particular regarding initialization and learning rates.
- [116] arXiv:2403.17852 (replaced) [pdf, html, other]
Title: Counterfactual Fairness through Transforming Data Orthogonal to Bias
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Machine learning models have shown exceptional prowess in solving complex issues across various domains. However, these models can sometimes exhibit biased decision-making, resulting in unequal treatment of different groups. Despite substantial research on counterfactual fairness, methods to reduce the impact of multivariate and continuous sensitive variables on decision-making outcomes are still underdeveloped. We propose a novel data pre-processing algorithm, Orthogonal to Bias (OB), which is designed to eliminate the influence of a group of continuous sensitive variables, thus promoting counterfactual fairness in machine learning applications. Our approach, based on the assumption of a jointly normal distribution within a structural causal model (SCM), demonstrates that counterfactual fairness can be achieved by ensuring the data is orthogonal to the observed sensitive variables. The OB algorithm is model-agnostic, making it applicable to a wide range of machine learning models and tasks. Additionally, it includes a sparse variant to improve numerical stability through regularization. Empirical evaluations on both simulated and real-world datasets, encompassing settings with both discrete and continuous sensitive variables, show that our methodology effectively promotes fairer outcomes without compromising accuracy.
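The orthogonality idea in this abstract can be illustrated with a generic least-squares projection. The sketch below is not the OB algorithm itself, whose details the abstract does not give; it assumes centered data and simply removes from each feature column the component that is linearly explained by the sensitive variables. The function name `residualize` and the toy data are hypothetical.

```python
import numpy as np

def residualize(X, S):
    """Return X with the linear span of S projected out (columns centered first)."""
    Xc = X - X.mean(axis=0)
    Sc = S - S.mean(axis=0)
    # Least-squares coefficients of each X column on the sensitive variables.
    coef, *_ = np.linalg.lstsq(Sc, Xc, rcond=None)
    return Xc - Sc @ coef

# Toy data (assumption): two continuous sensitive variables, three features.
rng = np.random.default_rng(0)
S = rng.normal(size=(500, 2))
mixing = np.array([[1.0, 0.5, 0.0], [0.0, 2.0, 1.0]])
X = S @ mixing + rng.normal(size=(500, 3))

X_orth = residualize(X, S)
# Largest absolute cross-product between centered S and the transformed data.
max_cross = np.abs((S - S.mean(axis=0)).T @ X_orth).max()
```

After the transformation, `max_cross` is zero up to floating-point error, meaning the transformed features are empirically uncorrelated with the sensitive variables.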
- [117] arXiv:2404.01216 (replaced) [pdf, html, other]
Title: Novel Node Category Detection Under Subpopulation Shift
Comments: Accepted to ECML-PKDD 2024
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
In real-world graph data, distribution shifts can manifest in various ways, such as the emergence of new categories and changes in the relative proportions of existing categories. It is often important to detect nodes of novel categories under such distribution shifts for safety or insight discovery purposes. We introduce a new approach, Recall-Constrained Optimization with Selective Link Prediction (RECO-SLIP), to detect nodes belonging to novel categories in attributed graphs under subpopulation shifts. By integrating a recall-constrained learning framework with a sample-efficient link prediction mechanism, RECO-SLIP addresses the dual challenges of resilience against subpopulation shifts and the effective exploitation of graph structure. Our extensive empirical evaluation across multiple graph datasets demonstrates the superior performance of RECO-SLIP over existing methods. The experimental code is available at this https URL.
- [118] arXiv:2404.01273 (replaced) [pdf, html, other]
Title: TWIN-GPT: Digital Twins for Clinical Trials via Large Language Model
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Methodology (stat.ME)
Clinical trials are indispensable for medical research and the development of new treatments. However, clinical trials often involve thousands of participants and can span several years to complete, with a high probability of failure during the process. Recently, there has been a burgeoning interest in virtual clinical trials, which simulate real-world scenarios and hold the potential to significantly enhance patient safety, expedite development, reduce costs, and contribute to the broader scientific knowledge in healthcare. Existing research often focuses on leveraging electronic health records (EHRs) to support clinical trial outcome prediction. Yet, trained with limited clinical trial outcome data, existing approaches frequently struggle to make accurate predictions. Some research has attempted to generate EHRs to augment model development but has fallen short in personalizing the generation for individual patient profiles. Recently, the emergence of large language models has illuminated new possibilities, as their embedded comprehensive clinical knowledge has proven beneficial in addressing medical issues. In this paper, we propose a large language model-based digital twin creation approach, called TWIN-GPT. TWIN-GPT can establish cross-dataset associations of medical information given limited data, generating unique personalized digital twins for different patients, thereby preserving individual patient characteristics. Comprehensive experiments show that using digital twins created by TWIN-GPT can improve clinical trial outcome prediction, outperforming various previous prediction approaches.
- [119] arXiv:2404.14136 (replaced) [pdf, html, other]
Title: Elicitability and identifiability of tail risk measures
Comments: 31 pages; typo in equation (5.1) fixed in version 2
Subjects: Statistical Finance (q-fin.ST); Statistics Theory (math.ST); Risk Management (q-fin.RM); Methodology (stat.ME)
Tail risk measures are fully determined by the distribution of the underlying loss beyond its quantile at a certain level, with Value-at-Risk and Expected Shortfall being prime examples. They are induced by law-based risk measures, called their generators, evaluated on the tail distribution. This paper establishes joint identifiability and elicitability results of tail risk measures together with the corresponding quantile, provided that their generators are identifiable and elicitable, respectively. As an example, we establish the joint identifiability and elicitability of the tail expectile together with the quantile. The corresponding consistent scores constitute a novel class of weighted scores, nesting the known class of scores of Fissler and Ziegel for the Expected Shortfall together with the quantile. For statistical purposes, our results pave the way to easier model fitting for tail risk measures via regression and the generalized method of moments, but also model comparison and model validation in terms of established backtesting procedures.
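As a concrete reference point for the two examples named in this abstract, the following minimal sketch computes empirical Value-at-Risk and Expected Shortfall from a loss sample, together with the standard quantile (pinball) score, which is minimized in expectation by the quantile. It does not implement the weighted scores of the paper; the function names and the simulated standard-normal losses are assumptions made for illustration.

```python
import numpy as np

def value_at_risk(losses, alpha):
    """Empirical alpha-quantile of the loss sample."""
    return np.quantile(losses, alpha)

def expected_shortfall(losses, alpha):
    """Mean of the losses at or beyond the empirical alpha-quantile."""
    var = value_at_risk(losses, alpha)
    return losses[losses >= var].mean()

def pinball_score(x, losses, alpha):
    """Average quantile score of the forecast x over the loss sample."""
    return np.mean(((losses <= x).astype(float) - alpha) * (x - losses))

# Simulated losses (assumption): standard normal, level alpha = 0.95.
rng = np.random.default_rng(1)
losses = rng.standard_normal(100_000)
alpha = 0.95

var = value_at_risk(losses, alpha)
es = expected_shortfall(losses, alpha)
score_at_var = pinball_score(var, losses, alpha)
```

Because Expected Shortfall averages only the losses beyond the quantile, it depends on the distribution solely through its tail, which is the defining property of a tail risk measure described above.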
- [120] arXiv:2405.04011 (replaced) [pdf, html, other]
Title: Adjoint Sensitivity Analysis on Multi-Scale Bioprocess Stochastic Reaction Network
Comments: 11 pages, 2 figures
Subjects: Molecular Networks (q-bio.MN); Machine Learning (stat.ML)
Motivated by the pressing challenges in digital twin development for biomanufacturing systems, we introduce an adjoint sensitivity analysis (SA) approach to expedite the learning of mechanistic model parameters. In this paper, we consider enzymatic stochastic reaction networks representing a multi-scale bioprocess mechanistic model that allows us to integrate disparate data from diverse production processes and leverage the information from existing macro-kinetic and genome-scale models. To support forward prediction and backward reasoning, we develop a convergent adjoint SA algorithm that studies how perturbations of model parameters and inputs (e.g., the initial state) propagate through enzymatic reaction networks and affect output trajectory predictions. This SA can provide a sample-efficient and interpretable way to assess the sensitivities between inputs and outputs, accounting for their causal dependencies. Our empirical study underscores the resilience of these sensitivities and provides deeper insight into the regulatory mechanisms behind the bioprocess.
- [121] arXiv:2405.05097 (replaced) [pdf, html, other]
Title: Biology-inspired joint distribution neurons based on Hierarchical Correlation Reconstruction allowing for multidirectional neural networks
Comments: 7 pages, 6 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Biological neural networks seem qualitatively superior (e.g., in learning, flexibility, robustness) to current artificial ones such as the Multi-Layer Perceptron (MLP) or Kolmogorov-Arnold Network (KAN). In contrast to these, they have fundamentally multidirectional signal propagation~\cite{axon}, including propagation of probability distributions, e.g., for uncertainty estimation, and are believed to be unable to use standard backpropagation training~\cite{backprop}. Novel artificial neurons based on HCR (Hierarchical Correlation Reconstruction) are proposed that remove these low-level differences: each neuron contains a local joint distribution model (of its connections), representing the joint density on normalized variables as just a linear combination of orthonormal polynomials $(f_\mathbf{j})$: $\rho(\mathbf{x})=\sum_{\mathbf{j}\in B} a_\mathbf{j} f_\mathbf{j}(\mathbf{x})$ for $\mathbf{x} \in [0,1]^d$ and $B$ some chosen basis, with the description of the joint distribution approaching completeness as the basis grows. By various index summations of the $(a_\mathbf{j})$ tensor, which serves as the neuron parameters, we get simple formulas for, e.g., conditional expected values for propagation in any direction, like $E[x|y,z]$, $E[y|x]$, which degenerate to a KAN-like parametrization if restricted to pairwise dependencies. Such an HCR network can also propagate probability distributions (including joint ones) like $\rho(y,z|x)$. It also allows for additional training approaches, such as direct $(a_\mathbf{j})$ estimation, tensor decomposition, or more biologically plausible information bottleneck training: layers directly influence only their neighbors, optimizing their content to maximize information about the next layer and minimize information about the previous one in order to reduce noise.
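The density expansion $\rho(\mathbf{x})=\sum_{\mathbf{j}} a_\mathbf{j} f_\mathbf{j}(\mathbf{x})$ and the conditional expectation $E[y|x]$ can be made concrete in two dimensions. The sketch below is an illustration under stated assumptions, not the paper's implementation: it assumes the basis consists of the first two orthonormal shifted Legendre polynomials on $[0,1]$, that variables are already normalized to $[0,1]$, and that coefficients are estimated as sample means of basis products (the usual orthogonal-series estimator). All function names are hypothetical.

```python
import numpy as np

SQRT3 = np.sqrt(3.0)

def basis(u):
    """First two orthonormal shifted Legendre polynomials on [0, 1]: f_0, f_1."""
    u = np.asarray(u, dtype=float)
    return np.stack([np.ones_like(u), SQRT3 * (2.0 * u - 1.0)])

def fit_coefficients(x, y):
    """Estimate a[j, k] as the sample mean of f_j(x) * f_k(y)."""
    fx, fy = basis(x), basis(y)  # each has shape (2, n)
    return fx @ fy.T / len(x)

def conditional_mean_y(a, x):
    """E[y | x] under rho(x, y) = sum_jk a[j, k] f_j(x) f_k(y).

    Uses the integrals over [0, 1] of y*f_0(y) = 1/2 and y*f_1(y) = 1/(2*sqrt(3)).
    """
    fx = basis(x)
    marginal = a[:, 0] @ fx  # integral of rho(x, y) over y
    first_moment = 0.5 * marginal + (a[:, 1] @ fx) / (2.0 * SQRT3)
    return first_moment / marginal

# Toy data (assumption): x uniform on [0, 1] and y identical to x.
rng = np.random.default_rng(2)
x = rng.uniform(size=20_000)
y = x.copy()

a = fit_coefficients(x, y)
pred = conditional_mean_y(a, np.array([0.25, 0.5, 0.75]))
```

For this perfectly dependent toy sample the coefficient matrix is close to the identity, and the predicted conditional means are close to the query points themselves. Swapping the roles of the two indices gives $E[x|y]$ from the same coefficients, which is the multidirectional propagation the abstract refers to.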
- [122] arXiv:2405.19440 (replaced) [pdf, html, other]
Title: On the Convergence of Multi-objective Optimization under Generalized Smoothness
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Multi-objective optimization (MOO) is receiving more attention in various fields such as multi-task learning. Recent works provide some effective algorithms with theoretical analysis, but they are limited by the standard $L$-smooth or bounded-gradient assumptions, which are typically not satisfied by neural networks such as recurrent neural networks (RNNs) and transformers. In this paper, we study a more general and realistic class of $\ell$-smooth loss functions, where $\ell$ is a general non-decreasing function of the gradient norm. We develop two novel single-loop algorithms for $\ell$-smooth MOO problems, Generalized Smooth Multi-objective Gradient descent (GSMGrad) and its stochastic variant, Stochastic Generalized Smooth Multi-objective Gradient descent (SGSMGrad), which approximate the conflict-avoidant (CA) direction that maximizes the minimum improvement among objectives. We provide a comprehensive convergence analysis of both algorithms and show that they converge to an $\epsilon$-accurate Pareto stationary point with a guaranteed $\epsilon$-level average CA distance (i.e., the gap between the updating direction and the CA direction) over all iterations, where a total of $\mathcal{O}(\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-4})$ samples are needed for the deterministic and stochastic settings, respectively. Our algorithms can also guarantee a tighter $\epsilon$-level CA distance in each iteration using more samples. Moreover, we propose a practical variant of GSMGrad named GSMGrad-FA using only constant-level time and space, while achieving the same performance guarantee as GSMGrad. Our experiments validate our theory and demonstrate the effectiveness of the proposed methods.
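To give a feel for a direction that "maximizes the minimum improvement among objectives", the sketch below uses the classical closed-form min-norm rule for two gradients (as in MGDA-style methods). This is an assumption chosen for illustration and is not GSMGrad or SGSMGrad, whose update rules the abstract does not spell out. The function names are hypothetical.

```python
import numpy as np

def min_norm_weight(g1, g2):
    """Weight w in [0, 1] minimizing the norm of w*g1 + (1 - w)*g2."""
    diff = g1 - g2
    denom = diff @ diff
    if denom == 0.0:
        # Identical gradients: every weight gives the same combination.
        return 0.5
    return float(np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0))

def common_descent_direction(g1, g2):
    """Min-norm convex combination of two gradients; step along its negative."""
    w = min_norm_weight(g1, g2)
    return w * g1 + (1.0 - w) * g2

# Two conflicting toy gradients (assumption).
g1 = np.array([1.0, 0.0])
g2 = np.array([0.0, 1.0])
d = common_descent_direction(g1, g2)

# First-order decrease of each objective for a step along -d.
improvements = np.array([g1 @ d, g2 @ d])
```

Here both entries of `improvements` are equal and positive, so a small step along `-d` decreases both objectives at the same first-order rate, whereas following either gradient alone would leave the other objective unchanged.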
- [123] arXiv:2406.00193 (replaced) [pdf, html, other]
Title: Learning topological states from randomized measurements using variational tensor network tomography
Yanting Teng, Rhine Samajdar, Katherine Van Kirk, Frederik Wilde, Subir Sachdev, Jens Eisert, Ryan Sweke, Khadijeh Najafi
Comments: 11+35 pages, 4+3 figures; Added additional references
Subjects: Quantum Physics (quant-ph); Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (stat.ML)
Learning faithful representations of quantum states is crucial to fully characterizing the variety of many-body states created on quantum processors. While various tomographic methods such as classical shadow and MPS tomography have shown promise in characterizing a wide class of quantum states, they face unique limitations in detecting topologically ordered two-dimensional states. To address this problem, we implement and study a heuristic tomographic method that combines variational optimization on tensor networks with randomized measurement techniques. Using this approach, we demonstrate its ability to learn the ground state of the surface code Hamiltonian as well as an experimentally realizable quantum spin liquid state. In particular, we perform numerical experiments using MPS ansätze and systematically investigate the sample complexity required to achieve high fidelities for systems of sizes up to $48$ qubits. In addition, we provide theoretical insights into the scaling of our learning algorithm by analyzing the statistical properties of maximum likelihood estimation. Notably, our method is sample-efficient and experimentally friendly, only requiring snapshots of the quantum state measured randomly in the $X$ or $Z$ bases. Using this subset of measurements, our approach can effectively learn any real pure state represented by a tensor network, and we rigorously prove that random-$XZ$ measurements are tomographically complete for such states.
- [124] arXiv:2406.00535 (replaced) [pdf, html, other]
Title: Causal Contrastive Learning for Counterfactual Regression Over Time
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
Estimating treatment effects over time holds significance in various domains, including precision medicine, epidemiology, economics, and marketing. This paper introduces a unique approach to counterfactual regression over time, emphasizing long-term predictions. Distinguishing itself from existing models like Causal Transformer, our approach highlights the efficacy of employing RNNs for long-term forecasting, complemented by Contrastive Predictive Coding (CPC) and Information Maximization (InfoMax). Emphasizing efficiency, we avoid the need for computationally expensive transformers. Leveraging CPC, our method captures long-term dependencies in the presence of time-varying confounders. Notably, recent models have disregarded the importance of invertible representation, compromising identification assumptions. To remedy this, we employ the InfoMax principle, maximizing a lower bound of mutual information between sequence data and its representation. Our method achieves state-of-the-art counterfactual estimation results using both synthetic and real-world data, marking the first incorporation of Contrastive Predictive Coding in causal inference.
- [125] arXiv:2406.04043 (replaced) [pdf, other]
Title: Energy-based Epistemic Uncertainty for Graph Neural Networks
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In domains with interdependent data, such as graphs, quantifying the epistemic uncertainty of a Graph Neural Network (GNN) is challenging as uncertainty can arise at different structural scales. Existing techniques neglect this issue or only distinguish between structure-aware and structure-agnostic uncertainty without combining them into a single measure. We propose GEBM, an energy-based model (EBM) that provides high-quality uncertainty estimates by aggregating energy at different structural levels that naturally arise from graph diffusion. In contrast to logit-based EBMs, we provably induce an integrable density in the data space by regularizing the energy function. We introduce an evidential interpretation of our EBM that significantly improves the predictive robustness of the GNN. Our framework is a simple and effective post hoc method applicable to any pre-trained GNN that is sensitive to various distribution shifts. It consistently achieves the best separation of in-distribution and out-of-distribution data on 6 out of 7 anomaly types while having the best average rank over shifts on \emph{all} datasets.
- [126] arXiv:2406.04824 (replaced) [pdf, html, other]
Title: FunBO: Discovering Acquisition Functions for Bayesian Optimization with FunSearch
Virginia Aglietti, Ira Ktena, Jessica Schrouff, Eleni Sgouritsa, Francisco J. R. Ruiz, Alan Malek, Alexis Bellot, Silvia Chiappa
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
The sample efficiency of Bayesian optimization algorithms depends on carefully crafted acquisition functions (AFs) guiding the sequential collection of function evaluations. The best-performing AF can vary significantly across optimization problems, often requiring ad-hoc and problem-specific choices. This work tackles the challenge of designing novel AFs that perform well across a variety of experimental settings. Based on FunSearch, a recent work using Large Language Models (LLMs) for discovery in mathematical sciences, we propose FunBO, an LLM-based method that can be used to learn new AFs written in computer code by leveraging access to a limited number of evaluations for a set of objective functions. We provide the analytic expression of all discovered AFs and evaluate them on various global optimization benchmarks and hyperparameter optimization tasks. We show how FunBO identifies AFs that generalize well in and out of the training distribution of functions, thus outperforming established general-purpose AFs and achieving competitive performance against AFs that are customized to specific function types and are learned via transfer-learning algorithms.
- [127] arXiv:2406.11011 (replaced) [pdf, html, other]
Title: Data Shapley in One Training Run
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts. However, existing approaches require re-training models on different data subsets, which is computationally intensive, foreclosing their application to large-scale models. Furthermore, they produce the same attribution score for any models produced by running the learning algorithm, meaning they cannot perform targeted attribution towards a specific model obtained from a single run of the algorithm. This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest. In its most efficient implementation, our technique incurs negligible additional runtime compared to standard model training. This dramatic efficiency improvement makes it possible to perform data attribution for the foundation model pretraining stage for the first time. We present several case studies that offer fresh insights into pretraining data's contribution and discuss their implications for copyright in generative AI and pretraining data curation.
- [128] arXiv:2406.17831 (replaced) [pdf, html, other]
Title: Empirical Bayes for Dynamic Bayesian Networks Using Generalized Variational Inference
Vyacheslav Kungurtsev, Apaar, Aarya Khandelwal, Parth Sandeep Rastogi, Bapi Chatterjee, Jakub Mareček
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST)
In this work, we demonstrate the Empirical Bayes approach to learning a Dynamic Bayesian Network. By starting with several point estimates of structure and weights, we can use a data-driven prior to subsequently obtain a model to quantify uncertainty. This approach uses a recent development of Generalized Variational Inference, and indicates the potential of sampling the uncertainty of a mixture of DAG structures as well as a parameter posterior.