Machine Learning
- [1] arXiv:2407.00644 [pdf, html, other]
Title: Clusterpath Gaussian Graphical Modeling
Comments: 43 pages, 11 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Graphical models serve as effective tools for visualizing conditional dependencies between variables. However, as the number of variables grows, interpretation becomes increasingly difficult, and estimation uncertainty increases due to the large number of parameters relative to the number of observations. To address these challenges, we introduce the Clusterpath estimator of the Gaussian Graphical Model (CGGM) that encourages variable clustering in the graphical model in a data-driven way. Through the use of a clusterpath penalty, we group variables together, which in turn results in a block-structured precision matrix whose block structure remains preserved in the covariance matrix. We present a computationally efficient implementation of the CGGM estimator by using a cyclic block coordinate descent algorithm. In simulations, we show that CGGM not only matches, but oftentimes outperforms other state-of-the-art methods for variable clustering in graphical models. We also demonstrate CGGM's practical advantages and versatility on a diverse collection of empirical applications.
- [2] arXiv:2407.00649 [pdf, html, other]
Title: Particle Semi-Implicit Variational Inference
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Semi-implicit variational inference (SIVI) enriches the expressiveness of variational families by utilizing a kernel and a mixing distribution to hierarchically define the variational distribution. Existing SIVI methods parameterize the mixing distribution using implicit distributions, leading to intractable variational densities. As a result, directly maximizing the evidence lower bound (ELBO) is not possible, so they resort to one of the following: optimizing bounds on the ELBO, employing costly inner-loop Markov chain Monte Carlo runs, or solving minimax objectives. In this paper, we propose a novel method for SIVI called Particle Variational Inference (PVI), which employs empirical measures to approximate the optimal mixing distributions, characterized as the minimizer of a natural free energy functional, via a particle approximation of a Euclidean-Wasserstein gradient flow. This approach means that, unlike prior works, PVI can directly optimize the ELBO; furthermore, it makes no parametric assumption about the mixing distribution. Our empirical results demonstrate that PVI performs favourably against other SIVI methods across various tasks. Moreover, we provide a theoretical analysis of the behaviour of the gradient flow of a related free energy functional, establishing the existence and uniqueness of solutions as well as propagation of chaos results.
- [3] arXiv:2407.00730 [pdf, html, other]
Title: D-CDLF: Decomposition of Common and Distinctive Latent Factors for Multi-view High-dimensional Data
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
A typical approach to the joint analysis of multiple high-dimensional data views is to decompose each view's data matrix into three parts: a low-rank common-source matrix generated by common latent factors of all data views, a low-rank distinctive-source matrix generated by distinctive latent factors of the corresponding data view, and an additive noise matrix. Existing decomposition methods often focus on the uncorrelatedness between the common latent factors and distinctive latent factors, but inadequately address the equally necessary uncorrelatedness between distinctive latent factors from different data views. We propose a novel decomposition method, called Decomposition of Common and Distinctive Latent Factors (D-CDLF), to effectively achieve both types of uncorrelatedness for two-view data. We also discuss the estimation of the D-CDLF under high-dimensional settings.
- [4] arXiv:2407.01015 [pdf, html, other]
Title: Bayesian Entropy Neural Networks for Physics-Aware Prediction
Comments: 15 pages
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
This paper addresses the need for deep learning models to integrate well-defined constraints into their outputs, driven by their application in surrogate models, learning with limited data and partial information, and scenarios requiring flexible model behavior to incorporate non-data sample information. We introduce Bayesian Entropy Neural Networks (BENN), a framework grounded in Maximum Entropy (MaxEnt) principles, designed to impose constraints on Bayesian Neural Network (BNN) predictions. BENN is capable of constraining not only the predicted values but also their derivatives and variances, ensuring a more robust and reliable model output. To achieve simultaneous uncertainty quantification and constraint satisfaction, we employ the method of multipliers approach. This allows for the concurrent estimation of neural network parameters and the Lagrangian multipliers associated with the constraints. Our experiments, spanning diverse applications such as beam deflection modeling and microstructure generation, demonstrate the effectiveness of BENN. The results highlight significant improvements over traditional BNNs and showcase competitive performance relative to contemporary constrained deep learning methods.
- [5] arXiv:2407.01079 [pdf, html, other]
Title: On Statistical Rates and Provably Efficient Criteria of Latent Diffusion Transformers (DiTs)
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
We investigate the statistical and computational limits of latent Diffusion Transformers (DiTs) under the low-dimensional linear latent space assumption. Statistically, we study the universal approximation and sample complexity of the DiTs score function, as well as the distribution recovery property of the initial data. Specifically, under mild data assumptions, we derive an approximation error bound for the score network of latent DiTs, which is sub-linear in the latent space dimension. Additionally, we derive the corresponding sample complexity bound and show that the data distribution generated from the estimated score function converges toward a proximate area of the original one. Computationally, we characterize the hardness of both forward inference and backward computation of latent DiTs, assuming the Strong Exponential Time Hypothesis (SETH). For forward inference, we identify efficient criteria for all possible latent DiTs inference algorithms and showcase our theory by pushing the efficiency toward almost-linear time inference. For backward computation, we leverage the low-rank structure within the gradient computation of DiTs training for possible algorithmic speedup. Specifically, we show that such speedup achieves almost-linear time latent DiTs training by casting the DiTs gradient as a series of chained low-rank approximations with bounded error. Under the low-dimensional assumption, we show that the convergence rate and the computational efficiency are both dominated by the dimension of the subspace, suggesting that latent DiTs have the potential to bypass the challenges associated with the high dimensionality of initial data.
New submissions for Tuesday, 2 July 2024 (showing 5 of 5 entries)
- [6] arXiv:2407.00143 (cross-list from cs.LG) [pdf, other]
Title: InfoNCE: Identifying the Gap Between Theory and Practice
Authors: Evgenia Rusak, Patrik Reizinger, Attila Juhos, Oliver Bringmann, Roland S. Zimmermann, Wieland Brendel
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Previous theoretical work on contrastive learning (CL) with InfoNCE showed that, under certain assumptions, the learned representations uncover the ground-truth latent factors. We argue these theories overlook crucial aspects of how CL is deployed in practice. Specifically, they assume that within a positive pair, all latent factors either vary to a similar extent, or that some do not vary at all. However, in practice, positive pairs are often generated using augmentations such as strong cropping to just a few pixels. Hence, a more realistic assumption is that all latent factors change, with a continuum of variability across these factors. We introduce AnInfoNCE, a generalization of InfoNCE that can provably uncover the latent factors in this anisotropic setting, broadly generalizing previous identifiability results in CL. We validate our identifiability results in controlled experiments and show that AnInfoNCE increases the recovery of previously collapsed information in CIFAR10 and ImageNet, albeit at the cost of downstream accuracy. Additionally, we explore and discuss further mismatches between theoretical assumptions and practical implementations, including extensions to hard negative mining and loss ensembles.
- [7] arXiv:2407.00175 (cross-list from q-bio.QM) [pdf, other]
Title: Permutation invariant multi-output Gaussian Processes for drug combination prediction in cancer
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
Dose-response prediction in cancer is an active application field in machine learning. Using large libraries of in vitro drug sensitivity screens, the goal is to develop accurate predictive models that can be used to guide experimental design or inform treatment decisions. Building on previous work that makes use of permutation invariant multi-output Gaussian Processes in the context of dose-response prediction for drug combinations, we develop a variational approximation to these models. The variational approximation enables a more scalable model that provides uncertainty quantification and naturally handles missing data. Furthermore, we propose using a deep generative model to encode the chemical space in a continuous manner, enabling prediction for new drugs and new combinations. We demonstrate the performance of our model in a simple setting using a high-throughput dataset and show that the model is able to efficiently borrow information across outputs.
- [8] arXiv:2407.00256 (cross-list from cs.AI) [pdf, html, other]
Title: One Prompt is not Enough: Automated Construction of a Mixture-of-Expert Prompts
Comments: ICML 2024. code available at this https URL
Journal-ref: Proceedings of the 41st International Conference on Machine Learning (ICML), Vienna, Austria, 2024
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Large Language Models (LLMs) exhibit strong generalization capabilities to novel tasks when prompted with language instructions and in-context demos. Since this ability sensitively depends on the quality of prompts, various methods have been explored to automate the instruction design. While these methods demonstrated promising results, they also restricted the searched prompt to one instruction. Such simplification significantly limits their capacity, as a single demo-free instruction might not be able to cover the entire complex problem space of the targeted task. To alleviate this issue, we adopt the Mixture-of-Expert paradigm and divide the problem space into a set of sub-regions; each sub-region is governed by a specialized expert, equipped with both an instruction and a set of demos. A two-phase process is developed to construct the specialized expert for each region: (1) demo assignment: inspired by the theoretical connection between in-context learning and kernel regression, we group demos into experts based on their semantic similarity; (2) instruction assignment: a region-based joint search of an instruction per expert complements the demos assigned to it, yielding a synergistic effect. The resulting method, codenamed Mixture-of-Prompts (MoP), achieves an average win rate of 81% against prior art across several major benchmarks.
- [9] arXiv:2407.00271 (cross-list from math.DS) [pdf, html, other]
Title: Minimum Reduced-Order Models via Causal Inference
Subjects: Dynamical Systems (math.DS); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
Enhancing the sparsity of data-driven reduced-order models (ROMs) has gained increasing attention in recent years. In this work, we analyze an efficient approach to identifying skillful ROMs with a sparse structure using an information-theoretic indicator called causation entropy. The causation entropy quantifies in a statistical way the additional contribution of each term to the underlying dynamics beyond the information already captured by all the other terms in the ansatz. By doing so, the causation entropy assesses the importance of each term to the dynamics before a parameter estimation procedure is performed. Thus, the approach can be utilized to eliminate terms with little dynamic impact, leading to a parsimonious structure that retains the essential physics. To circumvent the difficulty of estimating high-dimensional probability density functions (PDFs) involved in the causation entropy computation, we leverage Gaussian approximations for such PDFs, which are demonstrated to be sufficient even in the presence of highly non-Gaussian dynamics. The effectiveness of the approach is illustrated by the Kuramoto-Sivashinsky equation by building sparse causation-based ROMs for various purposes, such as recovering long-term statistics and inferring unobserved dynamics via data assimilation with partial observations.
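To make the Gaussian approximation of causation entropy concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard definition C(Z -> X | Y) = H(X | Y) - H(X | Y, Z) and evaluates both conditional entropies from a sample covariance matrix via Schur complements. The function names, the toy data, and the coefficients are illustrative assumptions.

```python
import numpy as np

def conditional_cov(cov, target, given):
    """Covariance of `target` variables conditioned on `given`, under a Gaussian model."""
    s_tt = cov[np.ix_(target, target)]
    if len(given) == 0:
        return s_tt
    s_tg = cov[np.ix_(target, given)]
    s_gg = cov[np.ix_(given, given)]
    # Schur complement: S_tt - S_tg S_gg^{-1} S_gt
    return s_tt - s_tg @ np.linalg.solve(s_gg, s_tg.T)

def gaussian_causation_entropy(samples, source, target, condition):
    """Gaussian estimate of H(target | condition) - H(target | condition, source)."""
    cov = np.cov(samples, rowvar=False)
    _, logdet_without = np.linalg.slogdet(conditional_cov(cov, target, condition))
    _, logdet_with = np.linalg.slogdet(conditional_cov(cov, target, condition + source))
    return 0.5 * (logdet_without - logdet_with)

# Toy data (hypothetical): x depends on z1 and y, but not on z2.
rng = np.random.default_rng(0)
n = 5000
z1 = rng.normal(size=n)   # candidate term that matters
z2 = rng.normal(size=n)   # candidate term that does not matter
y = rng.normal(size=n)    # another term in the ansatz
x = 1.5 * z1 + 0.5 * y + 0.1 * rng.normal(size=n)
data = np.column_stack([x, y, z1, z2])   # columns: 0=x, 1=y, 2=z1, 3=z2

ce_relevant = gaussian_causation_entropy(data, source=[2], target=[0], condition=[1, 3])
ce_irrelevant = gaussian_causation_entropy(data, source=[3], target=[0], condition=[1, 2])
```

In this toy setting the term `z1` receives a large causation entropy while `z2` receives a value near zero, which is the kind of signal that could be thresholded to drop terms before any parameter estimation.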
- [10] arXiv:2407.00364 (cross-list from stat.ME) [pdf, html, other]
Title: Medical Knowledge Integration into Reinforcement Learning Algorithms for Dynamic Treatment Regimes
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
The goal of precision medicine is to provide individualized treatment at each stage of chronic diseases, a concept formalized by Dynamic Treatment Regimes (DTR). These regimes adapt treatment strategies based on decision rules learned from clinical data to enhance therapeutic effectiveness. Reinforcement Learning (RL) algorithms make it possible to determine these decision rules conditioned on individual patient data and medical history. Integrating medical expertise into these models makes it possible to increase confidence in treatment recommendations and facilitates the adoption of this approach by healthcare professionals and patients. In this work, we examine the mathematical foundations of RL, contextualize its application in the field of DTR, and present an overview of methods to improve its effectiveness by integrating medical expertise.
- [11] arXiv:2407.00397 (cross-list from cs.LG) [pdf, html, other]
Title: Markovian Gaussian Process: A Universal State-Space Representation for Stationary Temporal Gaussian Process
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Gaussian Processes (GPs) and Linear Dynamical Systems (LDSs) are essential time series and dynamic system modeling tools. GPs can handle complex, nonlinear dynamics but are computationally demanding, while LDSs offer efficient computation but lack the expressive power of GPs. To combine their benefits, we introduce a universal method that allows an LDS to mirror stationary temporal GPs. This state-space representation, known as the Markovian Gaussian Process (Markovian GP), leverages the flexibility of kernel functions while maintaining efficient linear computation. Unlike existing GP-LDS conversion methods, which require separability for most multi-output kernels, our approach works universally for single- and multi-output stationary temporal kernels. We evaluate our method by computing covariance, performing regression tasks, and applying it to a neuroscience application, demonstrating that our method provides an accurate state-space representation for stationary temporal GPs.
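As a small illustration of what it means for an LDS to mirror a stationary temporal GP, the sketch below uses a textbook special case rather than the paper's universal construction: on a regular time grid, a GP with the exponential (Ornstein-Uhlenbeck) kernel has exactly the same covariance as a scalar AR(1) linear dynamical system. The function names and parameter values are illustrative assumptions.

```python
import numpy as np

def exp_kernel(t, variance=1.0, lengthscale=2.0):
    """Exponential (Ornstein-Uhlenbeck) kernel matrix k(t_i, t_j)."""
    diff = np.abs(t[:, None] - t[None, :])
    return variance * np.exp(-diff / lengthscale)

def lds_covariance(n_steps, dt, variance=1.0, lengthscale=2.0):
    """Covariance of x_{k+1} = a x_k + w_k with w_k ~ N(0, q) and x_0 ~ N(0, variance)."""
    a = np.exp(-dt / lengthscale)    # state transition over one time step
    q = variance * (1.0 - a ** 2)    # process noise that keeps the process stationary
    cov = np.zeros((n_steps, n_steps))
    p = variance                     # marginal variance of the current state
    for i in range(n_steps):
        cov[i, i] = p
        for j in range(i + 1, n_steps):
            # Cov(x_i, x_j) = a^(j - i) Var(x_i) for j > i
            cov[i, j] = cov[j, i] = (a ** (j - i)) * p
        p = a * p * a + q            # propagate the marginal variance one step
    return cov

t = np.arange(6) * 0.5
K_gp = exp_kernel(t)
K_lds = lds_covariance(len(t), dt=0.5)
max_gap = np.max(np.abs(K_gp - K_lds))   # agreement up to floating-point error
```

The state-space form needs only a constant amount of work per time step in a Kalman-style recursion, which is the efficiency benefit the abstract refers to; richer kernels require higher-dimensional states.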
- [12] arXiv:2407.00490 (cross-list from cs.LG) [pdf, html, other]
Title: Toward Global Convergence of Gradient EM for Over-Parameterized Gaussian Mixture Models
Comments: 25 pages
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
We study the gradient Expectation-Maximization (EM) algorithm for Gaussian Mixture Models (GMM) in the over-parameterized setting, where a general GMM with $n>1$ components learns from data that are generated by a single ground truth Gaussian distribution. While results for the special case of 2-Gaussian mixtures are well-known, a general global convergence analysis for arbitrary $n$ remains unresolved and faces several new technical barriers since the convergence becomes sub-linear and non-monotonic. To address these challenges, we construct a novel likelihood-based convergence analysis framework and rigorously prove that gradient EM converges globally with a sublinear rate $O(1/\sqrt{t})$. This is the first global convergence result for Gaussian mixtures with more than $2$ components. The sublinear convergence rate is due to the algorithmic nature of learning over-parameterized GMM with gradient EM. We also identify a new emerging technical challenge for learning general over-parameterized GMM: the existence of bad local regions that can trap gradient EM for an exponential number of steps.
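The over-parameterized setting can be sketched in a few lines. The example below is an illustration under simplifying assumptions that may differ from the paper's exact setup: equal, fixed mixing weights, identity covariances, and only the component means being learned. Each iteration computes responsibilities (E-step) and then takes a single gradient step on the means instead of a full M-step. All names and the step size are hypothetical.

```python
import numpy as np

def gradient_em_step(x, means, step_size):
    """One gradient EM step for an equal-weight, unit-covariance GMM (means only)."""
    # E-step: responsibilities psi[s, i] proportional to exp(-||x_s - mu_i||^2 / 2)
    sq_dist = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    logits = -0.5 * sq_dist
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    psi = np.exp(logits)
    psi /= psi.sum(axis=1, keepdims=True)
    # Gradient of the Q-function with respect to each mean, averaged over samples
    grad = (psi[:, :, None] * (x[:, None, :] - means[None, :, :])).mean(axis=0)
    return means + step_size * grad

def avg_log_likelihood(x, means):
    """Average log-likelihood of the data under the equal-weight, unit-covariance GMM."""
    d = x.shape[1]
    sq_dist = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    log_comp = -0.5 * sq_dist - 0.5 * d * np.log(2 * np.pi) - np.log(means.shape[0])
    m = log_comp.max(axis=1, keepdims=True)
    return float(np.mean(m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))))

rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 2))                # data from a single Gaussian N(0, I)
means = rng.normal(scale=2.0, size=(3, 2))    # n = 3 components (over-parameterized)

ll_start = avg_log_likelihood(x, means)
for _ in range(500):
    means = gradient_em_step(x, means, step_size=0.5)
ll_end = avg_log_likelihood(x, means)
```

Tracking `means` across iterations in such a toy run is one way to observe how slowly the redundant components collapse toward the single ground-truth mean.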
- [13] arXiv:2407.00492 (cross-list from cs.LG) [pdf, html, other]
Title: Fast Gibbs sampling for the local and global trend Bayesian exponential smoothing model
Subjects: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
In Smyl et al. [Local and global trend Bayesian exponential smoothing models. International Journal of Forecasting, 2024.], a generalised exponential smoothing model was proposed that is able to capture strong trends and volatility in time series. This method achieved state-of-the-art performance in many forecasting tasks, but its fitting procedure, which is based on the NUTS sampler, is very computationally expensive. In this work, we propose several modifications to the original model, as well as a bespoke Gibbs sampler for posterior exploration; these changes improve sampling time by an order of magnitude, thus rendering the model much more practically relevant. The new model, and sampler, are evaluated on the M3 dataset and are shown to be competitive, or superior, in terms of accuracy to the original method, while being substantially faster to run.
- [14] arXiv:2407.00529 (cross-list from cs.LG) [pdf, html, other]
Title: Detecting and Identifying Selection Structure in Sequential Data
Comments: ICML 2024
Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS); Statistics Theory (math.ST); Machine Learning (stat.ML)
We argue that the selective inclusion of data points based on latent objectives is common in practical situations, such as music sequences. Since this selection process often distorts statistical analysis, previous work primarily views it as a bias to be corrected and proposes various methods to mitigate its effect. However, while controlling this bias is crucial, selection also offers an opportunity to provide a deeper insight into the hidden generation process, as it is a fundamental mechanism underlying what we observe. In particular, overlooking selection in sequential data can lead to an incomplete or overcomplicated inductive bias in modeling, such as assuming a universal autoregressive structure for all dependencies. Therefore, rather than merely viewing it as a bias, we explore the causal structure of selection in sequential data to delve deeper into the complete causal process. Specifically, we show that selection structure is identifiable without any parametric assumptions or interventional experiments. Moreover, even in cases where selection variables coexist with latent confounders, we still establish the nonparametric identifiability under appropriate structural conditions. Meanwhile, we also propose a provably correct algorithm to detect and identify selection structures as well as other types of dependencies. The framework has been validated empirically on both synthetic data and real-world music.
- [15] arXiv:2407.00584 (cross-list from cs.LG) [pdf, html, other]
Title: Hyperparameter Optimization for Randomized Algorithms: A Case Study for Random Features
Subjects: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
Randomized algorithms exploit stochasticity to reduce computational complexity. One important example is random feature regression (RFR) that accelerates Gaussian process regression (GPR). RFR approximates an unknown function with a random neural network whose hidden weights and biases are sampled from a probability distribution. Only the final output layer is fit to data. In randomized algorithms like RFR, the hyperparameters that characterize the sampling distribution greatly impact performance, yet are not directly accessible from samples. This makes optimization of hyperparameters via standard (gradient-based) optimization tools inapplicable. Inspired by Bayesian ideas from GPR, this paper introduces a random objective function that is tailored for hyperparameter tuning of vector-valued random features. The objective is minimized with ensemble Kalman inversion (EKI). EKI is a gradient-free particle-based optimizer that is scalable to high-dimensions and robust to randomness in objective functions. A numerical study showcases the new black-box methodology to learn hyperparameter distributions in several problems that are sensitive to the hyperparameter selection: two global sensitivity analyses, integrating a chaotic dynamical system, and solving a Bayesian inverse problem from atmospheric dynamics. The success of the proposed EKI-based algorithm for RFR suggests its potential for automated optimization of hyperparameters arising in other randomized algorithms.
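The basic random feature regression setup described above can be sketched as follows. This is a minimal scalar-output example with random Fourier features and a ridge-regularized output layer; it does not implement the paper's EKI-based hyperparameter tuning or vector-valued features. The `lengthscale`, `ridge`, and data choices are illustrative assumptions, with `lengthscale` playing the role of a hyperparameter of the sampling distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_features(x, weights, biases):
    """Hidden layer of a random network: cosine features with fixed weights and biases."""
    return np.sqrt(2.0 / weights.shape[0]) * np.cos(x @ weights.T + biases)

# Toy 1-D regression data (illustrative)
x_train = np.linspace(-3, 3, 200)[:, None]
y_train = np.sin(2 * x_train[:, 0])

n_features = 300
lengthscale = 1.0   # hyperparameter of the distribution the hidden weights are drawn from
weights = rng.normal(0.0, 1.0 / lengthscale, size=(n_features, 1))
biases = rng.uniform(0.0, 2 * np.pi, size=n_features)

# Only the output layer is fit to data (ridge-regularized least squares)
Phi = random_features(x_train, weights, biases)
ridge = 1e-4
coef = np.linalg.solve(Phi.T @ Phi + ridge * np.eye(n_features), Phi.T @ y_train)

x_test = np.linspace(-2.5, 2.5, 50)[:, None]
pred = random_features(x_test, weights, biases) @ coef
test_rmse = np.sqrt(np.mean((pred - np.sin(2 * x_test[:, 0])) ** 2))
```

Because the fit depends on `lengthscale` only through the random draw of `weights`, the resulting error is a random function of the hyperparameter, which is why gradient-free optimizers are a natural fit for tuning it.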
- [16] arXiv:2407.00706 (cross-list from cs.LG) [pdf, html, other]
Title: Sum-of-norms regularized Nonnegative Matrix Factorization
Comments: 22 pages, 12 figures
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
When applying nonnegative matrix factorization (NMF), generally the rank parameter is unknown. Such rank in NMF, called the nonnegative rank, is usually estimated heuristically since computing its exact value is NP-hard. In this work, we propose an approximation method to estimate such rank while solving NMF on-the-fly. We use sum-of-norm (SON), a group-lasso structure that encourages pairwise similarity, to reduce the rank of a factor matrix where the rank is overestimated at the beginning. On various datasets, SON-NMF is able to reveal the correct nonnegative rank of the data without any prior knowledge or tuning.
SON-NMF is a nonconvex, nonsmooth, non-separable, non-proximable problem, so solving it is nontrivial. First, as rank estimation in NMF is NP-hard, the proposed approach does not enjoy a lower computational complexity. Using a graph-theoretic argument, we prove that the complexity of the SON-NMF is almost irreducible. Second, the per-iteration cost of any algorithm solving SON-NMF is possibly high, which motivated us to propose a first-order BCD algorithm that approximately solves SON-NMF with a low per-iteration cost by using the proximal average operator. Lastly, we propose a simple greedy method for post-processing.
SON-NMF exhibits favourable features for applications. Besides the ability to automatically estimate the rank from data, SON-NMF can deal with rank-deficient data matrices and can detect weak components with small energy. Furthermore, in the application of hyperspectral imaging, SON-NMF handles the issue of spectral variability naturally.
- [17] arXiv:2407.00710 (cross-list from cs.LG) [pdf, html, other]
Title: Weighted Missing Linear Discriminant Analysis: An Explainable Approach for Classification with Missing Data
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
As Artificial Intelligence (AI) models are gradually being adopted in real-life applications, the explainability of the model used is critical, especially in high-stakes areas such as medicine and finance. Among the commonly used models, Linear Discriminant Analysis (LDA) is a widely used classification tool that is also explainable thanks to its ability to model class distributions and maximize class separation through linear feature combinations. Nevertheless, real-world data is frequently incomplete, presenting significant challenges for classification tasks and model explanations. In this paper, we propose a novel approach to LDA under missing data, termed Weighted missing Linear Discriminant Analysis (WLDA), which classifies observations in data containing missing values directly, without imputation, by estimating the parameters directly on the incomplete data and using a weight matrix to penalize missing entries during classification. Furthermore, we analyze the theoretical properties and examine the explainability of the proposed technique in a comprehensive manner. Experimental results demonstrate that WLDA outperforms conventional methods by a significant margin, particularly in scenarios where missing values are present in both training and test sets.
- [18] arXiv:2407.00745 (cross-list from cs.LG) [pdf, other]
Title: Posterior Sampling with Denoising Oracles via Tilted Transport
Subjects: Machine Learning (cs.LG); Probability (math.PR); Computation (stat.CO); Machine Learning (stat.ML)
Score-based diffusion models have significantly advanced high-dimensional data generation across various domains, by learning a denoising oracle (or score) from datasets. From a Bayesian perspective, they offer a realistic modeling of data priors and facilitate solving inverse problems through posterior sampling. Although many heuristic methods have been developed recently for this purpose, they lack the quantitative guarantees needed in many scientific applications.
In this work, we introduce the tilted transport technique, which leverages the quadratic structure of the log-likelihood in linear inverse problems in combination with the prior denoising oracle to transform the original posterior sampling problem into a new "boosted" posterior that is provably easier to sample from. We quantify the conditions under which this boosted posterior is strongly log-concave, highlighting the dependencies on the condition number of the measurement matrix and the signal-to-noise ratio. The resulting posterior sampling scheme is shown to reach the computational threshold predicted for sampling Ising models [Kunisky'23] with a direct analysis, and is further validated on high-dimensional Gaussian mixture models and scalar field $\varphi^4$ models.
- [19] arXiv:2407.00765 (cross-list from cs.LG) [pdf, html, other]
Title: Structured and Balanced Multi-component and Multi-layer Neural Networks
Comments: Our codes and implementation details are available at this https URL
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Numerical Analysis (math.NA); Machine Learning (stat.ML)
In this work, we propose a balanced multi-component and multi-layer neural network (MMNN) structure to approximate functions with complex features with both accuracy and efficiency in terms of degrees of freedom and computation cost. The main idea is motivated by a multi-component and multi-layer decomposition, in which each component can be approximated effectively by a single-layer network, following a "divide-and-conquer" type of strategy to deal with a complex function. Although MMNNs are an easy modification of fully connected neural networks (FCNNs) or multi-layer perceptrons (MLPs), obtained by introducing balanced multi-component structures in the network, they achieve a significant reduction of training parameters, a much more efficient training process, and much improved accuracy compared to FCNNs or MLPs. Extensive numerical experiments are presented to illustrate the effectiveness of MMNNs in approximating highly oscillatory functions and their automatic adaptivity in capturing localized features.
- [20] arXiv:2407.00927 (cross-list from cs.LG) [pdf, html, other]
Title: Learnability of Parameter-Bounded Bayes Nets
Comments: 15 pages, 2 figures
Subjects: Machine Learning (cs.LG); Computational Complexity (cs.CC); Machine Learning (stat.ML)
Bayes nets are extensively used in practice to efficiently represent joint probability distributions over a set of random variables and capture dependency relations. In a seminal paper, Chickering et al. (JMLR 2004) showed that given a distribution $P$, that is defined as the marginal distribution of a Bayes net, it is $\mathsf{NP}$-hard to decide whether there is a parameter-bounded Bayes net that represents $P$. They called this problem LEARN. In this work, we extend the $\mathsf{NP}$-hardness result of LEARN and prove the $\mathsf{NP}$-hardness of a promise search variant of LEARN, whereby the Bayes net in question is guaranteed to exist and one is asked to find such a Bayes net. We complement our hardness result with a positive result about the sample complexity that is sufficient to recover a parameter-bounded Bayes net that is close (in TV distance) to a given distribution $P$, that is represented by some parameter-bounded Bayes net, generalizing a degree-bounded sample complexity result of Brustle et al. (EC 2020).
- [21] arXiv:2407.00950 (cross-list from cs.LG) [pdf, html, other]
Title: Causal Bandits: The Pareto Optimal Frontier of Adaptivity, a Reduction to Linear Bandits, and Limitations around Unknown Marginals
Comments: Accepted to ICML 2024
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In this work, we investigate the problem of adapting to the presence or absence of causal structure in multi-armed bandit problems. In addition to the usual reward signal, we assume the learner has access to additional variables, observed in each round after acting. When these variables $d$-separate the action from the reward, existing work in causal bandits demonstrates that one can achieve strictly better (minimax) rates of regret (Lu et al., 2020). Our goal is to adapt to this favorable "conditionally benign" structure, if it is present in the environment, while simultaneously recovering worst-case minimax regret, if it is not. Notably, the learner has no prior knowledge of whether the favorable structure holds. In this paper, we establish the Pareto optimal frontier of adaptive rates. We prove upper and matching lower bounds on the possible trade-offs in the performance of learning in conditionally benign and arbitrary environments, resolving an open question raised by Bilodeau et al. (2022). Furthermore, we are the first to obtain instance-dependent bounds for causal bandits, by reducing the problem to the linear bandit setting. Finally, we examine the common assumption that the marginal distributions of the post-action contexts are known and show that a nontrivial estimate is necessary for better-than-worst-case minimax rates.
- [22] arXiv:2407.00957 (cross-list from cs.NE) [pdf, html, other]
Title: Expressivity of Neural Networks with Random Weights and Learned Biases
Authors: Ezekiel Williams, Avery Hee-Woon Ryoo, Thomas Jiralerspong, Alexandre Payeur, Matthew G. Perich, Luca Mazzucatto, Guillaume Lajoie
Subjects: Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
Landmark universal function approximation results for neural networks with trained weights and biases provided impetus for the ubiquitous use of neural networks as learning models in Artificial Intelligence (AI) and neuroscience. Recent work has pushed the bounds of universal approximation by showing that arbitrary functions can similarly be learned by tuning smaller subsets of parameters, for example the output weights, within randomly initialized networks. Motivated by the fact that biases can be interpreted as biologically plausible mechanisms for adjusting unit outputs in neural networks, such as tonic inputs or activation thresholds, we investigate the expressivity of neural networks with random weights where only biases are optimized. We provide theoretical and numerical evidence demonstrating that feedforward neural networks with fixed random weights can be trained to perform multiple tasks by learning biases only. We further show that an equivalent result holds for recurrent neural networks predicting dynamical system trajectories. Our results are relevant to neuroscience, where they demonstrate the potential for behaviourally relevant changes in dynamics without modifying synaptic weights, as well as for AI, where they shed light on multi-task methods such as bias fine-tuning and unit masking.
- [23] arXiv:2407.01036 (cross-list from stat.ME) [pdf, html, other]
-
Title: Ranking by Lifts: A Cost-Benefit Approach to Large-Scale A/B TestsSubjects: Methodology (stat.ME); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
A/B testers conducting large-scale tests prioritize lifts and want to be able to control false rejections of the null. This work develops a decision-theoretic framework for maximizing profits subject to false discovery rate (FDR) control. We build an empirical Bayes solution for the problem via the greedy knapsack approach. We derive an oracle rule based on ranking the ratio of expected lifts and the cost of wrong rejections using the local false discovery rate (lfdr) statistic. Our oracle decision rule is valid and optimal for large-scale tests. Further, we establish asymptotic validity for the data-driven procedure and demonstrate finite-sample validity in experimental studies. We also demonstrate the merit of the proposed method over other FDR control methods. Finally, we discuss an application to actual Optimizely experiments.
- [24] arXiv:2407.01111 (cross-list from cs.LG) [pdf, html, other]
-
Title: Proximity Matters: Local Proximity Preserved Balancing for Treatment Effect EstimationComments: Code is available at https://anonymous.4open.science/status/ncr-B697Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Heterogeneous treatment effect (HTE) estimation from observational data poses significant challenges due to treatment selection bias. Existing methods address this bias by minimizing distribution discrepancies between treatment groups in latent space, focusing on global alignment. However, the fruitful aspect of local proximity, where similar units exhibit similar outcomes, is often overlooked. In this study, we propose Proximity-aware Counterfactual Regression (PCR) to exploit proximity for representation balancing within the HTE estimation context. Specifically, we introduce a local proximity preservation regularizer based on optimal transport to depict the local proximity in discrepancy calculation. Furthermore, to overcome the curse of dimensionality that renders the estimation of discrepancy ineffective, exacerbated by limited data availability for HTE estimation, we develop an informative subspace projector, which trades off minimal distance precision for improved sample complexity. Extensive experiments demonstrate that PCR accurately matches units across different treatment groups, effectively mitigates treatment selection bias, and significantly outperforms competitors. Code is available at https://anonymous.4open.science/status/ncr-B697.
- [25] arXiv:2407.01115 (cross-list from cs.LG) [pdf, html, other]
-
Title: Enabling Mixed Effects Neural Networks for Diverse, Clustered Data Using Monte Carlo MethodsAndrej Tschalzev, Paul Nitschke, Lukas Kirchdorfer, Stefan Lüdtke, Christian Bartelt, Heiner StuckenschmidtSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Neural networks often assume independence among input data samples, disregarding correlations arising from inherent clustering patterns in real-world datasets (e.g., due to different sites or repeated measurements). Recently, mixed effects neural networks (MENNs) which separate cluster-specific 'random effects' from cluster-invariant 'fixed effects' have been proposed to improve generalization and interpretability for clustered data. However, existing methods only allow for approximate quantification of cluster effects and are limited to regression and binary targets with only one clustering feature. We present MC-GMENN, a novel approach employing Monte Carlo methods to train Generalized Mixed Effects Neural Networks. We empirically demonstrate that MC-GMENN outperforms existing mixed effects deep learning models in terms of generalization performance, time complexity, and quantification of inter-cluster variance. Additionally, MC-GMENN is applicable to a wide range of datasets, including multi-class classification tasks with multiple high-cardinality categorical features. For these datasets, we show that MC-GMENN outperforms conventional encoding and embedding methods, simultaneously offering a principled methodology for interpreting the effects of clustering patterns.
- [26] arXiv:2407.01171 (cross-list from cs.LG) [pdf, html, other]
-
Title: Neural Conditional Probability for InferenceVladimir R. Kostic, Karim Lounici, Gregoire Pacreau, Pietro Novelli, Giacomo Turri, Massimiliano PontilSubjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
We introduce NCP (Neural Conditional Probability), a novel operator-theoretic approach for learning conditional distributions with a particular focus on inference tasks. NCP can be used to build conditional confidence regions and extract important statistics like conditional quantiles, mean, and covariance. It offers streamlined learning through a single unconditional training phase, facilitating efficient inference without the need for retraining even when conditioning changes. By tapping into the powerful approximation capabilities of neural networks, our method efficiently handles a wide variety of complex probability distributions, effectively dealing with nonlinear relationships between input and output variables. Theoretical guarantees ensure both optimization consistency and statistical accuracy of the NCP method. Our experiments show that our approach matches or beats leading methods using a simple Multi-Layer Perceptron (MLP) with two hidden layers and GELU activations. This demonstrates that a minimalistic architecture with a theoretically grounded loss function can achieve competitive results without sacrificing performance, even in the face of more complex architectures.
- [27] arXiv:2407.01316 (cross-list from cs.LG) [pdf, html, other]
-
Title: Evaluating Model Performance Under Worst-case SubpopulationsComments: Earlier version appeared in the proceedings of Advances in Neural Information Processing Systems 34 (NeurIPS 2021): this https URLSubjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Machine Learning (stat.ML)
The performance of ML models degrades when the training population is different from that seen under operation. Towards assessing distributional robustness, we study the worst-case performance of a model over all subpopulations of a given size, defined with respect to core attributes Z. This notion of robustness can consider arbitrary (continuous) attributes Z, and automatically accounts for complex intersectionality in disadvantaged groups. We develop a scalable yet principled two-stage estimation procedure that can evaluate the robustness of state-of-the-art models. We prove that our procedure enjoys several finite-sample convergence guarantees, including dimension-free convergence. Instead of overly conservative notions based on Rademacher complexities, our evaluation error depends on the dimension of Z only through the out-of-sample error in estimating the performance conditional on Z. On real datasets, we demonstrate that our method certifies the robustness of a model and prevents deployment of unreliable models.
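As a rough illustration of the worst-case subpopulation idea described above, the sketch below computes a plug-in quantity: given estimates of the model's expected loss conditional on the core attributes Z for each example, it averages the largest fraction `alpha` of them. The function name and the toy numbers are hypothetical, and this is only a simplified stand-in, not the two-stage estimation procedure the abstract refers to. The result matches the worst subpopulation of size exactly `alpha` only when `alpha * n` is an integer.

```python
import math

def worst_case_subpopulation_risk(conditional_risks, alpha):
    """Mean of the largest ceil(alpha * n) conditional-risk estimates.

    conditional_risks: per-example estimates of E[loss | Z] (assumed given).
    alpha: minimum subpopulation proportion, in (0, 1].
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    n = len(conditional_risks)
    if n == 0:
        raise ValueError("need at least one risk estimate")
    k = math.ceil(alpha * n)  # size of the worst subpopulation in the sample
    worst = sorted(conditional_risks, reverse=True)[:k]
    return sum(worst) / k

# Hypothetical conditional-risk estimates for four examples.
risks = [0.1, 0.2, 0.9, 0.8]
half = worst_case_subpopulation_risk(risks, 0.5)   # averages 0.9 and 0.8
whole = worst_case_subpopulation_risk(risks, 1.0)  # ordinary average risk
```

With `alpha = 1` the quantity reduces to the ordinary average risk, and smaller values of `alpha` focus on ever smaller, harder subpopulations.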
- [28] arXiv:2407.01371 (cross-list from cs.LG) [pdf, html, other]
-
Title: Binary Losses for Density Ratio EstimationSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Estimating the ratio of two probability densities from finitely many observations of the densities is a central problem in machine learning and statistics. A large class of methods constructs estimators from binary classifiers which distinguish observations from the two densities. However, the error of these constructions depends on the choice of the binary loss function, raising the question of which loss function to choose based on desired error properties. In this work, we start from prescribed error measures in a class of Bregman divergences and characterize all loss functions that lead to density ratio estimators with a small error. Our characterization provides a simple recipe for constructing loss functions with certain properties, such as loss functions that prioritize an accurate estimation of large values. This contrasts with classical loss functions, such as the logistic loss or boosting loss, which prioritize accurate estimation of small values. We provide numerical illustrations with kernel methods and test their performance in applications of parameter selection for deep domain adaptation.
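The classifier-based construction mentioned above can be sketched with the classical logistic loss: train a binary classifier to separate samples of p from samples of q, then read the density ratio off its logit. The snippet below is a minimal pure-Python sketch under assumed toy data (p = N(1, 1), q = N(0, 1), for which the true log-ratio is x - 0.5). All function names are hypothetical, and it illustrates only the standard logistic-loss baseline, not the loss functions characterized in the paper.

```python
import math
import random

def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def fit_logistic(xs, ys, lr=1.0, steps=300):
    """Fit P(y=1 | x) = sigmoid(a + b*x) by gradient descent on the mean logistic loss."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(a + b * x) - y
            grad_a += err
            grad_b += err * x
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def density_ratio(x, a, b, n_p, n_q):
    """Estimate p(x)/q(x) as exp(logit), corrected for unequal class sizes."""
    return math.exp(a + b * x) * (n_q / n_p)

random.seed(0)
n = 2000
xs_p = [random.gauss(1.0, 1.0) for _ in range(n)]  # samples from p, labelled 1
xs_q = [random.gauss(0.0, 1.0) for _ in range(n)]  # samples from q, labelled 0
a, b = fit_logistic(xs_p + xs_q, [1.0] * n + [0.0] * n)
ratio_at_midpoint = density_ratio(0.5, a, b, n, n)  # true ratio at x = 0.5 is 1
```

In this toy setting the fitted intercept and slope should land near -0.5 and 1, since the true log-ratio is linear in x. Swapping the logistic loss for another binary loss changes where the estimator is most accurate, which is the question the abstract addresses.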
- [29] arXiv:2407.01495 (cross-list from stat.CO) [pdf, html, other]
-
Title: Multifidelity Cross-validationComments: arXiv admin note: text overlap with arXiv:2203.01436Subjects: Computation (stat.CO); Machine Learning (stat.ML)
Emulating the mapping between quantities of interest and their control parameters using surrogate models finds widespread application in engineering design, including in numerical optimization and uncertainty quantification. Gaussian process models can serve as a probabilistic surrogate model of unknown functions, thereby making them highly suitable for engineering design and decision-making in the presence of uncertainty. In this work, we are interested in emulating quantities of interest observed from models of a system at multiple fidelities, which trade accuracy for computational efficiency. Using multifidelity Gaussian process models, to efficiently fuse models at multiple fidelities, we propose a novel method to actively learn the surrogate model via leave-one-out cross-validation (LOO-CV). Our proposed multifidelity cross-validation (MFCV) approach develops an adaptive approach to reduce the LOO-CV error at the target (highest) fidelity, by learning the correlations between the LOO-CV at all fidelities. MFCV develops a two-step lookahead policy to select optimal input-fidelity pairs, both in sequence and in batches, both for continuous and discrete fidelity spaces. We demonstrate the utility of our method on several synthetic test problems as well as on the thermal stress analysis of a gas turbine blade.
- [30] arXiv:2407.01526 (cross-list from cs.LG) [pdf, other]
-
Title: Scalable Nested Optimization for Deep LearningComments: View more research details at this https URLSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC); Machine Learning (stat.ML)
Gradient-based optimization has been critical to the success of machine learning, updating a single set of parameters to minimize a single loss. A growing number of applications rely on a generalization of this, where we have a bilevel or nested optimization in which subsets of parameters update on different objectives nested inside each other. We focus on motivating examples of hyperparameter optimization and generative adversarial networks. However, naively applying classical methods often fails when we look at solving these nested problems on a large scale. In this thesis, we build tools for nested optimization that scale to deep learning setups.
Cross submissions for Tuesday, 2 July 2024 (showing 25 of 25 entries)
- [31] arXiv:2211.07866 (replaced) [pdf, html, other]
-
Title: Efficient Estimation for Longitudinal Networks via Adaptive MergingComments: 30 pages and 4 figures; appendix including technical proof will be uploaded laterSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
A longitudinal network consists of a sequence of temporal edges among multiple nodes, where the temporal edges are observed in real time. Such networks have become ubiquitous with the rise of online social platforms and e-commerce, but remain largely under-investigated in the literature. In this paper, we propose an efficient estimation framework for longitudinal networks, leveraging the strengths of adaptive network merging, tensor decomposition, and point processes. It merges neighboring sparse networks so as to enlarge the number of observed edges and reduce estimation variance, whereas the estimation bias introduced by network merging is controlled by exploiting local temporal structures for adaptive network neighborhood. A projected gradient descent algorithm is proposed to facilitate estimation, where the upper bound of the estimation error in each iteration is established. A thorough analysis is conducted to quantify the asymptotic behavior of the proposed method, which shows that it can significantly reduce the estimation error and also provides guidelines for network merging under various scenarios. We further demonstrate the advantage of the proposed method through extensive numerical experiments on synthetic datasets and a militarized interstate dispute dataset.
- [32] arXiv:2306.00541 (replaced) [pdf, html, other]
-
Title: Decomposing Global Feature Effects Based on Feature InteractionsSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Global feature effect methods, such as partial dependence plots, provide an intelligible visualization of the expected marginal feature effect. However, such global feature effect methods can be misleading, as they do not represent local feature effects of single observations well when feature interactions are present. We formally introduce generalized additive decomposition of global effects (GADGET), which is a new framework based on recursive partitioning to find interpretable regions in the feature space such that the interaction-related heterogeneity of local feature effects is minimized. We provide a mathematical foundation of the framework and show that it is applicable to the most popular methods to visualize marginal feature effects, namely partial dependence, accumulated local effects, and Shapley additive explanations (SHAP) dependence. Furthermore, we introduce and validate a new permutation-based interaction test to detect significant feature interactions that is applicable to any feature effect method that fits into our proposed framework. We empirically evaluate the theoretical characteristics of the proposed methods based on various feature effect methods in different experimental settings. Moreover, we apply our introduced methodology to three real-world examples to showcase their usefulness.
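To make the point about misleading global effects concrete, the sketch below uses a toy model f(x1, x2) = x1 * x2, a hypothetical choice for illustration. The global partial dependence of x1 is flat because the interaction with x2 averages out, while splitting the data on the sign of x2 recovers two clear regional effects. The split here is hand-picked; it only illustrates the kind of region a recursive-partitioning method could search for and is not the GADGET algorithm itself.

```python
def model(x1, x2):
    # Toy prediction function with a pure interaction between x1 and x2.
    return x1 * x2

def partial_dependence(f, grid, x2_values):
    """Average prediction over the observed x2 values at each grid point of x1."""
    return [sum(f(g, x2) for x2 in x2_values) / len(x2_values) for g in grid]

grid = [0.0, 1.0, 2.0]
x2_observed = [-1.0, -0.5, 0.5, 1.0]  # symmetric around zero

# Global effect: the interaction cancels, so the curve is flat at zero.
pd_global = partial_dependence(model, grid, x2_observed)

# Regional effects after a hand-picked split on the sign of x2.
pd_positive = partial_dependence(model, grid, [v for v in x2_observed if v > 0])
pd_negative = partial_dependence(model, grid, [v for v in x2_observed if v < 0])
```

Here `pd_global` is zero everywhere, whereas `pd_positive` rises and `pd_negative` falls with x1, so the regional curves describe individual observations far better than the global one.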
- [33] arXiv:2308.09790 (replaced) [pdf, html, other]
-
Title: A Two-Part Machine Learning Approach to Characterizing Network Interference in A/B TestingComments: 47 pagesSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
The reliability of controlled experiments, commonly referred to as "A/B tests," is often compromised by network interference, where the outcomes of individual units are influenced by interactions with others. Significant challenges in this domain include the lack of accounting for complex social network structures and the difficulty in suitably characterizing network interference. To address these challenges, we propose a machine learning-based method. We introduce "causal network motifs" and utilize transparent machine learning models to characterize network interference patterns underlying an A/B test on networks. Our method's performance has been demonstrated through simulations on both a synthetic experiment and a large-scale test on Instagram. Our experiments show that our approach outperforms conventional methods such as design-based cluster randomization and conventional analysis-based neighborhood exposure mapping. Our approach provides a comprehensive and automated solution to address network interference for A/B testing practitioners. This aids in informing strategic business decisions in areas such as marketing effectiveness and product customization.
- [34] arXiv:2309.04742 (replaced) [pdf, html, other]
-
Title: Affine Invariant Ensemble Transform Methods to Improve Predictive Uncertainty in Neural NetworksSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
We consider the problem of performing Bayesian inference for logistic regression using appropriate extensions of the ensemble Kalman filter. We propose two interacting particle systems that sample from an approximate posterior and prove quantitative convergence rates of these interacting particle systems to their mean-field limit as the number of particles tends to infinity. Furthermore, we apply these techniques and examine their effectiveness as methods of Bayesian approximation for quantifying predictive uncertainty in neural networks.
- [35] arXiv:2310.12806 (replaced) [pdf, html, other]
-
Title: DCSI -- An improved measure of cluster separability based on separation and connectednessSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Whether class labels in a given data set correspond to meaningful clusters is crucial for the evaluation of clustering algorithms using real-world data sets. This property can be quantified by separability measures. The central aspects of separability for density-based clustering are between-class separation and within-class connectedness, and neither classification-based complexity measures nor cluster validity indices (CVIs) adequately incorporate them. A newly developed measure (density cluster separability index, DCSI) aims to quantify these two characteristics and can also be used as a CVI. Extensive experiments on synthetic data indicate that DCSI correlates strongly with the performance of DBSCAN measured via the adjusted Rand index (ARI) but lacks robustness when it comes to multi-class data sets with overlapping classes that are ill-suited for density-based hard clustering. Detailed evaluation on frequently used real-world data sets shows that DCSI can correctly identify touching or overlapping classes that do not correspond to meaningful density-based clusters.
- [36] arXiv:2402.03220 (replaced) [pdf, html, other]
-
Title: The Benefits of Reusing Batches for Gradient Descent in Two-Layer Networks: Breaking the Curse of Information and Leap ExponentsComments: Accepted at the International Conference on Machine Learning (ICML), 2024Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We investigate the training dynamics of two-layer neural networks when learning multi-index target functions. We focus on multi-pass gradient descent (GD) that reuses the batches multiple times and show that it significantly changes the conclusion about which functions are learnable compared to single-pass gradient descent. In particular, multi-pass GD with finite stepsize is found to overcome the limitations of gradient flow and single-pass GD given by the information exponent (Ben Arous et al., 2021) and leap exponent (Abbe et al., 2023) of the target function. We show that upon re-using batches, the network achieves in just two time steps an overlap with the target subspace even for functions not satisfying the staircase property (Abbe et al., 2021). We characterize the (broad) class of functions efficiently learned in finite time. The proof of our results is based on the analysis of the Dynamical Mean-Field Theory (DMFT). We further provide a closed-form description of the dynamical process of the low-dimensional projections of the weights, and numerical experiments illustrating the theory.
- [37] arXiv:2402.05330 (replaced) [pdf, html, other]
-
Title: Classification under Nuisance Parameters and Generalized Label Shift in Likelihood-Free InferenceComments: 26 pages, 19 figures, code available at this https URLSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
An open scientific challenge is how to classify events with reliable measures of uncertainty, when we have a mechanistic model of the data-generating process but the distribution over both labels and latent nuisance parameters is different between train and target data. We refer to this type of distributional shift as generalized label shift (GLS). Direct classification using observed data $\mathbf{X}$ as covariates leads to biased predictions and invalid uncertainty estimates of labels $Y$. We overcome these biases by proposing a new method for robust uncertainty quantification that casts classification as a hypothesis testing problem under nuisance parameters. The key idea is to estimate the classifier's receiver operating characteristic (ROC) across the entire nuisance parameter space, which allows us to devise cutoffs that are invariant under GLS. Our method effectively endows a pre-trained classifier with domain adaptation capabilities and returns valid prediction sets while maintaining high power. We demonstrate its performance on two challenging scientific problems in biology and astroparticle physics with data from realistic mechanistic models.
- [38] arXiv:2402.07025 (replaced) [pdf, html, other]
-
Title: Generalization Error of Graph Neural Networks in the Mean-field RegimeComments: Accepted in ICML 2024Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
This work provides a theoretical framework for assessing the generalization error of graph neural networks in the over-parameterized regime, where the number of parameters surpasses the quantity of data points. We explore two widely utilized types of graph neural networks: graph convolutional neural networks and message passing graph neural networks. Prior to this study, existing bounds on the generalization error in the over-parameterized regime were uninformative, limiting our understanding of over-parameterized network performance. Our novel approach involves deriving upper bounds within the mean-field regime for evaluating the generalization error of these graph neural networks. We establish upper bounds with a convergence rate of $O(1/n)$, where $n$ is the number of graph samples. These upper bounds offer a theoretical assurance of the networks' performance on unseen data in the challenging over-parameterized regime and overall contribute to our understanding of their performance.
- [39] arXiv:2405.02783 (replaced) [pdf, html, other]
-
Title: Linear Noise Approximation Assisted Bayesian Inference on Mechanistic Model of Partially Observed Stochastic Reaction NetworkComments: 11 pages, 2 figuresSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
To support online learning of mechanisms and facilitate digital twin development for biomanufacturing processes, this paper develops an efficient Bayesian inference approach for partially observed enzymatic stochastic reaction networks (SRNs), a fundamental building block of multi-scale bioprocess mechanistic models. To tackle the critical challenges posed by mechanistic models based on nonlinear stochastic differential equations (SDEs) with partially observed states and measurement errors, an interpretable Bayesian updating linear noise approximation (LNA) metamodel, incorporating the structural information of the mechanistic model, is proposed to approximate the likelihood of observations. Then, an efficient posterior sampling approach is developed that uses the gradients of the derived likelihood to speed up the convergence of Markov Chain Monte Carlo (MCMC). The empirical study demonstrates that the proposed approach shows promising performance.
- [40] arXiv:2405.03180 (replaced) [pdf, html, other]
-
Title: Braced Fourier Continuation and Regression for Anomaly DetectionComments: 16 pages, 9 figures, associated Github link: this https URL -6/30/2024 update corrected and reworded erroneous figure references, minor typosSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
In this work, the concept of Braced Fourier Continuation and Regression (BFCR) is introduced. BFCR is a novel and computationally efficient means of finding nonlinear regressions or trend lines in arbitrary one-dimensional data sets. The Braced Fourier Continuation (BFC) and BFCR algorithms are first outlined, followed by a discussion of the properties of BFCR as well as demonstrations of how BFCR trend lines may be used effectively for anomaly detection both within and at the edges of arbitrary one-dimensional data sets. Finally, potential issues which may arise while using BFCR for anomaly detection as well as possible mitigation techniques are outlined and discussed. All source code and example data sets are either referenced or available via GitHub, and all associated code is written entirely in Python.
- [41] arXiv:2006.16202 (replaced) [pdf, html, other]
-
Title: Partitioned Least SquaresComments: To appear in Springer Machine Learning Journal (this https URL)Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In this paper we propose a variant of the linear least squares model allowing practitioners to partition the input features into groups of variables that they require to contribute similarly to the final result. The output allows practitioners to assess the importance of each group and of each variable in the group. We formally show that the new formulation is not convex and provide two alternative methods to deal with the problem: one non-exact method based on an alternating least squares approach; and one exact method based on a reformulation of the problem using an exponential number of sub-problems whose minimum is guaranteed to be the optimal solution. We formally show the correctness of the exact method and also compare the two solutions showing that the exact solution provides better results in a fraction of the time required by the alternating least squares solution (assuming that the number of partitions is small). For the sake of completeness, we also provide an alternative branch and bound algorithm that can be used in place of the exact method when the number of partitions is too large, and a proof of NP-completeness of the optimization problem introduced in this paper.
- [42] arXiv:2209.13694 (replaced) [pdf, html, other]
-
Title: Safe Linear Bandits over Unknown PolytopesComments: v3: Presented at COLT 2024Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
The safe linear bandit problem (SLB) is an online approach to linear programming with unknown objective and unknown roundwise constraints, under stochastic bandit feedback of rewards and safety risks of actions. We study the tradeoffs between efficacy and smooth safety costs of SLBs over polytopes, and the role of aggressive doubly-optimistic play in avoiding the strong assumptions made by extant pessimistic-optimistic approaches.
We first elucidate an inherent hardness in SLBs due to the lack of knowledge of constraints: there exist 'easy' instances, for which suboptimal extreme points have large 'gaps', but on which SLB methods must still incur $\Omega(\sqrt{T})$ regret or safety violations, due to an inability to resolve unknown optima to arbitrary precision. We then analyse a natural doubly-optimistic strategy for the safe linear bandit problem, DOSS, which uses optimistic estimates of both reward and safety risks to select actions, and show that despite the lack of knowledge of constraints or feasible points, DOSS simultaneously obtains tight instance-dependent $O(\log^2 T)$ bounds on efficacy regret, and $\tilde O(\sqrt{T})$ bounds on safety violations. Further, when safety is demanded to a finite precision, violations improve to $O(\log^2 T)$. These results rely on a novel dual analysis of linear bandits: we argue that DOSS proceeds by activating noisy versions of at least $d$ constraints in each round, which allows us to separately analyse rounds where a 'poor' set of constraints is activated, and rounds where 'good' sets of constraints are activated. The costs in the former are controlled to $O(\log^2 T)$ by developing new dual notions of gaps, based on global sensitivity analyses of linear programs, that quantify the suboptimality of each such set of constraints. The latter costs are controlled to $O(1)$ by explicitly analysing the solutions of optimistic play.
- [43] arXiv:2210.13193 (replaced) [pdf, other]
-
Title: Langevin dynamics based algorithm e-TH$\varepsilon$O POULA for stochastic optimization problems with discontinuous stochastic gradientSubjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR); Machine Learning (stat.ML)
We introduce a new Langevin dynamics based algorithm, called e-TH$\varepsilon$O POULA, to solve optimization problems with discontinuous stochastic gradients which naturally appear in real-world applications such as quantile estimation, vector quantization, CVaR minimization, and regularized optimization problems involving ReLU neural networks. We demonstrate both theoretically and numerically the applicability of the e-TH$\varepsilon$O POULA algorithm. More precisely, under the conditions that the stochastic gradient is locally Lipschitz in average and satisfies a certain convexity at infinity condition, we establish non-asymptotic error bounds for e-TH$\varepsilon$O POULA in Wasserstein distances and provide a non-asymptotic estimate for the expected excess risk, which can be controlled to be arbitrarily small. Three key applications in finance and insurance are provided, namely, multi-period portfolio optimization, transfer learning in multi-period portfolio optimization, and insurance claim prediction, which involve neural networks with (Leaky)-ReLU activation functions. Numerical experiments conducted using real-world datasets illustrate the superior empirical performance of e-TH$\varepsilon$O POULA compared to SGLD, TUSLA, ADAM, and AMSGrad in terms of model accuracy.
- [44] arXiv:2211.07484 (replaced) [pdf, html, other]
-
Title: Contextual Bandits with Packing and Covering Constraints: A Modular Lagrangian Approach via RegressionComments: A preliminary version of this paper, authored by A. Slivkins, K.A. Sankararaman and D.J. Foster, has been published at COLT 2023. The present version features an important improvement, due to Xingyu Zhou. Specifically, the $\sqrt{T}$-regret result in Theorem 3.6(a) holds under a much weaker assumption, and is now positioned as the main guaranteeSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We consider contextual bandits with linear constraints (CBwLC), a variant of contextual bandits in which the algorithm consumes multiple resources subject to linear constraints on total consumption. This problem generalizes contextual bandits with knapsacks (CBwK), allowing for packing and covering constraints, as well as positive and negative resource consumption. We provide the first algorithm for CBwLC (or CBwK) that is based on regression oracles. The algorithm is simple, computationally efficient, and statistically optimal under mild assumptions. Further, we provide the first vanishing-regret guarantees for CBwLC (or CBwK) that extend beyond the stochastic environment. We side-step strong impossibility results from prior work by identifying a weaker (and, arguably, fairer) benchmark to compare against. Our algorithm builds on LagrangeBwK (Immorlica et al., FOCS 2019), a Lagrangian-based technique for CBwK, and SquareCB (Foster and Rakhlin, ICML 2020), a regression-based technique for contextual bandits. Our analysis leverages the inherent modularity of both techniques.
- [45] arXiv:2301.13088 (replaced) [pdf, other]
-
Title: Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces II: non-compact symmetric spacesSubjects: Methodology (stat.ME); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Gaussian processes are arguably the most important class of spatiotemporal models within machine learning. They encode prior information about the modeled function and can be used for exact or approximate Bayesian learning. In many applications, particularly in physical sciences and engineering, but also in areas such as geostatistics and neuroscience, invariance to symmetries is one of the most fundamental forms of prior information one can consider. The invariance of a Gaussian process' covariance to such symmetries gives rise to the most natural generalization of the concept of stationarity to such spaces. In this work, we develop constructive and practical techniques for building stationary Gaussian processes on a very large class of non-Euclidean spaces arising in the context of symmetries. Our techniques make it possible to (i) calculate covariance kernels and (ii) sample from prior and posterior Gaussian processes defined on such spaces, both in a practical manner. This work is split into two parts, each involving different technical considerations: part I studies compact spaces, while part II studies non-compact spaces possessing certain structure. Our contributions make the non-Euclidean Gaussian process models we study compatible with well-understood computational techniques available in standard Gaussian process software packages, thereby making them accessible to practitioners.
- [46] arXiv:2310.11439 (replaced) [pdf, html, other]
-
Title: From Alexnet to Transformers: Measuring the Non-linearity of Deep Neural Networks with Affine Optimal Transport
Authors: Quentin Bouniot, Ievgen Redko, Anton Mallasto, Charlotte Laclau, Karol Arndt, Oliver Struckmeier, Markus Heinonen, Ville Kyrki, Samuel Kaski
Comments: Code available at this https URL
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
In the last decade, we have witnessed the introduction of several novel deep neural network (DNN) architectures exhibiting ever-increasing performance across diverse tasks. Explaining the upward trend of their performance, however, remains difficult as different DNN architectures of comparable depth and width -- common factors associated with their expressive power -- may exhibit a drastically different performance even when trained on the same dataset. In this paper, we introduce the concept of the non-linearity signature of DNN, the first theoretically sound solution for approximately measuring the non-linearity of deep neural networks. Built upon a score derived from closed-form optimal transport mappings, this signature provides a better understanding of the inner workings of a wide range of DNN architectures and learning paradigms, with a particular emphasis on the computer vision task. We provide extensive experimental results that highlight the practical usefulness of the proposed non-linearity signature and its potential for long-reaching implications. The code for our work is available at this https URL
- [47] arXiv:2311.13580 (replaced) [pdf, html, other]
-
Title: $\sigma$-PCA: a building block for neural learning of identifiable linear transformations
Comments: Update with published version
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Linear principal component analysis (PCA) learns (semi-)orthogonal transformations by orienting the axes to maximize variance. Consequently, it can only identify orthogonal axes whose variances are clearly distinct, but it cannot identify the subsets of axes whose variances are roughly equal. It cannot eliminate the subspace rotational indeterminacy: it fails to disentangle components with equal variances (eigenvalues), resulting, in each eigen subspace, in randomly rotated axes. In this paper, we propose $\sigma$-PCA, a method that (1) formulates a unified model for linear and nonlinear PCA, the latter being a special case of linear independent component analysis (ICA), and (2) introduces a missing piece into nonlinear PCA that allows it to eliminate, from the canonical linear PCA solution, the subspace rotational indeterminacy -- without whitening the inputs. Whitening, a preprocessing step which converts the inputs into unit-variance inputs, has generally been a prerequisite step for linear ICA methods, which meant that conventional nonlinear PCA could not necessarily preserve the orthogonality of the overall transformation, could not directly reduce dimensionality, and could not intrinsically order by variances. We offer insights on the relationship between linear PCA, nonlinear PCA, and linear ICA -- three methods with autoencoder formulations for learning special linear transformations from data, transformations that are (semi-)orthogonal for PCA, and arbitrary unit-variance for ICA. As part of our formulation, nonlinear PCA can be seen as a method that maximizes both variance and statistical independence, lying in the middle between linear PCA and linear ICA, serving as a building block for learning linear transformations that are identifiable.
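The subspace rotational indeterminacy described above follows from a short, standard calculation, sketched here. The symbols $\Sigma$ (data covariance), $W$ (a $d \times k$ matrix with orthonormal columns spanning the equal-variance subspace), $R$ and $\lambda$ are notation introduced for this illustration and are not taken from the paper.

```latex
% Assume k components share the same variance \lambda, so the covariance
% restricted to their eigen-subspace is \lambda I_k.
\[
  W^\top \Sigma W = \lambda I_k
  \quad\Longrightarrow\quad
  (WR)^\top \Sigma (WR) = R^\top (\lambda I_k) R = \lambda I_k
  \qquad \text{for every orthogonal } R \in \mathbb{R}^{k \times k}.
\]
```

Every rotated basis $WR$ therefore captures exactly the same variance as $W$, so a variance-maximizing objective alone cannot single out one set of axes within that subspace.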
- [48] arXiv:2312.02027 (replaced) [pdf, other]
-
Title: Stochastic Optimal Control Matching
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR); Machine Learning (stat.ML)
Stochastic optimal control, which has the goal of driving the behavior of noisy systems, is broadly applicable in science, engineering and artificial intelligence. Our work introduces Stochastic Optimal Control Matching (SOCM), a novel Iterative Diffusion Optimization (IDO) technique for stochastic optimal control that stems from the same philosophy as the conditional score matching loss for diffusion models. That is, the control is learned via a least squares problem by trying to fit a matching vector field. The training loss, which is closely connected to the cross-entropy loss, is optimized with respect to both the control function and a family of reparameterization matrices which appear in the matching vector field. The optimization with respect to the reparameterization matrices aims at minimizing the variance of the matching vector field. Experimentally, our algorithm achieves lower error than all the existing IDO techniques for stochastic optimal control for three out of four control problems, in some cases by an order of magnitude. The key idea underlying SOCM is the path-wise reparameterization trick, a novel technique that may be of independent interest. Code at this https URL
- [49] arXiv:2401.13665 (replaced) [pdf, other]
-
Title: Entrywise Inference for Missing Panel Data: A Simple and Instance-Optimal Approach
Subjects: Statistics Theory (math.ST); Econometrics (econ.EM); Methodology (stat.ME); Machine Learning (stat.ML)
Longitudinal or panel data can be represented as a matrix with rows indexed by units and columns indexed by time. We consider inferential questions associated with the missing data version of panel data induced by staggered adoption. We propose a computationally efficient procedure for estimation, involving only simple matrix algebra and singular value decomposition, and prove non-asymptotic and high-probability bounds on its error in estimating each missing entry. By controlling proximity to a suitably scaled Gaussian variable, we develop and analyze a data-driven procedure for constructing entrywise confidence intervals with pre-specified coverage. Despite its simplicity, our procedure turns out to be instance-optimal: we prove that the width of our confidence intervals matches a non-asymptotic instance-wise lower bound derived via a Bayesian Cramér-Rao argument. We illustrate the sharpness of our theoretical characterization on a variety of numerical examples. Our analysis is based on a general inferential toolbox for SVD-based algorithms applied to the matrix denoising model, which might be of independent interest.
- [50] arXiv:2402.02306 (replaced) [pdf, html, other]
-
Title: A flexible Bayesian g-formula for causal survival analyses with time-dependent confounding
Subjects: Methodology (stat.ME); Computation (stat.CO); Machine Learning (stat.ML)
In longitudinal observational studies with a time-to-event outcome, a common objective in causal analysis is to estimate the causal survival curve under hypothetical intervention scenarios within the study cohort. The g-formula is a particularly useful tool for this analysis. To enhance the traditional parametric g-formula approach, we developed a more adaptable Bayesian g-formula estimator, which incorporates the Bayesian additive regression trees (BART) in the modeling of the time-evolving generative components, aiming to mitigate bias due to model misspecification. Specifically, we introduce a more general class of g-formulas for discrete survival data that can incorporate the longitudinal balancing scores, which serve as an effective method for dimension reduction and are vital when dealing with an expanding array of time-varying confounders. The minimum sufficient formulation of these longitudinal balancing scores is linked to the nature of treatment regimes, whether static or dynamic. For each type of treatment regime, we provide posterior sampling algorithms grounded in the BART framework. We have conducted simulation studies to illustrate the empirical performance of the proposed method and further demonstrate its practical utility using data from the Yale New Haven Health System's (YNHHS) electronic health records.
- [51] arXiv:2402.16710 (replaced) [pdf, html, other]
-
Title: Cost Aware Best Arm Identification
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In this paper, we study a best arm identification problem with dual objectives. In addition to the classic reward, each arm is associated with a cost distribution, and the goal is to identify the arm with the largest reward using the minimum expected cost. We call it \emph{Cost Aware Best Arm Identification} (CABAI), which captures the separation of testing and implementation phases in product development pipelines and models the objective shift between phases, i.e., cost for testing and reward for implementation. We first derive a theoretical lower bound for CABAI and propose an algorithm called $\mathsf{CTAS}$ to match it asymptotically. To reduce the computation of $\mathsf{CTAS}$, we further propose a simple algorithm called \emph{Chernoff Overlap} (CO), based on a square-root rule, which we prove is optimal in simplified two-armed models and generalizes well in numerical experiments. Our results show that (i) ignoring the heterogeneous action cost results in sub-optimality in practice, and (ii) simple algorithms can deliver near-optimal performance over a wide range of problems.
- [52] arXiv:2403.01046 (replaced) [pdf, other]
-
Title: A Library of Mirrors: Deep Neural Nets in Low Dimensions are Convex Lasso Models with Reflection Features
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC); Machine Learning (stat.ML)
We prove that training neural networks on 1-D data is equivalent to solving a convex Lasso problem with a fixed, explicitly defined dictionary matrix of features. The specific dictionary depends on the activation and depth. We consider 2- and 3-layer networks with piecewise linear activations, and rectangular and tree networks with sign activation and arbitrary depth. Interestingly, in absolute value and symmetrized ReLU networks, a third layer creates features that represent reflections of training data about themselves. The Lasso representation sheds light on globally optimal networks and the solution landscape.
- [53] arXiv:2403.12975 (replaced) [pdf, other]
-
Title: Training morphological neural networks with gradient descent: some theoretical insights
Authors: Samy Blusseau (CMM)
Journal-ref: IAPR Third International Conference on Discrete Geometry and Mathematical Morphology, Andrea Frosini; Elena Barcucci; Elisa Pergola; Michela Ascolese; Niccolò Di Marco; Simone Rinaldi; Sara Brunetti; Giulia Palma; Veronica Gierrini; Leonardo Bindi, Apr 2024, Firenze, Italy. pp.229-241
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Morphological neural networks, or layers, can be a powerful tool to boost the progress in mathematical morphology, either on theoretical aspects such as the representation of complete lattice operators, or in the development of image processing pipelines. However, these architectures turn out to be difficult to train when they count more than a few morphological layers, at least within popular machine learning frameworks which use gradient descent based optimization algorithms. In this paper we investigate the potential and limitations of differentiation based approaches and back-propagation applied to morphological networks, in light of the non-smooth optimization concept of Bouligand derivative. We provide insights and first theoretical guidelines, in particular regarding initialization and learning rates.
- [54] arXiv:2403.17852 (replaced) [pdf, html, other]
-
Title: Counterfactual Fairness through Transforming Data Orthogonal to Bias
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Machine learning models have shown exceptional prowess in solving complex issues across various domains. However, these models can sometimes exhibit biased decision-making, resulting in unequal treatment of different groups. Despite substantial research on counterfactual fairness, methods to reduce the impact of multivariate and continuous sensitive variables on decision-making outcomes are still underdeveloped. We propose a novel data pre-processing algorithm, Orthogonal to Bias (OB), which is designed to eliminate the influence of a group of continuous sensitive variables, thus promoting counterfactual fairness in machine learning applications. Our approach, based on the assumption of a jointly normal distribution within a structural causal model (SCM), demonstrates that counterfactual fairness can be achieved by ensuring the data is orthogonal to the observed sensitive variables. The OB algorithm is model-agnostic, making it applicable to a wide range of machine learning models and tasks. Additionally, it includes a sparse variant to improve numerical stability through regularization. Empirical evaluations on both simulated and real-world datasets, encompassing settings with both discrete and continuous sensitive variables, show that our methodology effectively promotes fairer outcomes without compromising accuracy.
- [55] arXiv:2404.01216 (replaced) [pdf, html, other]
-
Title: Novel Node Category Detection Under Subpopulation Shift
Comments: Accepted to ECML-PKDD 2024
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
In real-world graph data, distribution shifts can manifest in various ways, such as the emergence of new categories and changes in the relative proportions of existing categories. It is often important to detect nodes of novel categories under such distribution shifts for safety or insight discovery purposes. We introduce a new approach, Recall-Constrained Optimization with Selective Link Prediction (RECO-SLIP), to detect nodes belonging to novel categories in attributed graphs under subpopulation shifts. By integrating a recall-constrained learning framework with a sample-efficient link prediction mechanism, RECO-SLIP addresses the dual challenges of resilience against subpopulation shifts and the effective exploitation of graph structure. Our extensive empirical evaluation across multiple graph datasets demonstrates the superior performance of RECO-SLIP over existing methods. The experimental code is available at this https URL.
- [56] arXiv:2405.03083 (replaced) [pdf, html, other]
-
Title: Causal K-Means Clustering
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
Causal effects are often characterized with population summaries. These might provide an incomplete picture when there are heterogeneous treatment effects across subgroups. Since the subgroup structure is typically unknown, it is more challenging to identify and evaluate subgroup effects than population effects. We propose a new solution to this problem: Causal k-Means Clustering, which harnesses the widely-used k-means clustering algorithm to uncover the unknown subgroup structure. Our problem differs significantly from the conventional clustering setup since the variables to be clustered are unknown counterfactual functions. We present a plug-in estimator which is simple and readily implementable using off-the-shelf algorithms, and study its rate of convergence. We also develop a new bias-corrected estimator based on nonparametric efficiency theory and double machine learning, and show that this estimator achieves fast root-n rates and asymptotic normality in large nonparametric models. Our proposed methods are especially useful for modern outcome-wide studies with multiple treatment levels. Further, our framework is extensible to clustering with generic pseudo-outcomes, such as partially observed outcomes or otherwise unknown functions. Finally, we explore finite sample properties via simulation, and illustrate the proposed methods in a study of treatment programs for adolescent substance abuse.
- [57] arXiv:2405.04011 (replaced) [pdf, html, other]
-
Title: Adjoint Sensitivity Analysis on Multi-Scale Bioprocess Stochastic Reaction Network
Comments: 11 pages, 2 figures
Subjects: Molecular Networks (q-bio.MN); Machine Learning (stat.ML)
Motivated by the pressing challenges in the digital twin development for biomanufacturing systems, we introduce an adjoint sensitivity analysis (SA) approach to expedite the learning of mechanistic model parameters. In this paper, we consider enzymatic stochastic reaction networks representing a multi-scale bioprocess mechanistic model that allows us to integrate disparate data from diverse production processes and leverage the information from existing macro-kinetic and genome-scale models. To support forward prediction and backward reasoning, we develop a convergent adjoint SA algorithm that studies how perturbations of model parameters and inputs (e.g., the initial state) propagate through enzymatic reaction networks and affect output trajectory predictions. This SA provides a sample-efficient and interpretable way to assess the sensitivities between inputs and outputs while accounting for their causal dependencies. Our empirical study underscores the resilience of these sensitivities and, through them, offers a deeper understanding of the regulatory mechanisms behind the bioprocess.
- [58] arXiv:2405.04715 (replaced) [pdf, html, other]
-
Title: Causality Pursuit from Heterogeneous Environments via Neural Adversarial Invariance Learning
Comments: 48 pages, 7 figures with appendix
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Pursuing causality from data is a fundamental problem in scientific discovery, treatment intervention, and transfer learning. This paper introduces a novel algorithmic method for addressing nonparametric invariance and causality learning in regression models across multiple environments, where the joint distribution of response variables and covariates varies, but the conditional expectations of outcome given an unknown set of quasi-causal variables are invariant. The challenge of finding such an unknown set of quasi-causal or invariant variables is compounded by the presence of endogenous variables that have heterogeneous effects across different environments; including even one of them in the regression would make the estimation inconsistent. The proposed Focused Adversarial Invariant Regularization (FAIR) framework utilizes an innovative minimax optimization approach that breaks down the barriers, driving regression models toward prediction-invariant solutions through adversarial testing. Leveraging the representation power of neural networks, FAIR neural networks (FAIR-NN) are introduced for causality pursuit. It is shown that FAIR-NN can find the invariant variables and quasi-causal variables under a minimal identification condition and that the resulting procedure is adaptive to low-dimensional composition structures in a non-asymptotic analysis. Under a structural causal model, variables identified by FAIR-NN represent pragmatic causality and provably align with exact causal mechanisms under conditions of sufficient heterogeneity. Computationally, FAIR-NN employs a novel Gumbel approximation with decreased temperature and a stochastic gradient descent ascent algorithm. The procedures are convincingly demonstrated using simulated and real-data examples.
- [59] arXiv:2405.05097 (replaced) [pdf, html, other]
-
Title: Biology-inspired joint distribution neurons based on Hierarchical Correlation Reconstruction allowing for multidirectional neural networks
Comments: 7 pages, 6 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Biological neural networks seem qualitatively superior (e.g. in learning, flexibility, robustness) to current artificial ones such as the Multi-Layer Perceptron (MLP) or Kolmogorov-Arnold Network (KAN). In contrast to these, they have fundamentally multidirectional signal propagation~\cite{axon}, including of probability distributions, e.g. for uncertainty estimation, and are believed to be unable to use standard backpropagation training~\cite{backprop}. Novel artificial neurons based on HCR (Hierarchical Correlation Reconstruction) are proposed that remove these low-level differences: each neuron contains a local joint distribution model (of its connections), representing the joint density on normalized variables as a linear combination of orthonormal polynomials $(f_\mathbf{j})$: $\rho(\mathbf{x})=\sum_{\mathbf{j}\in B} a_\mathbf{j} f_\mathbf{j}(\mathbf{x})$ for $\mathbf{x} \in [0,1]^d$ and $B$ some chosen basis, with basis growth approaching a complete description of the joint distribution. By various index summations of the $(a_\mathbf{j})$ tensor used as neuron parameters, we get simple formulas for, e.g., conditional expected values for propagation in any direction, like $E[x|y,z]$ and $E[y|x]$, which degenerate to a KAN-like parametrization when restricted to pairwise dependencies. Such an HCR network can also propagate probability distributions (including joint ones) like $\rho(y,z|x)$. It also allows for additional training approaches, like direct $(a_\mathbf{j})$ estimation, tensor decomposition, or more biologically plausible information bottleneck training: layers directly influence only their neighbors, optimizing content to maximize information about the next layer while minimizing information about the previous one to reduce noise.
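For an orthonormal basis, the coefficients in the expansion $\rho(\mathbf{x})=\sum_{\mathbf{j}} a_\mathbf{j} f_\mathbf{j}(\mathbf{x})$ satisfy $a_\mathbf{j} = E[f_\mathbf{j}(\mathbf{x})]$, so they can be estimated by sample averages. The sketch below shows this in one dimension only, assuming rescaled Legendre polynomials on $[0,1]$ up to degree 2 as the basis; the function names are chosen for this illustration and it is not the paper's neuron implementation.

```python
import math


def basis(j, x):
    """Orthonormal (rescaled Legendre) polynomials on [0, 1], degrees 0..2."""
    if j == 0:
        return 1.0
    if j == 1:
        return math.sqrt(3.0) * (2.0 * x - 1.0)
    if j == 2:
        return math.sqrt(5.0) * (6.0 * x * x - 6.0 * x + 1.0)
    raise ValueError("only degrees 0, 1, 2 are implemented in this sketch")


def fit_coefficients(samples, degree=2):
    """Estimate a_j as the sample mean of f_j(x), valid for orthonormal f_j."""
    n = len(samples)
    return [sum(basis(j, x) for x in samples) / n for j in range(degree + 1)]


def density(coeffs, x):
    """Evaluate rho(x) = sum_j a_j f_j(x); may dip below 0 for small bases."""
    return sum(a * basis(j, x) for j, a in enumerate(coeffs))


# Samples assumed already normalized to [0, 1].
coeffs = fit_coefficients([0.1, 0.2, 0.3, 0.4])
```

The zeroth coefficient is always 1, which keeps the estimated density normalized; the multivariate case would use products of such one-dimensional polynomials.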
- [60] arXiv:2405.19440 (replaced) [pdf, html, other]
-
Title: On the Convergence of Multi-objective Optimization under Generalized Smoothness
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Multi-objective optimization (MOO) is receiving more attention in various fields such as multi-task learning. Recent works provide some effective algorithms with theoretical analysis, but they are limited by the standard $L$-smooth or bounded-gradient assumptions, which are typically not satisfied by neural networks such as recurrent neural networks (RNNs) and transformers. In this paper, we study a more general and realistic class of $\ell$-smooth loss functions, where $\ell$ is a general non-decreasing function of the gradient norm. We develop two novel single-loop algorithms for $\ell$-smooth MOO problems, Generalized Smooth Multi-objective Gradient descent (GSMGrad) and its stochastic variant, Stochastic Generalized Smooth Multi-objective Gradient descent (SGSMGrad), which approximate the conflict-avoidant (CA) direction that maximizes the minimum improvement among objectives. We provide a comprehensive convergence analysis of both algorithms and show that they converge to an $\epsilon$-accurate Pareto stationary point with a guaranteed $\epsilon$-level average CA distance (i.e., the gap between the updating direction and the CA direction) over all iterations, where a total of $\mathcal{O}(\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-4})$ samples are needed in the deterministic and stochastic settings, respectively. Our algorithms can also guarantee a tighter $\epsilon$-level CA distance in each iteration using more samples. Moreover, we propose a practical variant of GSMGrad named GSMGrad-FA using only constant-level time and space, while achieving the same performance guarantee as GSMGrad. Our experiments validate our theory and demonstrate the effectiveness of the proposed methods.
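For intuition about a direction that maximizes the minimum improvement among objectives, the two-objective case has a well-known closed form: the minimum-norm convex combination of the two gradients, as used in MGDA-style methods. The sketch below shows only that classic special case under this assumption; it is not GSMGrad or SGSMGrad, and the function name is chosen for this illustration.

```python
def min_norm_direction(g1, g2):
    """Minimum-norm point w*g1 + (1-w)*g2 with w in [0, 1].

    Illustrative two-objective special case; a descent step would move
    along the negative of the returned vector.
    """
    diff = [a - b for a, b in zip(g1, g2)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        # Identical gradients: any weight gives the same direction.
        return list(g1), 1.0
    # Unconstrained minimizer of ||g2 + w*(g1 - g2)||^2, then clip to [0, 1].
    w = sum((b - a) * b for a, b in zip(g1, g2)) / denom
    w = min(1.0, max(0.0, w))
    direction = [w * a + (1.0 - w) * b for a, b in zip(g1, g2)]
    return direction, w


direction, weight = min_norm_direction([1.0, 0.0], [0.0, 1.0])
```

With the two orthogonal gradients in the usage line, the weights come out equal and the resulting direction has the same positive inner product with both gradients, so a small step along its negative decreases both objectives.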
- [61] arXiv:2406.00193 (replaced) [pdf, html, other]
-
Title: Learning topological states from randomized measurements using variational tensor network tomography
Authors: Yanting Teng, Rhine Samajdar, Katherine Van Kirk, Frederik Wilde, Subir Sachdev, Jens Eisert, Ryan Sweke, Khadijeh Najafi
Comments: 11+35 pages, 4+3 figures; Added additional references
Subjects: Quantum Physics (quant-ph); Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (stat.ML)
Learning faithful representations of quantum states is crucial to fully characterizing the variety of many-body states created on quantum processors. While various tomographic methods such as classical shadow and MPS tomography have shown promise in characterizing a wide class of quantum states, they face unique limitations in detecting topologically ordered two-dimensional states. To address this problem, we implement and study a heuristic tomographic method that combines variational optimization on tensor networks with randomized measurement techniques. Using this approach, we demonstrate its ability to learn the ground state of the surface code Hamiltonian as well as an experimentally realizable quantum spin liquid state. In particular, we perform numerical experiments using MPS ansätze and systematically investigate the sample complexity required to achieve high fidelities for systems of sizes up to $48$ qubits. In addition, we provide theoretical insights into the scaling of our learning algorithm by analyzing the statistical properties of maximum likelihood estimation. Notably, our method is sample-efficient and experimentally friendly, only requiring snapshots of the quantum state measured randomly in the $X$ or $Z$ bases. Using this subset of measurements, our approach can effectively learn any real pure states represented by tensor networks, and we rigorously prove that random-$XZ$ measurements are tomographically complete for such states.
- [62] arXiv:2406.04043 (replaced) [pdf, other]
-
Title: Energy-based Epistemic Uncertainty for Graph Neural Networks
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In domains with interdependent data, such as graphs, quantifying the epistemic uncertainty of a Graph Neural Network (GNN) is challenging as uncertainty can arise at different structural scales. Existing techniques neglect this issue or only distinguish between structure-aware and structure-agnostic uncertainty without combining them into a single measure. We propose GEBM, an energy-based model (EBM) that provides high-quality uncertainty estimates by aggregating energy at different structural levels that naturally arise from graph diffusion. In contrast to logit-based EBMs, we provably induce an integrable density in the data space by regularizing the energy function. We introduce an evidential interpretation of our EBM that significantly improves the predictive robustness of the GNN. Our framework is a simple and effective post hoc method applicable to any pre-trained GNN that is sensitive to various distribution shifts. It consistently achieves the best separation of in-distribution and out-of-distribution data on 6 out of 7 anomaly types while having the best average rank over shifts on \emph{all} datasets.
- [63] arXiv:2406.04824 (replaced) [pdf, html, other]
-
Title: FunBO: Discovering Acquisition Functions for Bayesian Optimization with FunSearch
Authors: Virginia Aglietti, Ira Ktena, Jessica Schrouff, Eleni Sgouritsa, Francisco J. R. Ruiz, Alan Malek, Alexis Bellot, Silvia Chiappa
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
The sample efficiency of Bayesian optimization algorithms depends on carefully crafted acquisition functions (AFs) guiding the sequential collection of function evaluations. The best-performing AF can vary significantly across optimization problems, often requiring ad-hoc and problem-specific choices. This work tackles the challenge of designing novel AFs that perform well across a variety of experimental settings. Based on FunSearch, a recent work using Large Language Models (LLMs) for discovery in mathematical sciences, we propose FunBO, an LLM-based method that can be used to learn new AFs written in computer code by leveraging access to a limited number of evaluations for a set of objective functions. We provide the analytic expression of all discovered AFs and evaluate them on various global optimization benchmarks and hyperparameter optimization tasks. We show how FunBO identifies AFs that generalize well in and out of the training distribution of functions, thus outperforming established general-purpose AFs and achieving competitive performance against AFs that are customized to specific function types and are learned via transfer-learning algorithms.
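To make concrete what an acquisition function written in computer code looks like, here is a minimal sketch of expected improvement (EI), a standard general-purpose AF. It assumes maximization and a Gaussian posterior at the candidate point; the function and argument names are chosen for this illustration, and this is not one of the AFs discovered by FunBO.

```python
import math


def expected_improvement(mean, std, best_so_far):
    """Expected improvement over best_so_far under N(mean, std**2).

    Illustrative sketch for a maximization problem.
    """
    improvement = mean - best_so_far
    if std <= 0.0:
        # No posterior uncertainty: improvement is deterministic.
        return max(improvement, 0.0)
    z = improvement / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    # Exploitation term plus exploration term.
    return improvement * cdf + std * pdf


score = expected_improvement(mean=0.0, std=1.0, best_so_far=0.0)
```

A Bayesian optimization loop would evaluate such a score at candidate points and query the objective where the score is largest.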
- [64] arXiv:2406.11011 (replaced) [pdf, html, other]
-
Title: Data Shapley in One Training Run
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts. However, existing approaches require re-training models on different data subsets, which is computationally intensive, foreclosing their application to large-scale models. Furthermore, they produce the same attribution score for any models produced by running the learning algorithm, meaning they cannot perform targeted attribution towards a specific model obtained from a single run of the algorithm. This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest. In its most efficient implementation, our technique incurs negligible additional runtime compared to standard model training. This dramatic efficiency improvement makes it possible to perform data attribution for the foundation model pretraining stage for the first time. We present several case studies that offer fresh insights into pretraining data's contribution and discuss their implications for copyright in generative AI and pretraining data curation.
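The cost of the existing retraining-based approaches mentioned above is visible in the exact Shapley definition, which evaluates a utility on every ordering of the data points. The sketch below computes it by brute force for a tiny dataset; `utility` is a hypothetical stand-in for "train a model on this subset and score it", and this is not the In-Run Data Shapley technique.

```python
from itertools import permutations


def data_shapley(points, utility):
    """Exact Shapley values by averaging marginal gains over all orderings.

    Illustrative brute force; `points` must be distinct and hashable, and
    the cost grows factorially with the number of points.
    """
    values = {p: 0.0 for p in points}
    orderings = list(permutations(points))
    for order in orderings:
        coalition = []
        previous = utility(coalition)
        for p in order:
            coalition.append(p)
            current = utility(coalition)
            # Marginal contribution of p given the points added before it.
            values[p] += current - previous
            previous = current
    return {p: v / len(orderings) for p, v in values.items()}


# Toy additive utility (hypothetical): each point contributes its own value.
shapley = data_shapley((1.0, 2.0, 4.0), utility=lambda subset: sum(subset))
```

With an additive utility each point's Shapley value equals its own contribution, which makes the toy case easy to check; a real utility would call the learning algorithm once per evaluated subset.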