-
Statistical constraints on climate model parameters using a scalable cloud-based inference framework
Authors:
James Carzon,
Bruno R. de Abreu,
Leighton Regayre,
Kenneth Carslaw,
Lucia Deaconu,
Philip Stier,
Hamish Gordon,
Mikael Kuusela
Abstract:
Atmospheric aerosols influence the Earth's climate, primarily by affecting cloud formation and scattering visible radiation. However, aerosol-related physical processes in climate simulations are highly uncertain. Constraining these processes could help improve model-based climate predictions. We propose a scalable statistical framework for constraining parameters in expensive climate models by comparing model outputs with observations. Using the C3.ai Suite, a cloud computing platform, we use a perturbed parameter ensemble of the UKESM1 climate model to efficiently train a surrogate model. A method for estimating a data-driven model discrepancy term is described. The strict bounds method is applied to quantify parametric uncertainty in a principled way. We demonstrate the scalability of this framework with two weeks' worth of simulated aerosol optical depth data over the South Atlantic and Central African region, written from the model every three hours and matched in time to twice-daily MODIS satellite observations. When constraining the model using real satellite observations, we establish constraints on combinations of two model parameters using much higher time-resolution outputs from the climate model than previous studies. This result suggests that, within the limits imposed by an imperfect climate model, potentially very powerful constraints may be achieved when our framework is scaled to the analysis of more observations and for longer time periods.
Submitted 20 May, 2023; v1 submitted 6 April, 2023;
originally announced April 2023.
-
Background Modeling for Double Higgs Boson Production: Density Ratios and Optimal Transport
Authors:
Tudor Manole,
Patrick Bryant,
John Alison,
Mikael Kuusela,
Larry Wasserman
Abstract:
We study the problem of data-driven background estimation, arising in the search of physics signals predicted by the Standard Model at the Large Hadron Collider. Our work is motivated by the search for the production of pairs of Higgs bosons decaying into four bottom quarks. A number of other physical processes, known as background, also share the same final state. The data arising in this problem is therefore a mixture of unlabeled background and signal events, and the primary aim of the analysis is to determine whether the proportion of unlabeled signal events is nonzero. A challenging but necessary first step is to estimate the distribution of background events. Past work in this area has determined regions of the space of collider events where signal is unlikely to appear, and where the background distribution is therefore identifiable. The background distribution can be estimated in these regions, and extrapolated into the region of primary interest using transfer learning with a multivariate classifier. We build upon this existing approach in two ways. First, we revisit this method by developing a customized residual neural network which is tailored to the structure and symmetries of collider data. Second, we develop a new method for background estimation, based on the optimal transport problem, which relies on modeling assumptions distinct from earlier work. These two methods can serve as cross-checks for each other in particle physics analyses, due to the complementarity of their underlying assumptions. We compare their performance on simulated double Higgs boson data.
Submitted 16 June, 2024; v1 submitted 4 August, 2022;
originally announced August 2022.
-
Uncertainty quantification for wide-bin unfolding: one-at-a-time strict bounds and prior-optimized confidence intervals
Authors:
Michael Stanley,
Pratik Patil,
Mikael Kuusela
Abstract:
Unfolding is an ill-posed inverse problem in particle physics aiming to infer a true particle-level spectrum from smeared detector-level data. For computational and practical reasons, these spaces are typically discretized using histograms, and the smearing is modeled through a response matrix corresponding to a discretized smearing kernel of the particle detector. This response matrix depends on the unknown shape of the true spectrum, leading to a fundamental systematic uncertainty in the unfolding problem. To handle the ill-posed nature of the problem, common approaches regularize the problem either directly via methods such as Tikhonov regularization, or implicitly by using wide-bins in the true space that match the resolution of the detector. Unfortunately, both of these methods lead to a non-trivial bias in the unfolded estimator, thereby hampering frequentist coverage guarantees for confidence intervals constructed from these methods. We propose two new approaches to addressing the bias in the wide-bin setting through methods called One-at-a-time Strict Bounds (OSB) and Prior-Optimized (PO) intervals. The OSB intervals are a bin-wise modification of an existing guaranteed-coverage procedure, while the PO intervals are based on a decision-theoretic view of the problem. Importantly, both approaches provide well-calibrated frequentist confidence intervals even in constrained and rank-deficient settings. These methods are built upon a more general answer to the wide-bin bias problem, involving unfolding with fine bins first, followed by constructing confidence intervals for linear functionals of the fine-bin counts. We test and compare these methods to other available methodologies in a wide-bin deconvolution example and a realistic particle physics simulation of unfolding a steeply falling particle spectrum.
Submitted 10 April, 2022; v1 submitted 1 November, 2021;
originally announced November 2021.
-
Spatio-temporal Local Interpolation of Global Ocean Heat Transport using Argo Floats: A Debiased Latent Gaussian Process Approach
Authors:
Beomjo Park,
Mikael Kuusela,
Donata Giglio,
Alison Gray
Abstract:
The world ocean plays a key role in redistributing heat in the climate system and hence in regulating Earth's climate. Yet statistical analysis of ocean heat transport suffers from partially incomplete large-scale data intertwined with complex spatio-temporal dynamics, as well as from potential model misspecification. We present a comprehensive spatio-temporal statistical framework tailored to interpolating the global ocean heat transport using in-situ Argo profiling float measurements. We formalize the statistical challenges using latent local Gaussian process regression accompanied by a two-stage fitting procedure. We introduce an approximate Expectation-Maximization algorithm to jointly estimate both the mean field and the covariance parameters, and refine the potentially under-specified mean field model with a debiasing procedure. This approach provides data-driven global ocean heat transport fields that vary in both space and time and can provide insights into crucial dynamical phenomena, such as El Niño & La Niña, as well as the global climatological mean heat transport field, which by itself is of scientific interest. The proposed framework and the Argo-based estimates are thoroughly validated with state-of-the-art multimission satellite products and shown to yield realistic subsurface ocean heat transport estimates.
Submitted 18 July, 2022; v1 submitted 20 May, 2021;
originally announced May 2021.
-
Model-Independent Detection of New Physics Signals Using Interpretable Semi-Supervised Classifier Tests
Authors:
Purvasha Chakravarti,
Mikael Kuusela,
Jing Lei,
Larry Wasserman
Abstract:
A central goal in experimental high energy physics is to detect new physics signals that are not explained by known physics. In this paper, we aim to search for new signals that appear as deviations from known Standard Model physics in high-dimensional particle physics data. To do this, we determine whether there is any statistically significant difference between the distribution of Standard Model background samples and the distribution of the experimental observations, which are a mixture of the background and a potential new signal. Traditionally, one also assumes access to a sample from a model for the hypothesized signal distribution. Here we instead investigate a model-independent method that does not make any assumptions about the signal and uses a semi-supervised classifier to detect the presence of the signal in the experimental data. We construct three test statistics using the classifier: an estimated likelihood ratio test (LRT) statistic, a test based on the area under the ROC curve (AUC), and a test based on the misclassification error (MCE). Additionally, we propose a method for estimating the signal strength parameter and explore active subspace methods to interpret the proposed semi-supervised classifier in order to understand the properties of the detected signal. We also propose a Score test statistic that can be used in the model-dependent setting. We investigate the performance of the methods on a simulated data set related to the search for the Higgs boson at the Large Hadron Collider at CERN. We demonstrate that the semi-supervised tests have power competitive with the classical supervised methods for a well-specified signal, but much higher power for an unexpected signal which might be entirely missed by the supervised tests.
Submitted 13 December, 2022; v1 submitted 15 February, 2021;
originally announced February 2021.
-
Objective frequentist uncertainty quantification for atmospheric CO$_2$ retrievals
Authors:
Pratik Patil,
Mikael Kuusela,
Jonathan Hobbs
Abstract:
The steadily increasing amount of atmospheric carbon dioxide (CO$_2$) is affecting the global climate system and threatening the long-term sustainability of Earth's ecosystem. In order to better understand the sources and sinks of CO$_2$, NASA operates the Orbiting Carbon Observatory-2 & 3 satellites to monitor CO$_2$ from space. These satellites make passive radiance measurements of the sunlight reflected off the Earth's surface in different spectral bands, which are then inverted in an ill-posed inverse problem to obtain estimates of the atmospheric CO$_2$ concentration. In this work, we propose a new CO$_2$ retrieval method that uses known physical constraints on the state variables and direct inversion of the target functional of interest to construct well-calibrated frequentist confidence intervals based on convex programming. We compare the method with the current operational retrieval procedure, which uses prior knowledge in the form of probability distributions to regularize the problem. We demonstrate that the proposed intervals consistently achieve the desired frequentist coverage, while the operational uncertainties are poorly calibrated in a frequentist sense both at individual locations and over a spatial region in a realistic simulation experiment. We also study the influence of specific nuisance state variables on the length of the proposed intervals and identify certain key variables that can greatly reduce the final uncertainty given additional deterministic or probabilistic constraints, and develop a principled framework to incorporate such information into our method.
Submitted 10 April, 2022; v1 submitted 29 July, 2020;
originally announced July 2020.
-
Locally stationary spatio-temporal interpolation of Argo profiling float data
Authors:
Mikael Kuusela,
Michael L. Stein
Abstract:
Argo floats measure seawater temperature and salinity in the upper 2,000 m of the global ocean. Statistical analysis of the resulting spatio-temporal dataset is challenging due to its nonstationary structure and large size. We propose mapping these data using locally stationary Gaussian process regression where covariance parameter estimation and spatio-temporal prediction are carried out in a moving-window fashion. This yields computationally tractable nonstationary anomaly fields without the need to explicitly model the nonstationary covariance structure. We also investigate Student-$t$ distributed fine-scale variation as a means to account for non-Gaussian heavy tails in ocean temperature data. Cross-validation studies comparing the proposed approach with the existing state-of-the-art demonstrate clear improvements in point predictions and show that accounting for the nonstationarity and non-Gaussianity is crucial for obtaining well-calibrated uncertainties. This approach also provides data-driven local estimates of the spatial and temporal dependence scales for the global ocean which are of scientific interest in their own right.
Submitted 28 December, 2018; v1 submitted 1 November, 2017;
originally announced November 2017.
-
Shape-constrained uncertainty quantification in unfolding steeply falling elementary particle spectra
Authors:
Mikael Kuusela,
Philip B. Stark
Abstract:
The high energy physics unfolding problem is an important statistical inverse problem in data analysis at the Large Hadron Collider (LHC) at CERN. The goal of unfolding is to make nonparametric inferences about a particle spectrum from measurements smeared by the finite resolution of the particle detectors. Previous unfolding methods use ad hoc discretization and regularization, resulting in confidence intervals that can have significantly lower coverage than their nominal level. Instead of regularizing using a roughness penalty or stopping iterative methods early, we impose physically motivated shape constraints: positivity, monotonicity, and convexity. We quantify the uncertainty by constructing a nonparametric confidence set for the true spectrum, consisting of all those spectra that satisfy the shape constraints and that predict the observations within an appropriately calibrated level of fit. Projecting that set produces simultaneous confidence intervals for all functionals of the spectrum, including averages within bins. The confidence intervals have guaranteed conservative frequentist finite-sample coverage in the important and challenging class of unfolding problems for steeply falling particle spectra. We demonstrate the method using simulations that mimic unfolding the inclusive jet transverse momentum spectrum at the LHC. The shape-constrained intervals provide usefully tight conservative inferences, while the conventional methods suffer from severe undercoverage.
Submitted 7 June, 2017; v1 submitted 2 December, 2015;
originally announced December 2015.
-
Statistical unfolding of elementary particle spectra: Empirical Bayes estimation and bias-corrected uncertainty quantification
Authors:
Mikael Kuusela,
Victor M. Panaretos
Abstract:
We consider the high energy physics unfolding problem where the goal is to estimate the spectrum of elementary particles given observations distorted by the limited resolution of a particle detector. This important statistical inverse problem arising in data analysis at the Large Hadron Collider at CERN consists in estimating the intensity function of an indirectly observed Poisson point process. Unfolding typically proceeds in two steps: one first produces a regularized point estimate of the unknown intensity and then uses the variability of this estimator to form frequentist confidence intervals that quantify the uncertainty of the solution. In this paper, we propose forming the point estimate using empirical Bayes estimation which enables a data-driven choice of the regularization strength through marginal maximum likelihood estimation. Observing that neither Bayesian credible intervals nor standard bootstrap confidence intervals succeed in achieving good frequentist coverage in this problem due to the inherent bias of the regularized point estimate, we introduce an iteratively bias-corrected bootstrap technique for constructing improved confidence intervals. We show using simulations that this enables us to achieve nearly nominal frequentist coverage with only a modest increase in interval length. The proposed methodology is applied to unfolding the $Z$ boson invariant mass spectrum as measured in the CMS experiment at the Large Hadron Collider.
Submitted 17 November, 2015; v1 submitted 18 May, 2015;
originally announced May 2015.
-
Empirical Bayes unfolding of elementary particle spectra at the Large Hadron Collider
Authors:
Mikael Kuusela,
Victor M. Panaretos
Abstract:
We consider the so-called unfolding problem in experimental high energy physics, where the goal is to estimate the true spectrum of elementary particles given observations distorted by measurement error due to the limited resolution of a particle detector. This is an important statistical inverse problem arising in the analysis of data at the Large Hadron Collider at CERN. Mathematically, the problem is formalized as one of estimating the intensity function of an indirectly observed Poisson point process. Particle physicists are particularly keen on unfolding methods that feature a principled way of choosing the regularization strength and allow for the quantification of the uncertainty inherent in the solution. Though there are many approaches that have been considered by experimental physicists, it can be argued that few -- if any -- of these deal with these two key issues in a satisfactory manner. In this paper, we propose to attack the unfolding problem within the framework of empirical Bayes estimation: we consider Bayes estimators of the coefficients of a basis expansion of the unknown intensity, using a regularizing prior; and employ a Monte Carlo expectation-maximization algorithm to find the marginal maximum likelihood estimate of the hyperparameter controlling the strength of the regularization. Due to the data-driven choice of the hyperparameter, credible intervals derived using the empirical Bayes posterior lose their subjective Bayesian interpretation. Since the properties and meaning of such intervals are poorly understood, we explore instead the use of bootstrap resampling for constructing purely frequentist confidence bands for the true intensity. The performance of the proposed methodology is demonstrated using both simulations and real data from the Large Hadron Collider.
Submitted 31 January, 2014;
originally announced January 2014.
-
Semi-Supervised Anomaly Detection - Towards Model-Independent Searches of New Physics
Authors:
Mikael Kuusela,
Tommi Vatanen,
Eric Malmi,
Tapani Raiko,
Timo Aaltonen,
Yoshikazu Nagai
Abstract:
Most classification algorithms used in high energy physics fall under the category of supervised machine learning. Such methods require a training set containing both signal and background events and are prone to classification errors should this training data be systematically inaccurate for example due to the assumed MC model. To complement such model-dependent searches, we propose an algorithm based on semi-supervised anomaly detection techniques, which does not require a MC training sample for the signal data. We first model the background using a multivariate Gaussian mixture model. We then search for deviations from this model by fitting to the observations a mixture of the background model and a number of additional Gaussians. This allows us to perform pattern recognition of any anomalous excess over the background. We show by a comparison to neural network classifiers that such an approach is a lot more robust against misspecification of the signal MC than supervised classification. In cases where there is an unexpected signal, a neural network might fail to correctly identify it, while anomaly detection does not suffer from such a limitation. On the other hand, when there are no systematic errors in the training data, both methods perform comparably.
Submitted 16 April, 2012; v1 submitted 14 December, 2011;
originally announced December 2011.
-
Soft Classification of Diffractive Interactions at the LHC
Authors:
Mikael Kuusela,
Eric Malmi,
Risto Orava,
Tommi Vatanen
Abstract:
Multivariate machine learning techniques provide an alternative to the rapidity gap method for event-by-event identification and classification of diffraction in hadron-hadron collisions. Traditionally, such methods assign each event exclusively to a single class, producing classification errors in overlap regions of data space. As an alternative to this so-called hard classification approach, we propose estimating posterior probabilities of each diffractive class and using these estimates to weigh event contributions to physical observables. It is shown with a Monte Carlo study that such a soft classification scheme is able to reproduce observables such as multiplicity distributions and relative event rates with a much higher accuracy than hard classification.
Submitted 30 December, 2010;
originally announced January 2011.