Unifying Simulation and Inference with Normalizing Flows

Haoxing Du, Department of Physics, University of California, Berkeley, CA 94720, USA
Claudius Krause, Institut für Theoretische Physik, Universität Heidelberg, Philosophenweg 12, 69120 Heidelberg, Germany; Institute of High Energy Physics (HEPHY), Austrian Academy of Sciences (OeAW), Georg-Coch-Platz 2, A-1010 Vienna, Austria
Vinicius Mikuni, National Energy Research Scientific Computing Center, Berkeley Lab, Berkeley, CA 94720, USA
Benjamin Nachman, Physics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA; Berkeley Institute for Data Science, University of California, Berkeley, CA 94720, USA
Ian Pang, NHETC, Department of Physics and Astronomy, Rutgers University, Piscataway, NJ 08854, USA
David Shih, NHETC, Department of Physics and Astronomy, Rutgers University, Piscataway, NJ 08854, USA
Abstract

There have been many applications of deep neural networks to detector calibrations and a growing number of studies that propose deep generative models as automated fast detector simulators. We show that these two tasks can be unified by using maximum likelihood estimation (MLE) from conditional generative models for energy regression. Unlike direct regression techniques, the MLE approach is prior-independent and non-Gaussian resolutions can be determined from the shape of the likelihood near the maximum. Using an ATLAS-like calorimeter simulation, we demonstrate this concept in the context of calorimeter energy calibration.

preprint: HEPHY-ML-24-01

I Introduction

Detector calibrations are one of the most important and foundational tasks in experimental physics. In particle and nuclear physics, the largest calibration step is usually a simulation-based correction to ensure that the reported properties of the measured particles are unbiased. [Footnote 1: The residual correction to account for differences between data and simulation is usually smaller. Methods in this paper may also be applicable to data-based corrections in certain cases.] Detector simulations are complex and accurate, but can only run forward in time and the resolutions are non-trivial. Combined, these properties mean that we cannot simply invert the simulation to predict true quantities given measured ones.

This problem is particularly acute for highly segmented detectors, where many individual channels are activated for a single particle and the measured phase space can be high dimensional and complex. A common example of this setting is calorimeter reconstruction. Unlike tracking detectors that aim to minimally disrupt the trajectory of a particle, calorimeters are designed to stop particles, and the resulting showers inside dense materials are highly stochastic and challenging to parse. Traditional calibration methods are optimized using relatively low-dimensional summary statistics. Such calibrations have enabled many science results, but the ultimate precision cannot be reached until we utilize all of the available low-level information for event reconstruction.

Machine learning provides a set of tools that can analyze hadronic final states holistically to achieve the best precision. Recent examples supporting this claim include the energy calibrations of single hadrons [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], jets [12, 13, 14, 15, 16, 17, 18, 19, 20, 21], and global event properties [22, 23, 24, 25] at colliders. These techniques can automatically make use of finer segmentation to improve the resolution of reconstructed particle energies [4, 10]. While machine learning has broader utility than reconstructing individual hadron showers within a calorimeter, we focus on this case because it is particularly challenging and a critical component of event reconstruction at the Large Hadron Collider and elsewhere.

While existing machine learning methods are promising, most approaches have at least two undesirable features: they are only point estimates and are prior-dependent. Firstly, most approaches produce a single estimate without quantifying ‘uncertainties’. In particle/nuclear physics, the spread of the difference between the inferred estimates and the true value is usually called the resolution, even though it is part of Uncertainty Quantification (UQ) in the broader machine learning literature. Secondly, techniques that employ standard machine learning regression tools are not universal: their performance depends on the distribution of examples (the ‘prior’, e.g. uniform in energy) used during training. In fact, this prior dependence often results in a large calibration bias.

Over the last few years, some solutions to these two central challenges have been proposed. Generalized numerical inversion [26, 27, 28] is a prior-independent calibration approach, although it does not scale well to many output dimensions. The Gaussian Ansatz [17, 18] is a maximum likelihood estimator, and thus prior-independent (more in Sec. II). For UQ, loss functions can be modified to estimate quantiles in addition to just the mean, median or mode [13, 14], and deep generative models can estimate the exact resolution function [24, 25]. The Gaussian Ansatz produces an estimate of the resolution and so is currently the only method that is both prior-independent and goes beyond a point estimate. We note that it is also possible to use Bayesian versions [29, 30, 31, 32, 33] of neural networks to go beyond point estimates by providing the uncertainty of the output. However, the uncertainty would still be prior-dependent.

In this paper, we propose a prior-independent method for detector calibrations that produces per-shower resolution estimates. The approach is based on deep generative models with access to the explicit likelihood. In particular, we use normalizing flows [36, 37, 38, 39, 40, 41]. [Footnote 2: It is also possible to extract the likelihood in many dimensions from a diffusion model [34] or, if the inputs are discretized, from an autoregressive model [35].] These machine learning models are invertible functions with a tractable Jacobian, so that they can be used for both sampling and density estimation. The core idea behind a normalizing flow is that one starts with a simple random variable with a known probability density (e.g. a Gaussian) and then applies a series of transformations, each with a tractable Jacobian. This results in a complex probability density that can also be computed via the change of variables formula. The probability density can be made conditional by letting each transformation depend on the conditional quantity. Our approach serves as an alternative to other methods based on maximum likelihood estimation.
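For concreteness, the change of variables formula mentioned above can be written out as follows. The symbols are chosen here for illustration and are not the notation of the rest of the paper: $u_0$ is the base variable with known density $p_U$, $c$ is the conditioning quantity, and the flow is a composition of $K$ invertible maps $f_k$ with $u_k = f_k(u_{k-1}; c)$ and $u_K = x$.

```latex
\begin{align}
  f &= f_K \circ \cdots \circ f_1 , \qquad x = f(u_0; c) , \\
  p_X(x \mid c) &= p_U\!\left(f^{-1}(x; c)\right)
    \left| \det \frac{\partial f^{-1}(x; c)}{\partial x} \right| , \\
  \log p_X(x \mid c) &= \log p_U(u_0)
    - \sum_{k=1}^{K} \log \left| \det
      \frac{\partial f_k(u_{k-1}; c)}{\partial u_{k-1}} \right| .
\end{align}
```

Sampling runs the maps forward from a draw of $u_0$, while density estimation runs them in reverse from $x$ and accumulates the log-determinants, which is why a tractable Jacobian for every $f_k$ is required.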

Normalizing flows are state-of-the-art as calorimeter simulation surrogate models [42, 43, 44, 45, 46, 47, 48, 49]. Additionally, normalizing flows have shown similarly good performance in other tasks in high-energy physics [50, 51, 52, 53, 54, 55, 56, 33, 57, 58, 59, 60, 61, 62, 24, 63, 64, 65, 66, 25, 67, 68, 69, 70, 71, 72]. In this paper, we unify simulation and inference by showing how models trained for generation can be reused for calibration. As a demonstration of this approach, we use CaloFlow [42, 43], a normalizing flow-based calorimeter surrogate model. This machine learning model is trained on single-pion showers from an extended version [73] of the CaloGAN dataset [74, 75, 76] that now includes both a sampling calorimeter setup [77] and a hadronic component. The calibration task is to predict the incident pion energy given the distribution of energies recorded in the cells of the calorimeter. Importantly, we demonstrate the following key advantages of our approach:

  1. Zero-shot calibration: Normalizing flow-based models trained for surrogate modelling can be reused for calibration by repurposing the probability density as a likelihood, without any additional model retraining.

  2. Access to per-shower resolution: With access to the complete likelihood, we can compute the exact resolution function. This allows us to obtain per-shower resolution estimates, which give us additional information about how close the predicted energy is to the true energy. Furthermore, it enables us to estimate asymmetries in the resolution function.

  3. Less biased calibration: We achieve a smaller calibration bias compared to a typical direct regression approach for the calorimeter setup considered in this work.

This paper is organized as follows. Machine learning-based calibration methods are introduced in Sec. II, where we also provide precise definitions of the bias and resolution used in this paper. In Sec. III, we summarize the CaloFlow algorithm and discuss the results of the calorimeter example, which demonstrate the key advantages listed above. The paper ends with conclusions and outlook in Sec. IV. We collect some details on the network architectures used in an appendix.

II Methods

Given samples from a forward model (usually a physics-based simulation) $X \sim p_{X|Z}(x|z)$ for measured values $X$ and true values $Z$, the goal of calibration is to estimate $Z$ from $X$. In our notation, capital letters represent random variables and lower case letters represent realizations of those random variables. Both the measured and true values can be multidimensional, although we will focus on the common case where $Z$ is one dimensional and $X \in \mathbb{R}^{N}$.

The usual assumption when deriving a simulation-based calibration is that the detector response is correct and universal, i.e. $p_{X|Z}^{\text{train}}(x|z) = p_{X|Z}^{\text{test}}(x|z)$ for different physics processes (i.e. different $p(z)$) or for actual data.

There are two important quantities related to calibration performance: the bias and the resolution.

  1. The bias of a calibration is the deviation between the central tendency of the inferred estimate $\hat{z}$ and the true reference value $z$. Any measure of central tendency can be used to measure closure, such as the median or mode. In this paper, we will focus on the mode of the inferred estimates given a fixed $z$ value. [Footnote 3: Usually bias refers to the mean, but experimentalists typically calibrate to the mode.] To compute the mode in practice, for a given $z$, we look at $K$ measured values $x_1, x_2, \dots, x_K$ corresponding to that $z$. Each measured value $x_i$ undergoes calibration to derive an inferred estimate $\hat{z}_i$. The mode is subsequently determined based on the collection of inferred estimates $\{\hat{z}_1, \hat{z}_2, \dots, \hat{z}_K\}$ (see steps in Sec. III.3).

  2. The resolution is the spread of the difference between the inferred estimates and the true value. In this work, we have two relevant definitions:

    • Full resolution: For fixed $z$, it is defined as half of the 68% confidence interval about the mode determined based on a collection of inferred estimates.

    • Per-event resolution: For each $x$, it is defined as half of the 68% confidence interval of the corresponding maximum likelihood estimate (see Sec. II.2).

    The full resolution is not to be confused with the per-event resolution. The full resolution is defined for a collection of inferred estimates, whereas the per-event resolution is defined for each inferred estimate.
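The bias and full resolution defined above can be illustrated with a minimal sketch. It is not the procedure of Sec. III.3: the estimates are toy Gaussian draws, the function name is hypothetical, the mode is taken as the maximum of a Gaussian KDE on a uniform grid, and the "68% interval about the mode" is assumed to be symmetric about the mode, which is only one possible convention.

```python
import numpy as np
from scipy.stats import gaussian_kde


def mode_and_full_resolution(z_hat, n_grid=1001, level=0.68):
    """Return (mode, half-width) for a collection of inferred estimates.

    The mode is the maximum of a Gaussian KDE evaluated on a uniform grid.
    The half-width is the smallest w such that a fraction `level` of the
    estimates lies within [mode - w, mode + w] (an assumed convention).
    """
    z_hat = np.asarray(z_hat, dtype=float)
    grid = np.linspace(z_hat.min(), z_hat.max(), n_grid)
    density = gaussian_kde(z_hat)(grid)
    mode = grid[np.argmax(density)]
    half_width = np.quantile(np.abs(z_hat - mode), level)
    return mode, half_width


# Toy stand-in for calibrated estimates at a fixed true value z = 50.
rng = np.random.default_rng(0)
z_true = 50.0
z_hat = rng.normal(loc=z_true, scale=2.0, size=5000)

mode, resolution = mode_and_full_resolution(z_hat)
bias = mode - z_true
print(f"mode = {mode:.2f}, bias = {bias:.2f}, full resolution = {resolution:.2f}")
```

For these Gaussian toy estimates the half-width should come out close to the input standard deviation; for skewed distributions a symmetric interval and a shortest interval would differ, so the convention matters.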

II.1 Direct Regression

Most proposals for deep learning-based calibration directly regress Z𝑍Zitalic_Z from X𝑋Xitalic_X using a loss function like the mean-squared error (MSE):

\begin{equation}
L[f] = \sum_{i} \left(f_{\rm MSE}(x_i) - z_i\right)^2 \,, \tag{1}
\end{equation}

for a neural network $f_{\rm MSE}$. Using the calculus of variations, one can show that with enough training data and a sufficiently flexible neural network architecture and training protocol, the function that minimizes Eq. 1 is the average value of $Z$ given $X = x$:

\begin{align}
f_{\rm MSE}(x) &= \langle Z \,|\, X = x \rangle \tag{2} \\
&= \int dz \, z \, p_{Z|X}^{\text{train}}(z|x) \tag{3} \\
&= \int dz \, z \, p_{X|Z}^{\text{train}}(x|z) \, \frac{p_{Z}^{\text{train}}(z)}{p_{X}^{\text{train}}(x)} \,. \tag{4}
\end{align}

Note that $f_{\rm MSE}$ only gives a point estimate of $z$ for each $x$ and does not have access to per-event resolutions. In other words, there is no calibration uncertainty quantified by $f_{\rm MSE}$.

For a given $z$, the bias can be computed as the deviation between $z$ and the mode of a collection of $f_{\rm MSE}(x_i)$, and the full resolution at $z$ is then computed based on the 68% confidence interval about the mode. Note that the bias is defined as a function of $z$. [Footnote 4: Alternatively, defining the bias as a function of the inferred values would result in the bias being dependent on the prior used during testing (i.e. $p^{\rm test}_{Z}(z)$). This bias also cannot be corrected since we do not know a priori what $p^{\rm test}_{Z}(z)$ is used during calibration.] Hence, we cannot correct for this bias as we do not have access to $z$ while performing the calibration.

The challenge with Eq. 4 is that it depends on $p_{Z}^{\text{train}}$ even when we assume that the detector response is universal. Unsurprisingly then, the bias of direct regression also depends on the training dataset. In Ref. [18], the prior dependence of MSE-based calibration is shown to be the main source of large calibration bias.
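The prior dependence in Eq. 4 can be made concrete with a small numerical sketch. Everything in it is an illustrative assumption rather than the setup of this paper: a one-dimensional Gaussian response of width 10 (in arbitrary units), true values on [1, 100], and a single observed value. The same response is combined with a uniform and a log-uniform training prior, and the ideal MSE solution (the posterior mean) is evaluated for each.

```python
import numpy as np

# Toy setup (all numbers are illustrative assumptions): true values z in
# [1, 100] and a Gaussian detector response p(x|z) = N(x; z, sigma).
z = np.linspace(1.0, 100.0, 9901)
sigma = 10.0


def posterior_mean(x, prior):
    """E[Z | X = x] from Eq. 4 on a uniform z grid.

    On a uniform grid the spacing dz and the evidence p(x) cancel in the
    ratio, so unnormalized likelihood and prior values are sufficient.
    """
    likelihood = np.exp(-0.5 * ((x - z) / sigma) ** 2)
    weights = likelihood * prior
    return np.sum(z * weights) / np.sum(weights)


uniform_prior = np.ones_like(z)   # p(z) constant
log_uniform_prior = 1.0 / z       # p(z) proportional to 1/z

x_obs = 50.0
mean_uniform = posterior_mean(x_obs, uniform_prior)
mean_log_uniform = posterior_mean(x_obs, log_uniform_prior)
print(f"uniform: {mean_uniform:.2f}, log-uniform: {mean_log_uniform:.2f}")
```

Although the detector response is identical in both cases, the log-uniform prior pulls the regression target towards lower values, so two ideally trained MSE regressors would return different estimates for the same measurement.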

II.2 Maximum Likelihood Inference

Maximum likelihood estimation (MLE) involves finding the $Z$ that maximizes the likelihood of the data,

\begin{equation}
\hat{Z} = \underset{z}{\text{argmax}} \; p_{X|Z}(X|z) \,. \tag{5}
\end{equation}

Since we assume that $p_{X|Z}$ is universal and Eq. 5 only depends on this likelihood, $\hat{Z}$ is universal. We can go beyond the point estimate and use the likelihood function $p_{X|Z}$ to obtain the per-event resolution for each maximum likelihood estimate $\hat{z}$. The access to per-event resolutions is a major advantage of MLE over direct regression in calibration. Examples showcasing this advantage are discussed in Sec. III.4.

Similar to the direct regression case, the bias and full resolution at a fixed $z$ value can be defined based on the mode and 68% confidence interval about the mode.

Although the MLE-based calibration is always prior-independent, prior independence does not guarantee that the method is always unbiased. Only in certain scenarios (e.g. a 1D Gaussian noise model such that $p_{X|Z}(x|z) \sim N(z, \sigma)$) can the MLE calibration be shown to be unbiased. Certain detector responses $p_{X|Z}$ may result in biased calibration. In general, prior independence is a necessary but insufficient condition for unbiased calibration. Even so, we show in Sec. III.3 that MLE-based calibration results in a smaller bias compared to direct regression for an ATLAS-like calorimeter setup. This suggests that in our application, the dependence on the training prior is a bigger contributing factor to the bias than the detector response.

A further challenge with MLE is that we usually do not know $p_{X|Z}$ explicitly. Nevertheless, we are able to sample from this conditional density by running a simulation. In this work, we instead aim to learn the entire likelihood using a neural network, an approach sometimes referred to as neural likelihood estimation in the simulation-based inference literature [78, 79, 80, 81]. This is substantially more work, but we observe that the work may already be done: given a fast simulation based on neural networks with access to the likelihood, we can use it for calibration in addition to generation without requiring any additional retraining.

Our tool of choice is the normalizing flow (NF). NFs are neural networks that are optimized using maximum likelihood estimation. With access to an estimate of the full likelihood, we can study both Gaussian and non-Gaussian aspects of the resolution.
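A minimal sketch of how a point estimate and an asymmetric per-event interval can be read off a likelihood is given below. It does not use a normalizing flow: the likelihood is a hypothetical closed-form toy (a Gaussian response whose width grows like the square root of $z$), the maximum is found by a simple grid scan, and the 68% interval is taken as the region where $-2\Delta\log p \leq 1$, a common convention that is assumed here rather than taken from this paper.

```python
import numpy as np


def log_likelihood(x, z, a=0.5):
    """Toy log p(x|z): Gaussian response with z-dependent width a*sqrt(z)."""
    sigma = a * np.sqrt(z)
    return -np.log(sigma) - 0.5 * ((x - z) / sigma) ** 2 - 0.5 * np.log(2 * np.pi)


def mle_with_interval(x, z_grid, threshold=1.0):
    """Grid-scan MLE and the interval where -2*Delta(log L) <= threshold.

    Assumes the scanned likelihood is unimodal, so the grid points below
    the threshold form a single interval.
    """
    nll2 = -2.0 * log_likelihood(x, z_grid)
    delta = nll2 - nll2.min()
    z_hat = z_grid[np.argmin(nll2)]
    inside = z_grid[delta <= threshold]
    return z_hat, inside.min(), inside.max()


z_grid = np.linspace(1.0, 100.0, 9901)  # 0.01 spacing
z_hat, lo, hi = mle_with_interval(50.0, z_grid)
per_event_resolution = 0.5 * (hi - lo)
print(f"z_hat = {z_hat:.2f}, interval = [{lo:.2f}, {hi:.2f}]")
print(f"lower error = {z_hat - lo:.2f}, upper error = {hi - z_hat:.2f}")
```

Because the width of this toy response depends on $z$, the scanned curve is not a parabola and the upper and lower errors differ, which is the kind of non-Gaussian information a point estimate alone cannot provide. With a trained flow, `log_likelihood` would be replaced by the model's log-density evaluated at fixed $x$ over a range of $z$.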

III Regressing particle incident energy

In this section, we compare the performance of direct regression and NF in regressing the particle incident energy from the resulting calorimeter shower information.

III.1 Dataset

The events are $\pi^{+}$ calorimeter showers from a new version [73] of the CaloGAN dataset, which we generated with Geant4 [82, 83, 84] for this study. We generated 100k showers with $\pi^{+}$ incident energy uniformly distributed in the range [1, 100] GeV. Additionally, we generated 100k showers with incident energy log-uniformly distributed in the same range. The calorimeter setup used in the original dataset only included the electromagnetic calorimeter (ECAL). In this new dataset, we also include the hadronic calorimeter (HCAL), which is positioned behind the ECAL. Also, the original dataset included energy contributions from both active and inactive calorimeter layers, whereas this new sampling calorimeter dataset only includes energy contributions from the active layers, as would be available in practice.

The ECAL is a three-layer sampling calorimeter cube with 480 mm side-length that is inspired by the ATLAS liquid argon (LAr) electromagnetic calorimeter [85]. For the ECAL, the active material is LAr and the absorber material is lead (Pb). Note that the ECAL used here is identical to the one used in Ref. [77]. However, photon (instead of $\pi^{+}$) showers were considered in that study. The HCAL is a three-layer sampling calorimeter cube with 2000 mm side-length located behind the ECAL. For the HCAL, the active material is LAr and the absorber material is tungsten (W). The sampling fractions of the ECAL and HCAL are $\sim$20% and $\sim$1.3%, respectively. The calorimeter showers are represented as three-dimensional images that are binned in position space.

In this representation, the calorimeter shower geometry is made up of voxels (volumetric pixels), and the details of the calorimeter voxel dimensions are included in Table 1. The energy distribution in the calorimeter for uniformly distributed $\pi^{+}$ incident energies is shown in Fig. 1. Note that the actual energy deposited in the sampling calorimeter is usually much smaller than the incident energy. Hence, it is common practice to rescale the deposited energy based on the sampling fraction to ensure that the dataset contains deposited energy $E_{\rm dep}$ that is close to the incident energy $E_{\rm inc}$. [Footnote 5: In this work, the rescaling is done by dividing the actual energy deposited in the ECAL or HCAL by their respective sampling fractions.] For this reason, it is possible for some showers to have $E_{\rm dep}/E_{\rm inc} > 1$, as shown in Fig. 1.

[Figure 1: eight panels, images not reproduced.]
Figure 1: Generated versus reference energy distributions in our new ECAL+HCAL calorimeter setup for uniformly distributed incident energies $E_{\rm inc}$. The energy deposited in layer $i$ is denoted by $E_i$ and the energy deposited in the calorimeter is denoted by $E_{\rm dep} = \sum_{i=0}^{5} E_i$. The ratio of energy deposited in the calorimeter and the incident energy of the particle is denoted by $E_{\text{dep}}/E_{\text{inc}}$.
| Calorimeter | Layer index | $z$ length (mm) | $\eta$ length (mm) | $\phi$ length (mm) | Number of voxels |
|-------------|-------------|-----------------|--------------------|--------------------|------------------|
| ECAL        | 0           | 90              | 5                  | 160                | 3 × 96           |
| ECAL        | 1           | 347             | 40                 | 40                 | 12 × 12          |
| ECAL        | 2           | 43              | 80                 | 40                 | 12 × 6           |
| HCAL        | 3           | 375             | 20.83              | 666.67             | 3 × 96           |
| HCAL        | 4           | 667             | 166.67             | 166.67             | 12 × 12          |
| HCAL        | 5           | 958             | 333.33             | 166.67             | 12 × 6           |

Table 1: Dimensions of a calorimeter voxel. The positive $z$-axis (radial direction in full detector) is the direction of particle propagation, the $\eta$ direction is along the proton beam axis, and $\phi$ is perpendicular to $z$ and $\eta$. For the number of voxels, the first (second) number is the number of bins in the $\phi$ ($\eta$) direction (e.g., 12 × 6 refers to 12 $\phi$ bins and 6 $\eta$ bins).

III.2 Calorimeter Shower simulation with Normalizing Flows

CaloFlow [42, 43] is an approach to fast calorimeter simulation based on conditional normalizing flows. In the context of fast calorimeter simulation, CaloFlow is designed to generate the voxel level shower energies $\vec{\mathcal{I}}$ conditioned on the corresponding incident energies of the showers $E_{\text{inc}}$, with the conditional density denoted by $p(\vec{\mathcal{I}}|E_{\text{inc}})$. In particular, it uses a two-flow approach to perform this task: Flow-I is formulated to learn the probability density $p_1(\vec{E}|E_{\text{inc}})$ of calorimeter layer energies $\vec{E}$ conditioned on the incident energy $E_{\text{inc}}$, while Flow-II is designed to learn the probability density of the normalized voxel level shower energies conditioned on the calorimeter layer energies, $p_2(\vec{\mathcal{I}}|\vec{E}, E_{\text{inc}})$. [Footnote 6: The layer energy of a given calorimeter layer is the sum of all the voxel energies in that layer.]

For the regression task discussed in this section, we found that having layer energy information was sufficient and adding voxel level information did not significantly increase the regression performance. This is consistent with Ref. [10], which found that increasing longitudinal segmentation is more effective than increasing transverse segmentation (equivalent to adding voxel level information) in improving the energy resolution. As we solely used $\vec{E}$ as input, only Flow-I of CaloFlow was required to perform the calibration task.

In our present dataset, $\vec{E} = (E_0, E_1, E_2, E_3, E_4, E_5)$ is six-dimensional, twice the dimensionality found in previous versions of the dataset. Hence, some minor modifications were made to the original CaloFlow, primarily to deal with the different dimensionality of the new dataset. Nevertheless, the main CaloFlow algorithm remains the same. The details of the architecture and training are outlined in Appendix A. It is known that using an ensemble of generative models generally helps to improve generation performance at the distribution level (e.g. Ref. [86]) and would ideally be done for the task of fast calorimeter surrogate modelling. Hence, we opted to train two ensembles of 10 independent Flow-I models each. In particular, 10 models are trained on uniformly distributed $E_{\rm inc}$, and the other 10 models are trained on log-uniformly distributed $E_{\rm inc}$.

When generating samples with an ensemble of models, each model in the ensemble is used to generate an equal fraction of the total number of samples. We show in Fig. 1 that an ensemble of Flow-I models is able to generate energy distributions that agree reasonably well with the reference distributions with some slight mismatch in tails and at places with sharp changes in the reference distributions.

Once CaloFlow is trained, it can be repurposed for the calibration of particle incident energy without any additional retraining. For each shower, the predicted particle incident energy corresponds to the incident energy which maximizes the likelihood function estimated by CaloFlow. The “full” likelihood function of the ensemble of Flow-I models is taken to be the mean of the likelihood functions from each of the models in the ensemble. For the remainder of this paper, this “full” likelihood is what we use to perform the MLE-based calibration.
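Since flows typically return log-densities, the mean of the likelihoods over the ensemble is most safely computed in log space. The sketch below shows one way to do this with `scipy.special.logsumexp`; the function name, array layout, and the numerical values are hypothetical and stand in for per-model log-likelihoods evaluated on a common grid of candidate incident energies.

```python
import numpy as np
from scipy.special import logsumexp


def ensemble_log_likelihood(log_p):
    """Log of the mean likelihood over the models of an ensemble.

    log_p has shape (n_models, n_points): per-model log-likelihoods
    evaluated on a common grid of candidate incident energies.
    """
    log_p = np.asarray(log_p, dtype=float)
    return logsumexp(log_p, axis=0) - np.log(log_p.shape[0])


# Hypothetical log-likelihood values from two models on a three-point grid.
log_p = np.log([[0.2, 0.5, 0.1],
                [0.4, 0.3, 0.1]])
log_p_full = ensemble_log_likelihood(log_p)

# Index of the grid point that maximizes the ensemble likelihood.
e_pred_index = int(np.argmax(log_p_full))
```

Averaging likelihoods rather than log-likelihoods corresponds to treating the ensemble as an equal-weight mixture, and the log-sum-exp form avoids underflow when the individual densities are very small.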

III.3 Prior-independent and less biased calibration

Using the dataset described in Sec. III.1 to train our models, we predict the incident energy of the incoming pion $E_{\text{inc}}$ given the energy deposited in the six calorimeter layers $\vec{E}$. To test the prior dependence of these methods, we also compared their performance when trained on showers with uniformly distributed $E_{\text{inc}}$ versus log-uniformly distributed $E_{\text{inc}}$ in the range [1, 100] GeV.

Our chosen direct regression model is a dense neural network (DNN) with two fully connected hidden layers, each comprising 256 nodes with ReLU activation functions. The direct regression DNN is implemented in PyTorch and optimized with Adam [87] using a batch size of 200 and 200 epochs. We found that using an ensemble of DNNs helped to reduce the uncertainty of the calibration. As for CaloFlow, we opted to have two ensembles, each with 10 independent DNNs. In particular, 10 models are trained on uniformly distributed $E_{\rm inc}$, and the other 10 models are trained on log-uniformly distributed $E_{\rm inc}$. For each shower, the predicted incident energy of the ensemble is then taken as the mean predicted incident energy of all the individual DNNs in the ensemble. Each DNN has $\sim$134k model parameters, while each NF has $\sim$107k model parameters.
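The quoted DNN size can be cross-checked with simple arithmetic. The sketch below assumes a plain fully connected network with six inputs (the layer energies) and one output (the incident energy); these sizes and the helper name are assumptions and are not taken from the authors' code.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected network.

    layer_sizes lists the width of every layer, input and output included.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))


# Assumed sizes: six layer energies in, one incident energy out.
two_wide_layers = mlp_param_count([6, 256, 256, 1])
three_wide_layers = mlp_param_count([6, 256, 256, 256, 1])
print(two_wide_layers, three_wide_layers)
```

Under these assumptions, two 256-node layers give 67,841 parameters and three give 133,633. The latter is close to the quoted $\sim$134k, which may indicate that the count includes a first 256-node layer in addition to the two hidden layers, although the exact convention cannot be determined from the text alone.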

In Fig. 2, we show examples of the likelihood evaluated by CaloFlow for three calorimeter showers that originate from a 30 GeV pion, a 60 GeV pion and a 90 GeV pion, respectively. Specifically, the difference between $-2\log p(\vec{E}|E_{\rm inc})$ and $-2\log p_{\rm max}\equiv-2\log p(\vec{E}|E_{\rm inc})|_{E_{\rm inc}=E_{\rm pred}}$ is plotted for a range of $E_{\text{inc}}$, where $E_{\rm pred}$ is the MLE prediction from CaloFlow. Note that the value of $E_{\rm pred}$ corresponds to the minimum of the blue curve in each of the three plots in Fig. 2.

To evaluate the bias, the mode of the $E_{\rm pred}/E_{\rm true}$ distribution has to be computed for each fixed $E_{\rm true}$. In this study, the mode is estimated using kernel density estimation (KDE), performed with the scipy.stats.gaussian_kde [88] function, and the uncertainty of the estimate is determined by bootstrapping [89]. To obtain the mode of the $E_{\rm pred}/E_{\rm true}$ distribution at each fixed $E_{\rm true}$, we took the following steps:

  1. Draw with replacement $N$ samples from $N$ values of $E_{\rm pred}$, where $N$ is the number of showers in the evaluation dataset for a given fixed $E_{\rm true}$.

  2. Perform kernel density estimation of the drawn samples with kernel bandwidth determined using Scott’s rule [90].

  3. Identify the position of the mode of the estimated density.

  4. Repeat steps 1-3 for a total of 20 times.

  5. Compute the mean and standard deviation of the 20 estimated values of the mode.
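The bootstrap steps listed above can be sketched as follows. This is a minimal illustration rather than the code used for the results in this paper: the function name, the use of a regular grid to locate the KDE maximum, the grid size, and the fixed random seed are all assumptions. Only the use of scipy.stats.gaussian_kde with Scott’s rule and the 20 bootstrap repetitions come from the text.

```python
import numpy as np
from scipy.stats import gaussian_kde


def bootstrap_kde_mode(ratios, n_boot=20, n_grid=512, seed=0):
    """Estimate the mode of E_pred / E_true and its bootstrap uncertainty.

    ratios: 1D array of E_pred / E_true at one fixed E_true.
    n_grid and seed are illustrative choices, not values from the paper.
    """
    ratios = np.asarray(ratios, dtype=float)
    rng = np.random.default_rng(seed)
    modes = np.empty(n_boot)
    for b in range(n_boot):
        # Step 1: draw N samples with replacement from the N values.
        sample = rng.choice(ratios, size=ratios.size, replace=True)
        # Step 2: Gaussian KDE with bandwidth set by Scott's rule.
        kde = gaussian_kde(sample, bw_method="scott")
        # Step 3: locate the mode on a regular grid spanning the sample.
        grid = np.linspace(sample.min(), sample.max(), n_grid)
        modes[b] = grid[np.argmax(kde(grid))]
    # Steps 4-5: mean and standard deviation over the repetitions.
    return modes.mean(), modes.std()
```

The grid resolution limits how precisely the mode can be located, so `n_grid` should be chosen fine enough that the grid spacing is well below the bootstrap uncertainty.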

Figure 2: Plot of likelihood evaluated using CaloFlow for a given calorimeter shower originating from a 30 GeV pion (left), 60 GeV pion (middle) and 90 GeV pion (right), respectively. In each plot, the true incident energy $E_{\rm true}$ is shown by the vertical black line, and the likelihood evaluated by the NF is shown in blue. The lower and upper bounds of the 68% confidence interval about the MLE are shown by the red vertical dashed lines.

The mean of the multiple mode estimates is taken to be the final mode estimate at each fixed $E_{\rm true}$. The corresponding standard deviation is taken to be the uncertainty of the final mode estimate. The mode estimation is performed for both the uniform and log-uniform cases. In Fig. 3, we show the mode of the $E_{\rm pred}/E_{\rm true}$ distribution at $E_{\rm true}\in\{10,20,30,40,50,60,70,80,90\}$ GeV for the different regression methods. Here we used nine evaluation datasets, each with 100k $\pi^{+}$ showers at one of the nine fixed $E_{\rm true}$ values.

Figure 3: Plot of the mode of the $E_{\rm pred}/E_{\rm true}$ distribution, denoted by “Mode$(E_{\rm pred}/E_{\rm true})$”, and the corresponding uncertainty at different fixed values of $E_{\rm true}\in\{10,20,30,40,50,60,70,80,90\}$ GeV. Results for models trained on (log-)uniformly distributed $E_{\text{inc}}$ are shown in solid (dashed) lines.

We observe that the flow-based calibration is generally less biased (closer to unity) than the MSE-based calibration. The larger bias of the MSE-based calibration is clearly seen at the edges of the $E_{\rm true}$ range. Furthermore, we see that the MSE-based calibration corresponding to the log-uniform prior is generally more biased than the calibration corresponding to the uniform prior at intermediate energies.

The prior independence of the flow-based calibration is confirmed by the consistent biases of CaloFlow trained on uniform and log-uniform priors. For the MSE-based calibration, there is prior dependence at many of the $E_{\rm true}$ values used in the evaluation, as seen from the difference in the biases corresponding to the uniform and log-uniform priors. For CaloFlow, the origin of the dip at $E_{\rm true}=60$ GeV is unclear. However, we observe that the biases of CaloFlow trained on uniform and log-uniform priors still agree with each other.

III.4 Resolution estimation

MLE-based calibration methods, such as the NF and the Gaussian Ansatz, go beyond a point estimate by also providing per-event resolutions (i.e. per-shower resolutions in the calorimeter context). This gives MLE-based calibration an important advantage over direct regression. Unlike the Gaussian Ansatz, the NF is also a generative model, which is a notable advantage: a NF used for surrogate modeling can subsequently be repurposed for calibration tasks without additional retraining.

Using CaloFlow, we perform per-shower resolution estimation by computing the two $E_{\rm inc}$ values corresponding to the lower and upper bounds of the 68% confidence interval of the MLE prediction. Equivalently, these are the two $E_{\rm inc}$ values for which $-2\log p+2\log p_{\rm max}=1$, as shown by the vertical dashed lines in Fig. 2. The per-shower resolution prediction is then the average difference of the two $E_{\rm inc}$ values from $E_{\rm pred}$. The full resolution is computed as half of the 68% confidence interval about the mode of the $E_{\rm pred}/E_{\rm true}$ distribution. In Fig. 4, we compare the mean predicted per-shower resolution and the full resolution at each of the nine fixed $E_{\rm true}$ values from the NF. The results in this section were produced using the NF trained on showers with uniformly distributed incident energies. We observe that the predicted resolution from the NF closely matches the full resolution. We also found the uncertainty in the determination of the full resolution to be small and almost unnoticeable in Fig. 4.

Although not shown in Fig. 4, we found that the MSE-based calibration obtains a full resolution that closely matches that from the NF. However, it is unable to provide per-shower resolutions.
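The per-shower interval construction described above can be sketched as follows, assuming that $-2\log p(\vec{E}|E_{\rm inc})$ has been evaluated on a grid of $E_{\rm inc}$ values for a single shower. The function names and the linear interpolation between grid points are illustrative assumptions; the text only specifies the condition $-2\log p+2\log p_{\rm max}=1$ and the averaging of the two distances from $E_{\rm pred}$.

```python
import numpy as np


def interval_68(e_grid, neg2logl):
    """Return (E_pred, E_low, E_up), where E_low and E_up are the points at
    which -2 log p rises by 1 above its minimum on either side of E_pred."""
    e_grid = np.asarray(e_grid, dtype=float)
    delta = np.asarray(neg2logl, dtype=float) - np.min(neg2logl)
    i_min = int(np.argmin(delta))

    def crossing(indices):
        # Walk away from the minimum until delta reaches 1, then
        # interpolate linearly between the two bracketing grid points.
        for i_prev, i in zip(indices[:-1], indices[1:]):
            if delta[i] >= 1.0:
                t = (1.0 - delta[i_prev]) / (delta[i] - delta[i_prev])
                return e_grid[i_prev] + t * (e_grid[i] - e_grid[i_prev])
        return np.nan  # no crossing inside the scanned range

    lower = crossing(list(range(i_min, -1, -1)))
    upper = crossing(list(range(i_min, len(e_grid))))
    return e_grid[i_min], lower, upper


def per_shower_resolution(e_pred, lower, upper):
    """Average distance of the two interval bounds from E_pred."""
    return 0.5 * ((e_pred - lower) + (upper - e_pred))
```

For a likelihood that is Gaussian in $E_{\rm inc}$ with width $\sigma$, this construction returns $E_{\rm pred}\pm\sigma$, so the per-shower resolution reduces to $\sigma$. A `nan` bound signals that the scan range should be widened.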

Figure 4: Plot of full resolution (solid) and mean predicted per-shower resolution (dashed) at $E_{\rm true}\in\{10,20,30,40,50,60,70,80,90\}$ GeV.

Next we show that the per-shower resolution is a reliable guide to the calibration accuracy. In Fig. 5, we plot 2D histograms of $\left|E_{\rm pred}/E_{\rm true}-1\right|$ versus the per-shower resolution at three different $E_{\rm true}$ values. We observe that showers with small per-shower resolution tend to have small $\left|E_{\rm pred}/E_{\rm true}-1\right|$, while those with large per-shower resolutions appear to be evenly distributed across different values of $\left|E_{\rm pred}/E_{\rm true}-1\right|$. In other words, a smaller per-shower resolution is indicative of a more accurately calibrated shower. For $E_{\rm true}=20$ GeV, there is a drop in the per-shower resolution for $\left|E_{\rm pred}/E_{\rm true}-1\right|\gtrsim 0.75$, where the NF is relatively confident that the shower originates from an $E_{\rm inc}>E_{\rm true}$. Nevertheless, we note that the per-shower resolution in this range of $\left|E_{\rm pred}/E_{\rm true}-1\right|$ is still larger than that at low values of $\left|E_{\rm pred}/E_{\rm true}-1\right|$.

Figure 5: 2D histograms of $\left|E_{\rm pred}/E_{\rm true}-1\right|$ versus per-shower resolution at $E_{\rm true}=20$ GeV (left), $E_{\rm true}=50$ GeV (middle) and $E_{\rm true}=80$ GeV (right).

Since the NF is able to access the complete likelihood, it can capture non-Gaussian components of the resolution. We quantify this by defining an asymmetry observable based on the upper and lower bounds of the 68% confidence interval of the MLE prediction. In particular, the asymmetry is computed as the ratio $\frac{\Delta_{L}}{\Delta_{U}}$, where $\Delta_{L}$ ($\Delta_{U}$) is the difference between the lower (upper) bound of the 68% confidence interval and the MLE.

Footnote 8: In principle, the Gaussian Ansatz [17, 18] should also be able to capture non-Gaussian components of the resolution. However, this has yet to be shown in a worked example. Previous works [18, 91] only attempted to compute the Gaussian resolution. Also, the Gaussian Ansatz is trained in a different way (see Ref. [17]) compared to the NF and is more sensitive to hyperparameter choices.
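As a small worked illustration of the asymmetry observable, the sketch below computes $\Delta_{L}/\Delta_{U}$ for arrays of showers. The function name is hypothetical, and we assume that both $\Delta_{L}$ and $\Delta_{U}$ are taken as positive distances from the MLE, which the text does not state explicitly.

```python
import numpy as np


def resolution_asymmetry(e_pred, e_low, e_up):
    """Ratio Delta_L / Delta_U of the lower and upper distances between the
    68% interval bounds and the MLE (both taken as positive distances)."""
    e_pred = np.asarray(e_pred, dtype=float)
    delta_l = e_pred - np.asarray(e_low, dtype=float)
    delta_u = np.asarray(e_up, dtype=float) - e_pred
    return delta_l / delta_u
```

For example, a shower with $E_{\rm pred}=50$ GeV and interval bounds of 44 GeV and 54 GeV has $\Delta_{L}=6$ GeV and $\Delta_{U}=4$ GeV, giving an asymmetry of 1.5, while a symmetric (Gaussian-like) likelihood gives exactly 1.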

An example is included in Fig. 6, which shows the per-shower resolution asymmetry predicted by the NF for 100k showers with $E_{\rm true}=50$ GeV. We observe that $\frac{\Delta_{L}}{\Delta_{U}}$ is largely greater than unity. This implies that for most showers, lower $E_{\rm inc}$ values tend to have larger likelihoods. While one might suspect that this resolution asymmetry is merely an artifact of the flow, we observe in Fig. 7 that the distribution of $E_{\rm inc}$ at fixed vertical slices of $E_{\rm dep}$ is also asymmetric towards lower values of $E_{\rm inc}$. The particular shape of the 2D histogram in Fig. 7 is due to the nature of the $\pi^{+}$ showers. At fixed $E_{\rm inc}$, fully electromagnetic events will have larger $E_{\rm dep}$, since all the energy is deposited via electromagnetic interactions. On the other hand, fully hadronic events will have significantly lower $E_{\rm dep}$ due to energy carried away invisibly by particles like neutrons and neutrinos produced in nuclear interactions. The $\pi^{+}$ showers considered in this work contain both electromagnetic and hadronic components, so on a shower-by-shower basis, $E_{\rm dep}$ can range anywhere between these two extremes of a fully electromagnetic shower and a fully hadronic shower [92]. This results in a tail at high energies in the $E_{\rm dep}$ distribution for fixed $E_{\rm inc}$, presumably because the typical $\pi^{+}$ shower is mostly hadronic in nature. Thinking of $p(E_{\rm dep}|E_{\rm inc})$ as a proxy for the likelihood suggests that the resolution asymmetry is not merely an artifact of the flow but a genuine characteristic to be expected.

Figure 6: Histogram of asymmetry $\frac{\Delta_{L}}{\Delta_{U}}$ predicted by CaloFlow for 100k showers with $E_{\rm true}=50$ GeV.

Figure 7: 2D histogram of the deposited energy $E_{\rm dep}$ versus $E_{\rm inc}$ for 100k showers with uniformly distributed $E_{\rm inc}$. A solid line is drawn across the mode of $E_{\rm inc}$ at each $E_{\rm dep}$ to guide the eye.

IV Conclusions

Using CaloFlow as an example, we show that normalizing flow-based deep generative models can be repurposed for prior-independent detector calibration.

We compared the calibration performance of CaloFlow and a DNN direct regression model on $\pi^{+}$ calorimeter showers from our new sampling calorimeter dataset [73], and found that the direct regression method is clearly more biased than the flow-based calibration method. This highlights the advantage of utilizing prior-independent MLE-based calibration for tasks such as particle energy regression.

We demonstrate a second advantage of MLE-based calibration methods over direct regression by estimating per-shower resolutions. The average estimated per-shower resolution obtained from CaloFlow closely aligns with the full resolution. In contrast, the direct regression approach yields only point estimates and lacks the capacity to make per-shower resolution predictions. For CaloFlow, we found that smaller per-shower resolutions tend to coincide with more accurately calibrated showers. This is evidence that the per-shower resolutions from CaloFlow are reliable and have the potential to be leveraged for improved calibration. Notably, utilizing a NF for calibration grants access to the complete resolution function. This characteristic enables the NF to capture asymmetries in the resolution. In future work, it would be interesting to study how the per-shower resolution information obtained by the NF can be utilized to further improve the calibration.

Data and code availability

The datasets used in this study can be found at Ref. [73] and the software to generate these datasets is located at https://github.com/hep-lbdl/CaloGAN/tree/two_layer. The machine learning software is at https://github.com/Ian-Pang/regression_with_CF.

Acknowledgements

We would like to thank Rikab Gambhir and Jesse Thaler for the helpful discussions related to the Gaussian Ansatz model and for feedback on the manuscript. CK would like to thank the Baden-Württemberg-Stiftung for financing through the program Internationale Spitzenforschung, project Uncertainties – Teaching AI its Limits (BWST_IF2020-010). IP and DS are supported by the U.S. Department of Energy (DOE), Office of Science grant DOE-SC0010008 and HD, VM, and BN are supported by the DOE under contract DE-AC02-05CH11231.

Appendix A Architecture and training

Here we briefly describe the architecture and training procedure used for CaloFlow (see Refs. [42, 43] for more details). There are some differences compared to the implementation in the original CaloFlow papers [42, 43], but most of the main algorithm remains the same. One main difference is that only Flow-I is used in this study.

The flows used in this work are Masked Autoregressive Flows (MAFs) [93] with compositions of Rational Quadratic Splines (RQS) [94] as transformations. The RQS transformations are parameterized using neural networks known as MADE blocks [95]. Identical flow architectures are used in each of the two cases with uniformly and log-uniformly distributed $E_{\rm inc}$. Each flow consists of six MADE blocks, each with two hidden layers of 64 nodes. The RQS transformations are defined with 8 bins and a tail bound of 14.

The incident energy of the incoming pion is preprocessed as

$$E_{\text{inc}}\to\log_{10}\left(E_{\text{inc}}/10\ \text{GeV}\right)\,. \qquad (6)$$

The layer energies are preprocessed as

$$E_{i}\to 2\left(\log_{10}(E_{i}+1\ \text{keV})-1\right)\,. \qquad (7)$$

The index $i$ denotes the layer number. In the original CaloFlow, a different preprocessing was used for the layer energies $E_{i}$ in Flow-I, where the $E_{i}$ were transformed to unit-space (see [42]).
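The two preprocessing maps in Eqs. (6) and (7) can be written compactly as below. This is a minimal sketch with hypothetical function names. We assume here that $E_{\text{inc}}$ is supplied in GeV and that the layer energies $E_{i}$ are supplied in keV, so that the 1 keV offset becomes a plain `+ 1.0`; the unit convention of the actual implementation is not stated in the text.

```python
import numpy as np


def preprocess_incident_energy(e_inc_gev):
    """Eq. (6): E_inc -> log10(E_inc / 10 GeV), with E_inc given in GeV."""
    return np.log10(np.asarray(e_inc_gev, dtype=float) / 10.0)


def preprocess_layer_energies(e_layers_kev):
    """Eq. (7): E_i -> 2 (log10(E_i + 1 keV) - 1), with E_i given in keV
    (assumed unit convention)."""
    return 2.0 * (np.log10(np.asarray(e_layers_kev, dtype=float) + 1.0) - 1.0)
```

With these conventions, the training range of 1 to 100 GeV for $E_{\text{inc}}$ maps onto the interval $[-1, 1]$, and an empty layer ($E_{i}=0$) maps to $-2$ before noise is added.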

Uniform noise in the range [0, 0.1] keV was applied to the voxel energies during training and evaluation. The same range was used for the pion dataset in [44]. The range of uniform noise used in the original CaloFlow [42, 43] was [0, 1] keV. The addition of noise was found to prevent the flow from fitting unimportant features. The flows in this work are trained using independent Adam optimizers [87]. The flows were trained by minimizing $-\log p_{1}(\vec{E}|E_{\text{inc}})$ for 150 epochs with a batch size of 200. An initial learning rate of $10^{-4}$ was chosen, and a multi-step learning rate schedule was used that halves the learning rate after each selected epoch milestone during training.
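The multi-step learning rate schedule can be illustrated with the short sketch below. The initial learning rate of $10^{-4}$ and the halving factor come from the text, while the milestone epochs (50 and 100) are purely hypothetical placeholders, since the selected milestones are not listed here. In PyTorch, `torch.optim.lr_scheduler.MultiStepLR` with `gamma=0.5` provides this behavior, although the text does not state which scheduler implementation was used.

```python
def multistep_lr(epoch, base_lr=1e-4, milestones=(50, 100), factor=0.5):
    """Learning rate at a given epoch for a multi-step schedule.

    The rate is multiplied by `factor` once for every milestone that has
    been reached. The default milestones are hypothetical placeholders.
    """
    n_passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** n_passed
```

With the placeholder milestones, the learning rate is $10^{-4}$ for epochs 0 to 49, $5\times10^{-5}$ for epochs 50 to 99, and $2.5\times10^{-5}$ for the remaining epochs of the 150-epoch training.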

References

  • [1] L. de Oliveira, B. Nachman and M. Paganini, Electromagnetic Showers Beyond Shower Shapes, Nucl. Instrum. Meth. A 951 (2020) 162879 [1806.05667].
  • [2] CMS collaboration, The Phase-2 Upgrade of the CMS Endcap Calorimeter.
  • [3] ATLAS Collaboration, Deep Learning for Pion Identification and Energy Calibration with the ATLAS Detector, ATL-PHYS-PUB-2020-018 (2020).
  • [4] C. Neubüser, J. Kieseler and P. Lujan, Optimising longitudinal and lateral calorimeter granularity for software compensation in hadronic showers using deep neural networks, Eur. Phys. J. C 82 (2022) 92 [2101.08150].
  • [5] N. Akchurin, C. Cowden, J. Damgov, A. Hussain and S. Kunori, On the Use of Neural Networks for Energy Reconstruction in High-granularity Calorimeters, JINST 16 (2021) P12036 [2107.10207].
  • [6] J. Kieseler, G.C. Strong, F. Chiandotto, T. Dorigo and L. Layer, Calorimetric Measurement of Multi-TeV Muons via Deep Regression, Eur.Phys.J.C 82 (2021) 79 [2107.02119].
  • [7] N. Akchurin, C. Cowden, J. Damgov, A. Hussain and S. Kunori, Perspectives on the Calibration of CNN Energy Reconstruction in Highly Granular Calorimeters, 2108.10963.
  • [8] ATLAS collaboration, Point Cloud Deep Learning Methods for Pion Reconstruction in the ATLAS Experiment.
  • [9] S.R. Qasim, N. Chernyavskaya, J. Kieseler, K. Long, O. Viazlo, M. Pierini et al., End-to-end multi-particle reconstruction in high occupancy imaging calorimeters with graph neural networks, Eur.Phys.J.C 82 (2022) 753 [2204.01681].
  • [10] F.T. Acosta, B. Karki, P. Karande, A. Angerami, M. Arratia, K. Barish et al., The Optimal use of Segmentation for Sampling Calorimeters, 2310.04442.
  • [11] ATLAS collaboration, The application of neural networks for the calibration of topological cell clusters in the ATLAS calorimeters.
  • [12] R. Haake and C. Loizides, Machine Learning based jet momentum reconstruction in heavy-ion collisions, Phys. Rev. C 99 (2019) 064904 [1810.06324].
  • [13] CMS collaboration, A Deep Neural Network for Simultaneous Estimation of b Jet Energy and Resolution, Comput. Softw. Big Sci. 4 (2020) 10 [1912.06046].
  • [14] S. Cheong, A. Cukierman, B. Nachman, M. Safdari and A. Schwartzman, Parametrizing the Detector Response with Neural Networks, JINST 15 (2020) P01030 [1910.03773].
  • [15] ALICE collaboration, Machine Learning based jet momentum reconstruction in Pb-Pb collisions measured with the ALICE detector, PoS EPS-HEP2019 (2020) 312 [1909.01639].
  • [16] CMS collaboration, Mass regression of highly-boosted jets using graph neural networks.
  • [17] R. Gambhir, B. Nachman and J. Thaler, Learning Uncertainties the Frequentist Way: Calibration and Correlation in High Energy Physics, Phys.Rev.Lett. 129 (2022) 082001 [2205.03413].
  • [18] R. Gambhir, B. Nachman and J. Thaler, Bias and Priors in Machine Learning Calibrations for High Energy Physics, Phys.Rev.D 106 (2022) 036011 [2205.05084].
  • [19] ATLAS Collaboration, New techniques for jet calibration with the ATLAS detector, Eur.Phys.J.C 83 (2023) 761 [2303.17312].
  • [20] ATLAS collaboration, Simultaneous energy and mass calibration of large-radius jets with the ATLAS detector using a deep neural network, 2311.08885.
  • [21] ALICE collaboration, Measurement of the radius dependence of charged-particle jet suppression in Pb-Pb collisions at $\sqrt{s_{\rm NN}}$ = 5.02 TeV, 2303.00592.
  • [22] M. Diefenthaler, A. Farhat, A. Verbytskyi and Y. Xu, Deeply Learning Deep Inelastic Scattering Kinematics, Eur.Phys.J.C 82 (2021) 1064 [2108.11638].
  • [23] M. Arratia, D. Britzger, O. Long and B. Nachman, Reconstructing the Kinematics of Deep Inelastic Scattering with Deep Learning, Nucl.Instrum.Meth.A 1025 (2021) 166164 [2110.05505].
  • [24] M. Leigh, J.A. Raine and T. Golling, $\nu$-Flows: conditional neutrino regression, SciPost Phys. 14 (2022) 159 [2207.00664].
  • [25] J.A. Raine, M. Leigh, K. Zoch and T. Golling, $\nu^{2}$-Flows: Fast and improved neutrino reconstruction in multi-neutrino final states with conditional normalizing flows, 2307.02405.
  • [26] A. Cukierman and B. Nachman, Mathematical Properties of Numerical Inversion for Jet Calibrations, Nucl. Instrum. Meth. A 858 (2017) 1 [1609.05195].
  • [27] ATLAS Collaboration, Simultaneous Jet Energy and Mass Calibrations with Neural Networks, Tech. Rep. ATL-PHYS-PUB-2020-001, CERN, Geneva (Jan, 2020).
  • [28] ATLAS Collaboration, Generalized Numerical Inversion: A Neural Network Approach to Jet Calibration, Tech. Rep. ATL-PHYS-PUB-2018-013, CERN, Geneva (Jul, 2018).
  • [29] D.J. MacKay, Probable networks and plausible predictions-a review of practical bayesian methods for supervised neural networks, Network: computation in neural systems 6 (1995) 469.
  • [30] R.M. Neal, Bayesian learning for neural networks, vol. 118, Springer Science & Business Media (2012).
  • [31] Y. Gal et al., Uncertainty in deep learning.
  • [32] G. Kasieczka, M. Luchmann, F. Otterpohl and T. Plehn, Per-Object Systematics using Deep-Learned Calibration, 2003.11099.
  • [33] M. Bellagente, M. Haußmann, M. Luchmann and T. Plehn, Understanding Event-Generation Networks via Uncertainties, SciPost Phys. 13 (2021) 003 [2104.04543].
  • [34] V. Mikuni and B. Nachman, High-dimensional and Permutation Invariant Anomaly Detection, 2306.03933.
  • [35] T. Finke, M. Krämer, A. Mück and J. Tönshoff, Learning the language of QCD jets with transformers, JHEP 06 (2023) 184 [2303.07364].
  • [36] E.G. Tabak and C.V. Turner, A family of nonparametric density estimation algorithms, Communications on Pure and Applied Mathematics 66 (2013) 145.
  • [37] L. Dinh, D. Krueger and Y. Bengio, Nice: Non-linear independent components estimation, arXiv preprint arXiv:1410.8516 (2014).
  • [38] D.J. Rezende and S. Mohamed, Variational inference with normalizing flows, 2015. 10.48550/ARXIV.1505.05770.
  • [39] L. Dinh, J. Sohl-Dickstein and S. Bengio, Density estimation using real nvp, arXiv preprint arXiv:1605.08803 (2016).
  • [40] G. Papamakarios, E. Nalisnick, D.J. Rezende, S. Mohamed and B. Lakshminarayanan, Normalizing flows for probabilistic modeling and inference.
  • [41] C. Winkler, D.E. Worrall, E. Hoogeboom and M. Welling, Learning likelihoods with conditional normalizing flows, CoRR abs/1912.00042 (2019) [1912.00042].
  • [42] C. Krause and D. Shih, CaloFlow: Fast and Accurate Generation of Calorimeter Showers with Normalizing Flows, Phys.Rev.D 107 (2021) 113003 [2106.05285].
  • [43] C. Krause and D. Shih, CaloFlow II: Even Faster and Still Accurate Generation of Calorimeter Showers with Normalizing Flows, Phys.Rev.D 107 (2021) 113004 [2110.11377].
  • [44] C. Krause, I. Pang and D. Shih, CaloFlow for CaloChallenge Dataset 1, 2210.14245.
  • [45] M.R. Buckley, C. Krause, I. Pang and D. Shih, Inductive CaloFlow, 2305.11934.
  • [46] S. Diefenbacher, E. Eren, F. Gaede, G. Kasieczka, C. Krause, I. Shekhzadeh et al., L2LFlows: Generating High-Fidelity 3D Calorimeter Images, JINST 18 (2023) P10017 [2302.11594].
  • [47] A. Xu, S. Han, X. Ju and H. Wang, Generative Machine Learning for Detector Response Modeling with a Conditional Normalizing Flow, 2303.10148.
  • [48] I. Pang, J.A. Raine and D. Shih, SuperCalo: Calorimeter shower super-resolution, 2308.11700.
  • [49] F. Ernst, L. Favaro, C. Krause, T. Plehn and D. Shih, Normalizing Flows for High-Dimensional Detector Simulations, 2312.09290.
  • [50] B. Nachman and D. Shih, Anomaly Detection with Density Estimation, Phys. Rev. D 101 (2020) 075042 [2001.04990].
  • [51] C. Gao, J. Isaacson and C. Krause, i-flow: High-Dimensional Integration and Sampling with Normalizing Flows, 2001.05486.
  • [52] E. Bothmann, T. Janßen, M. Knobbe, T. Schmale and S. Schumann, Exploring phase space with Neural Importance Sampling, 2001.05478.
  • [53] C. Gao, S. Höche, J. Isaacson, C. Krause and H. Schulz, Event Generation with Normalizing Flows, Phys. Rev. D 101 (2020) 076002 [2001.10028].
  • [54] M. Bellagente, A. Butter, G. Kasieczka, T. Plehn, A. Rousselot, R. Winterhalder et al., Invertible Networks or Partons to Detector and Back Again, SciPost Phys. 9 (2020) 074 [2006.06685].
  • [55] B. Stienen and R. Verheyen, Phase space sampling and inference from weighted events with autoregressive flows, SciPost Phys. 10 (2021) 038 [2011.13445].
  • [56] S. Bieringer, A. Butter, T. Heimel, S. Höche, U. Köthe, T. Plehn et al., Measuring QCD Splittings with Invertible Networks, SciPost Phys. 10 (2020) 126 [2012.09873].
  • [57] A. Hallin, J. Isaacson, G. Kasieczka, C. Krause, B. Nachman, T. Quadfasel et al., Classifying Anomalies THrough Outer Density Estimation (CATHODE), Phys. Rev. D 106 (2022) 055006 [2109.00546].
  • [58] T. Bister, M. Erdmann, U. Köthe and J. Schulte, Inference of cosmic-ray source properties by conditional invertible neural networks, Eur. Phys. J. C 82 (2022) 171 [2110.09493].
  • [59] A. Butter, T. Heimel, S. Hummerich, T. Krebs, T. Plehn, A. Rousselot et al., Generative Networks for Precision Enthusiasts, SciPost Phys. 14 (2023) 078 [2110.13632].
  • [60] R. Winterhalder, V. Magerya, E. Villa, S.P. Jones, M. Kerner, A. Butter et al., Targeting Multi-Loop Integrals with Neural Networks, SciPost Phys. 12 (2022) 129 [2112.09145].
  • [61] A. Butter, S. Diefenbacher, G. Kasieczka, B. Nachman, T. Plehn, D. Shih et al., Ephemeral Learning – Augmenting Triggers with Online-Trained Normalizing Flows, SciPost Phys. 13 (2022) 087 [2202.09375].
  • [62] R. Verheyen, Event Generation and Density Estimation with Surjective Normalizing Flows, SciPost Phys. 13 (2022) 047 [2205.01697].
  • [63] A. Butter, T. Heimel, T. Martini, S. Peitzsch and T. Plehn, Two Invertible Networks for the Matrix Element Method, SciPost Phys. 15 (2023) 094 [2210.00019].
  • [64] A. Hallin, G. Kasieczka, T. Quadfasel, D. Shih and M. Sommerhalder, Resonant anomaly detection without background sculpting, Phys. Rev. D 107 (2023) 114012 [2210.14924].
  • [65] T. Heimel, R. Winterhalder, A. Butter, J. Isaacson, C. Krause, F. Maltoni et al., MadNIS – Neural Multi-Channel Importance Sampling, SciPost Phys. 15 (2023) 141 [2212.06172].
  • [66] M. Backes, A. Butter, M. Dunford and B. Malaescu, An unfolding method based on conditional Invertible Neural Networks (cINN) using iterative training, 2212.08674.
  • [67] D. Sengupta, S. Klein, J.A. Raine and T. Golling, CURTAINs Flows For Flows: Constructing Unobserved Regions with Maximum Likelihood Estimation, 2305.04646.
  • [68] J. Ackerschott, R.K. Barman, D. Gonçalves, T. Heimel and T. Plehn, Returning CP-Observables to The Frames They Belong, 2308.00027.
  • [69] T. Heimel, N. Huetsch, R. Winterhalder, T. Plehn and A. Butter, Precision-Machine Learning for the Matrix Element Method, 2310.07752.
  • [70] T. Heimel, N. Huetsch, F. Maltoni, O. Mattelaer, T. Plehn and R. Winterhalder, The MadNIS Reloaded, 2311.01548.
  • [71] C. Bierlich, P. Ilten, T. Menzo, S. Mrenna, M. Szewc, M.K. Wilkinson et al., Towards a data-driven model of hadronization using normalizing flows, 2311.09296.
  • [72] R. Das, G. Kasieczka and D. Shih, Residual ANODE, 2312.11629.
  • [73] H. Du, C. Krause, V. Mikuni, B. Nachman, I. Pang and D. Shih, Electromagnetic + Hadronic Sampling Calorimeter Shower Images, Apr., 2024. 10.5281/zenodo.11073232.
  • [74] B. Nachman, L. de Oliveira and M. Paganini, Electromagnetic calorimeter shower images, Mendeley Data (2017).
  • [75] M. Paganini, L. de Oliveira and B. Nachman, Accelerating Science with Generative Adversarial Networks: An Application to 3D Particle Showers in Multilayer Calorimeters, Phys. Rev. Lett. 120 (2018) 042003 [1705.02355].
  • [76] M. Paganini, L. de Oliveira and B. Nachman, CaloGAN: Simulating 3D high energy particle showers in multilayer electromagnetic calorimeters with generative adversarial networks, Phys. Rev. D 97 (2018) 014021 [1712.10321].
  • [77] C. Krause, B. Nachman, I. Pang, D. Shih and Y. Zhu, Anomaly detection with flow-based fast calorimeter simulators, 2312.11618.
  • [78] G. Papamakarios, D. Sterratt and I. Murray, Sequential neural likelihood: Fast likelihood-free inference with autoregressive flows, in The 22nd international conference on artificial intelligence and statistics, pp. 837–848, PMLR, 2019.
  • [79] G. Papamakarios, Neural density estimation and likelihood-free inference, arXiv preprint arXiv:1910.13233 (2019).
  • [80] K. Cranmer, J. Brehmer and G. Louppe, The frontier of simulation-based inference, Proceedings of the National Academy of Sciences 117 (2020) 30055.
  • [81] S. Dirmeier, C. Albert and F. Perez-Cruz, Simulation-based inference using surjective sequential neural likelihood estimation, arXiv preprint arXiv:2308.01054 (2023).
  • [82] GEANT4 collaboration, GEANT4–a simulation toolkit, Nucl. Instrum. Meth. A 506 (2003) 250.
  • [83] J. Allison, K. Amako, J. Apostolakis, H. Araujo, P. Arce Dubois, M. Asai et al., Geant4 developments and applications, IEEE Transactions on Nuclear Science 53 (2006) 270.
  • [84] J. Allison, K. Amako, J. Apostolakis, P. Arce, M. Asai, T. Aso et al., Recent developments in Geant4, Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 835 (2016) 186.
  • [85] ATLAS collaboration, ATLAS liquid-argon calorimeter: Technical Design Report, Technical design report. ATLAS, CERN, Geneva (1996), 10.17181/CERN.FWRW.FOOQ.
  • [86] A. Butter, S. Diefenbacher, G. Kasieczka, B. Nachman and T. Plehn, Amplifying statistics with ensembles of generative models, in ICLR 2021 SimDL Workshop, https://simdl.github.io/files/18.pdf, 2021.
  • [87] D.P. Kingma and J. Ba, Adam: A method for stochastic optimization, 2014.
  • [88] P. Virtanen, R. Gommers, T.E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau et al., SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python, Nature Methods 17 (2020) 261.
  • [89] B. Efron, Bootstrap Methods: Another Look at the Jackknife, The Annals of Statistics 7 (1979) 1.
  • [90] D. Scott, Multivariate Density Estimation: Theory, Practice, and Visualization, John Wiley & Sons, New York, Chichester (1992).
  • [91] R. Gambhir and B. Nachman, Seeing Double: Calibrating Two Jets at Once, 2402.14067.
  • [92] C.W. Fabjan and F. Gianotti, Calorimetry for particle physics, Rev. Mod. Phys. 75 (2003) 1243.
  • [93] G. Papamakarios, T. Pavlakou and I. Murray, Masked autoregressive flow for density estimation, Advances in neural information processing systems 30 (2017).
  • [94] C. Durkan, A. Bekasov, I. Murray and G. Papamakarios, Neural spline flows, Advances in Neural Information Processing Systems 32 (2019) 7511 [1906.04032].
  • [95] M. Germain, K. Gregor, I. Murray and H. Larochelle, MADE: Masked autoencoder for distribution estimation, in International conference on machine learning, pp. 881–889, PMLR, 2015.