PITT-PACC-2311-v5 Nanosecond anomaly detection with decision trees
and real-time application to exotic Higgs decays

S. T. Roche, School of Medicine, Saint Louis University, Saint Louis, MO, USA
Q. Bayer, Department of Physics and Astronomy, University of Pittsburgh, Pittsburgh, PA, USA
B. T. Carlson, Department of Physics and Astronomy, University of Pittsburgh, Pittsburgh, PA, USA
W. C. Ouligian, Department of Physics and Astronomy, University of Pittsburgh, Pittsburgh, PA, USA
P. Serhiayenka, Department of Physics and Astronomy, University of Pittsburgh, Pittsburgh, PA, USA
J. Stelzer
T. M. Hong, corresponding author, [email protected]
(May 1, 2024)
Abstract

We present an interpretable implementation of the autoencoding algorithm, used as an anomaly detector, built with a forest of deep decision trees on field programmable gate arrays (FPGA). Scenarios at the Large Hadron Collider at CERN are considered, for which the autoencoder is trained using known physical processes of the Standard Model. The design is then deployed in real-time trigger systems for anomaly detection of unknown physical processes, such as the detection of rare exotic decays of the Higgs boson. The inference is made with a latency of 30 ns at percent-level resource usage on the Xilinx Virtex UltraScale+ VU9P FPGA. Our method offers low-latency anomaly detection for edge AI users with resource constraints.

Keywords: Data processing methods, Data reduction methods, Digital electronic circuits, Trigger algorithms, and Trigger concepts and systems (hardware and software).

Introduction

Unsupervised artificial intelligence (AI) algorithms enable signal-agnostic searches for beyond the Standard Model (BSM) physics at the Large Hadron Collider (LHC) at CERN [1]. The LHC is the highest energy proton and heavy ion collider, designed to discover the Higgs boson [2, 3] and study its properties [4, 5] as well as to probe unknown and undiscovered BSM physics (see, e.g., [6, 7, 8]). Because the collected data show no signs of BSM physics despite the plethora of searches conducted at the LHC, dedicated studies look for rare BSM events that are even more difficult to pick out among the mountain of ordinary Standard Model processes [9, 10, 11, 12, 13]. An active area of AI research in high energy physics is the use of autoencoders for anomaly detection, which provides methods to find rare and unanticipated BSM physics. Much of the existing literature, mostly using neural network-based approaches, focuses on identifying BSM physics in already collected data [14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 42, 36, 37, 38, 39, 40, 41, 43, 44, 45, 46, 47, 70, 69, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 60, 59, 61, 62, 63, 64, 65, 66, 67, 68]. Such ideas have started to produce experimental results on the analysis of data collected at the LHC [71, 72, 73, 74]. A related but separate endeavor, which is the subject of this paper, is enabling the identification of rare and anomalous data on the real-time trigger path for more detailed investigation offline.

The LHC offers an environment with an abundance of data at a 40 MHz collision rate, corresponding to the 25 ns time period between successive collisions. The real-time trigger path of the ATLAS and CMS experiments [75, 76], for example, processes data with custom electronics based on field programmable gate arrays (FPGA), followed by software trigger algorithms executed on a computing farm. The first-level FPGA portion of the trigger system accepts between 100 kHz and 1 MHz of collisions, discarding the remaining $\approx$ 99% of the collisions. Therefore, it is essential for discovery that the FPGA-based trigger system is capable of triggering on potential BSM events. A previous study aimed at LHC data has shown that an anomaly detector based on neural networks can be implemented on FPGA with latency values between 80 and 1480 ns, depending on the design [77].
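The numbers in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration; the function names are hypothetical and the inputs are the rates quoted in the text.

```python
# Back-of-the-envelope check of the trigger numbers quoted in the text.
COLLISION_RATE_HZ = 40e6  # LHC collision rate quoted above


def bunch_spacing_ns(rate_hz):
    """Time between successive collisions, in nanoseconds."""
    return 1e9 / rate_hz


def rejected_fraction(accept_rate_hz, input_rate_hz=COLLISION_RATE_HZ):
    """Fraction of collisions discarded by a trigger with the given accept rate."""
    return 1.0 - accept_rate_hz / input_rate_hz


print(bunch_spacing_ns(COLLISION_RATE_HZ))  # 25.0
for accept_hz in (100e3, 1e6):
    percent = 100.0 * rejected_fraction(accept_hz)
    print(f"accept {accept_hz / 1e3:.0f} kHz -> reject {percent:.2f}%")
```

With these inputs the rejected fraction ranges from 97.5% (at 1 MHz) to 99.75% (at 100 kHz), which is the sense in which roughly 99% of collisions are discarded.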

In this paper, we present an interpretable implementation of an autoencoder using deep decision trees that makes inferences in 30 ns. As discussed previously [78, 79], decision tree designs depend only on threshold comparisons resulting in fast and efficient FPGA implementation with minimal reliance on digital signal processors. We train the autoencoder on known Standard Model (SM) processes to help trigger on the rare events that may include BSM.

In scenarios for which a specific BSM model is targeted and its dynamics are known, a dedicated supervised training against the SM sample, i.e., BSM-vs-SM classification, would likely outperform an unsupervised approach of SM-only training. The physics scenarios considered in this paper are examples to demonstrate that our autoencoder is able to trigger on BSM scenarios as anomalies without this prior knowledge of the BSM specifics. Nevertheless, we consider a benchmark where our autoencoder outperforms the existing conventional cut-based algorithms.

Our focus is the search for Higgs bosons decaying to a pair of BSM pseudoscalars, for which sensitivity is lacking due to a bottleneck in the triggering step. We examine the scenario in which one pseudoscalar with $m_a = 10$ GeV subsequently decays to a pair of photons and the second pseudoscalar with a larger mass decays to a pair of hadronic jets, i.e., $H \rightarrow aa^{\prime} \rightarrow \gamma\gamma jj$ [80], one of the channels of the so-called exotic Higgs decays [81]. The recent result for this final state [82] does not probe the phase space corresponding to $m_a < 20$ GeV due to a bottleneck from the trigger. The study presented here considers various general experimental aspects of the ATLAS and CMS experiments to show that our tool may benefit ATLAS, CMS, and other physics programs generally. We demonstrate that the use of our autoencoder can increase signal acceptance in this region with a minimal addition to the overall trigger bandwidth.

Beyond our benchmark study, we consider an existing dataset with a range of different BSM models, referred to here as the LHC physics dataset [83], to compare our tool with the results of the previously mentioned neural network-based autoencoder designed for FPGA [77]. Lastly, the robustness of our general method is considered by training with samples having varying levels of signal contamination.

This paper uses Higgs bosons to explore the unknown using real-time computing. But more generally, such inferences made on edge AI may be of interest in other experimental setups and situations with resource constraints and latency requirements. It may also be of interest in situations in which interpretability is desirable [84].

Results

We describe the design of a decision tree-based autoencoder and the training methodology. We then present our benchmark results of a scenario in which an anomaly detector could trigger on BSM exotic Higgs decays in the real-time trigger path. As a test case, we also consider the LHC physics dataset [83], for which our results are compared with those of a neural network implementation [77]. Lastly, a study showing our autoencoder's robustness to signal contamination of the training data is presented.

Autoencoder as anomaly detector

Our autoencoder (AE) is related to, and extends beyond, those based on random forests [85, 86]. We note that there are related concepts in the literature with various levels of algorithmic sophistication [87, 88, 89, 90], but these approaches may be more challenging to implement on the FPGA. We build on the deep decision tree architecture that uses parallel decision paths of fwXmachina [78, 79]. A general discussion of the tree-based autoencoder is given below. The subsections that follow detail the ML training, the firmware design, including verification and validation, and the simulation samples.

A tree of maximum depth $D$ takes an input vector $\mathbf{x}$, encodes it to the latent space as $\mathbf{w}$, then decodes $\mathbf{w}$ to an output vector $\hat{\mathbf{x}}$. Typically both $\mathbf{x}$ and $\hat{\mathbf{x}}$ are elements of $\mathbb{R}^{V}$ while $\mathbf{w}$ is an element of $\mathbb{R}^{T}$, where $V$ is the number of input variables and $T$ is the number of trees, i.e.,

$$\mathbf{x}\quad\overbrace{\xrightarrow{\textrm{encoder}}\quad\mathbf{w}\quad\xrightarrow{\textrm{decoder}}}^{\textrm{autoencoder}}\quad\hat{\mathbf{x}}. \tag{1}$$

Typically the latent space is smaller than the input-output space, i.e., $T < V$, but it is not a requirement. A decision tree divides up the input space $\mathbb{R}^{V}$ into a set of partitions $\{P_b\}$ labeled by bin number $b$. The $b$ is a $B$-bit integer, where $B \leq 2^{D}$, since the tree is a sequence of binary splits.

The encoding occurs when the decision tree processes an input vector $\mathbf{x}$ to place it into one of the partitions labeled by $w$. If more than one tree is used, then $w$ generalizes to a vector $\mathbf{w}$. The decoding occurs when $\mathbf{w}$ produces $\hat{\mathbf{x}}$ using the same forest. The bin number $b$ corresponds to a partition in $\mathbb{R}^{V}$, which is a hyperrectangle $P_b$ defined by a set of extrema in $V$ dimensions.

A metric $d$ provides an anomaly score calculated as a distance between the input and output, $\Delta = d(\mathbf{x}, \hat{\mathbf{x}})$, which is our analogue of the loss function used in neural network-based approaches. Our choice for the estimator of $P_b$ is the dimension-wise central tendency of the training data sample in the considered bin, $\hat{\mathbf{x}} = \textrm{median}(\{\mathbf{x}\})\ \forall\ \mathbf{x} \in P_b$. The median minimizes the $L^1$ norm, or Manhattan distance, with respect to input data resembling the training sample.
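The statement that the median minimizes the $L^1$ distance can be checked numerically. The sketch below is a toy illustration on a hypothetical one-dimensional sample, not the training data of this paper; because the $L^1$ distance is a sum over variables, the same argument applies to each dimension separately.

```python
import numpy as np


def total_l1(center, sample):
    """Sum of absolute deviations of the sample from a candidate center."""
    return float(np.abs(sample - center).sum())


# Hypothetical skewed sample standing in for the events inside one bin.
rng = np.random.default_rng(seed=0)
values = rng.exponential(scale=10.0, size=1001)

median = float(np.median(values))
mean = float(values.mean())

# Scan a grid of candidate centers; none should do better than the median.
candidates = np.linspace(values.min(), values.max(), 2001)
best_scanned = min(total_l1(c, values) for c in candidates)

print("L1 total at median:", total_l1(median, values))
print("best over scanned grid:", best_scanned)
print("L1 total at mean:", total_l1(mean, values))
```

For a skewed sample such as this one the mean gives a visibly larger total than the median, which is why the median is the natural per-bin estimator when the anomaly score is an $L^1$ distance.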

The encoding and decoding are conceptually two steps, with the latent space separating the two. But, as explained in the next section, our design executes both steps simultaneously and bypasses the latent space altogether by a process we call $\star$coder (star-coder), i.e., $\hat{\mathbf{x}} = \star\mathbf{x}$,

$$\mathbf{x}~~\xrightarrow{~\star\textrm{coder}~}~~\hat{\mathbf{x}}. \tag{2}$$

Finally, the anomaly score is the sum of the $L^1$ distances for each tree in the forest, i.e.,

$$\Delta(\mathbf{x}) = d(\mathbf{x}, \star\mathbf{x}) = \sum_{\substack{\textrm{trees}\\ t}} \sum_{\substack{\textrm{vars}\\ v}} \left| x_{v} - \star x_{v,t} \right|. \tag{3}$$

When the parameters of the autoencoder are trained on known SM events, the autoencoder ideally produces a relatively small $\Delta$ when it encounters an SM event and a relatively large $\Delta$ when it encounters a BSM event. The metric sums the individual distances for variables of different types, such as angles and momenta, so the ranges of each variable must be carefully considered. At the LHC they are naturally defined by the physical constraints, e.g., 0 to $2\pi$ for angles and 0 to $p_{\textrm{T}}^{\textrm{max}}$, the kinematic endpoint, for momenta. The values are transformed to binary bits to design the firmware; see Appendix C.3 of Ref. [78] for a detailed discussion.
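The anomaly score of Equation (3) can be illustrated in software. The sketch below is a minimal stand-in, not the firmware design: it assumes each tree is stored as a flat list of hyperrectangular bins, each carrying the median of its training events, and it uses floating-point values rather than the binary integers used on the FPGA. The names `star_code` and `anomaly_score` and the toy forest are hypothetical.

```python
import numpy as np


def star_code(x, bins):
    """Return the stored reconstruction of the bin whose hyperrectangle contains x.

    `bins` is a list of (lower, upper, median) arrays of length V; a bin
    contains x if lower <= x < upper holds in every dimension.
    """
    for lower, upper, median in bins:
        if np.all(x >= lower) and np.all(x < upper):
            return median
    raise ValueError("x falls outside every bin of this tree")


def anomaly_score(x, forest):
    """Sum over trees and variables of |x_v - (star x)_{v,t}|, as in Equation (3)."""
    return float(sum(np.abs(x - star_code(x, bins)).sum() for bins in forest))


# Toy forest with V = 2 variables and T = 2 trees, each with two bins.
inf = np.inf
tree_a = [
    (np.array([-inf, -inf]), np.array([5.0, inf]), np.array([2.0, 3.0])),
    (np.array([5.0, -inf]), np.array([inf, inf]), np.array([8.0, 3.0])),
]
tree_b = [
    (np.array([-inf, -inf]), np.array([inf, 4.0]), np.array([4.0, 1.0])),
    (np.array([-inf, 4.0]), np.array([inf, inf]), np.array([4.0, 6.0])),
]
forest = [tree_a, tree_b]

print(anomaly_score(np.array([2.0, 3.0]), forest))    # 4.0, near the stored medians
print(anomaly_score(np.array([40.0, 50.0]), forest))  # 159.0, far from them
```

The bin lookup goes directly from the input to the reconstruction without materializing the latent vector, which is the idea behind the $\star$coder; the input far from all stored medians receives the larger score.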

An illustrative example of the decision tree structure is given in Supplementary Figure 1 and a demonstration of the autoencoder using the MNIST dataset [91] is given in Supplementary Figure 2.

ML training

The machine learning (ML) training of the autoencoder described here is novel and is suitable for the physics problems at hand. Qualitatively, the training puts small-sized bins around regions with high event density and large-sized bins around regions of sparse event density. An illustration of the bin sizes is given with a 2d toy example in Supplementary Figure 3, which shows the decreasing sizes of bins as the tree depth increases.

The following steps are executed. To start, $\mathbf{x} = \{x_v\} = \{x_0, x_1, \ldots, x_{V-1}\}$ is a vector of length $V$, the number of input variables, that describes the training sample $S$. (1) Initialize $s$ with $S$ in steps 2–4 and depth $d = 1$. (2) For the sample $s$, the PDF $p_v$ is the marginal distribution of the bit-integer-valued input variable $x_v$ for a given $v$. The PDF $p_m$ is the distribution of the maximum values of the set $\{p_v\}$. Sampling the maximum-weighted PDF $m \cdot p_m$ gives $\tilde{m} = m_{\tilde{v}}$ that corresponds to the $x_{\tilde{v}}$. (3) The PDF $p_{\tilde{v}}$ is for the $x_{\tilde{v}}$ under consideration. Sampling $p_{\tilde{v}}$ yields a threshold value $\tilde{c}$. (4) The sample $s$ is split by a cut $g = (x_{\tilde{v}} < \tilde{c})$. (5) Steps 2–4 are continued recursively for the two subsamples until one of two stopping conditions is met: (condition-i) the number of splits exceeds the maximum allowed depth $D$, (condition-ii) the split in step 3 produces a sample that is below the smallest allowed fraction $f$ of $S$. (6) When stopped, the procedure breaks out of the recursion by appending the requirement $g$ to the set $G$. (7) In the end, the algorithm produces a partition $G$ of the training sample called the decision tree grid (DTG) that corresponds to a deep decision tree (DDT) illustrated in Figure 1. The pseudocode given below finds $G = \textrm{DTG}(S, \emptyset, 1)$.

Figure 1: Illustration of the ML training. Data is represented as $x_1$ vs. $x_2$ (leftmost). Recursive importance sampling considers the marginalized distributions (second). A decision tree grid is constructed (third). Deep decision trees with maximum depth of 4 correspond to parallel decision paths (rightmost).

DTG(training sample $s$, partition $G$, depth $d$)
  if ($|s|/|S| < f$ or $d > D$) then
    return $G$
  end if
  • Identify the variable $x_{\tilde{v}}$ to cut on
  $p_v \leftarrow \textrm{PDF}(x_v)\ \forall\ x_v \in \mathbf{x}$    // Build set of pdfs for input variables
  $p_m \leftarrow \textrm{PDF}(\{\max(p_v)\}\ \forall\ v \in V)$    // Build pdf of max of input pdfs
  $\tilde{m} \leftarrow \textrm{sample}(m \cdot p_m)$    // Sample max-weighted pdf
  $\tilde{v} \leftarrow v\ \textrm{where}\ m_v = \tilde{m}$    // Find variable index
  • Find threshold $\tilde{c}$ to cut on $x_{\tilde{v}}$
  $\tilde{c} \leftarrow \textrm{sample}(p_{\tilde{v}})$    // Sample variable pdf
  $g \leftarrow x_{\tilde{v}} < \tilde{c}$    // Make selection
  • Build partition
  $G \leftarrow \textrm{append } g$    // Add to $G$ the new selection $g$
  • Recursively build the decision tree
  call DTG($s$ if $g$, $g$, $d+1$)    // Call DTG on subset passing $g$
  call DTG($s$ if not $g$, not $g$, $d+1$)    // Call DTG on subset failing $g$
  return $G$
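The recursion above can be sketched in Python. This is an illustrative reading of the procedure, not the authors' training code, and it makes several assumptions that are not stated in the text: the "maximum" $m_v$ is taken to be the peak height of a normalized histogram of $x_v$, the variable is chosen with probability proportional to $m_v$, the threshold is drawn by picking the value of a random training event, and the bins are returned as explicit hyperrectangles with their medians instead of a list of cuts. The function name `dtg`, the argument `n_hist`, and the toy Gaussian sample are hypothetical.

```python
import numpy as np


def dtg(s, lower, upper, depth, n_total, f, D, rng, n_hist=16):
    """Recursively partition sample s (shape [n, V]) and return a list of bins.

    Each bin is (lower, upper, median): the hyperrectangle edges and the
    per-variable median of the training events inside it.
    """
    # Stopping conditions of the pseudocode: too few events or too deep.
    if len(s) / n_total < f or depth > D:
        return [(lower, upper, np.median(s, axis=0))]

    # Assumed meaning of m_v: peak height of each normalized marginal histogram.
    peaks = np.array([
        np.histogram(s[:, v], bins=n_hist)[0].max() / len(s)
        for v in range(s.shape[1])
    ])
    # Choose the variable with probability proportional to its peak height.
    v = rng.choice(s.shape[1], p=peaks / peaks.sum())
    # Drawing one event's value is equivalent to sampling the marginal PDF.
    c = rng.choice(s[:, v])

    passing = s[:, v] < c
    if passing.all() or not passing.any():
        # Degenerate cut (c is the smallest value): keep s as a single bin.
        return [(lower, upper, np.median(s, axis=0))]

    upper_pass, lower_fail = upper.copy(), lower.copy()
    upper_pass[v], lower_fail[v] = c, c
    rest = (depth + 1, n_total, f, D, rng, n_hist)
    return (dtg(s[passing], lower, upper_pass, *rest)
            + dtg(s[~passing], lower_fail, upper, *rest))


# Toy usage on a hypothetical two-variable Gaussian sample.
rng = np.random.default_rng(seed=1)
S = rng.normal(size=(5000, 2))
V = S.shape[1]
bins = dtg(S, np.full(V, -np.inf), np.full(V, np.inf), 1, len(S),
           f=0.01, D=4, rng=rng)
print(len(bins))  # at most 2**4 bins for D = 4
```

Because both the variable and the threshold are drawn at random, calling `dtg` repeatedly with the same sample yields different trees, which is how a forest of non-identical trees could be assembled in this sketch.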

Weighted randomness in both variable selection $x_{\tilde{v}}$ and threshold selection $\tilde{c}$ allows for the construction of a forest of non-identical decision trees to provide better accuracy in the aggregate. As our ML training is agnostic to the signal process, the so-called boost weights are not relevant because misclassification does not occur in one-sample training.

An information bottleneck may exist, where the input data is compressed in the latent layer of a given autoencoder design, then subsequently decompressed for the output. For our design, the latent layer is the output of the set of decision trees $T$ in the forest. Accordingly, the latent data is the set of bin numbers from each decision tree, i.e., $\{b_0, b_1, \ldots, b_{T-1}\}$. Compression occurs if $T$ is smaller than the number of input variables $V$, i.e., $T/V < 1$. We will see later that the benchmark physics process is not compressed with $T/V$ of about four, while the LHC Physics problem is compressed with $T/V$ of about half. This demonstrates that the autoencoder does not necessarily rely on the information bottleneck, but rather on the density estimation of the feature space.

Simulated training and testing samples

The training and testing samples are generated using the Monte Carlo method that is standard practice in high energy physics. In our study, we use offline quantities for physics objects to approximate the input values provided at the trigger level, as offline-like reconstruction will be available after the High Luminosity LHC (HL-LHC) upgrade of the level-1 trigger systems of the experiments [92, 93]. A brief summary of the samples is given below (see Methods for technical details).

The training sample consists of half a million simulated proton-proton collision events at 13 TeV. It comprises a cocktail of SM processes that produce a $\gamma\gamma jj$ final state, where $j$ represents light flavor hadronic jets, weighted according to the SM cross sections.

The testing is done on half a million events of the above processes as the background sample as well as on a signal sample for the benchmark of the Higgs decay process $H_{125} \rightarrow a_{10}\,a_{70} \rightarrow \gamma\gamma jj$ with asymmetric pseudoscalar masses of 10 and 70 GeV, respectively. To show that our training is more generally applicable to other signal models beyond the benchmark, we consider an alternate cross-check scenario with a Higgs-like scalar of a smaller mass at 70 GeV, $H_{70} \rightarrow a_{5}\,a_{50} \rightarrow \gamma\gamma jj$, decaying to pseudoscalars with masses of 5 and 50 GeV, respectively.

The benchmark and the alternate cross-check samples consist of 100 k events each. The $H_{125}$ and $H_{70}$ bosons are produced by gluon-gluon fusion. MadGraph5_aMC 2.9.5 is used for event generation at leading order [94]. Decay and showers are done with Pythia8 [95]. Detector simulation and event reconstruction are done with Delphes 3.5.0 [96, 97] using the CMS card [98].

The input variables to the autoencoder depend only on the two photons and the two jets. The photons are denoted as $\gamma_1$ and $\gamma_2$, which are the two photons with the highest momenta transverse to the beam direction ($p_{\textrm{T}}$) in the event. Similarly, the two leading jets are denoted as $j_1$ and $j_2$. Photons are reconstructed in Delphes with a minimum $p_{\textrm{T}}$ of 0.5 GeV. Jets are reconstructed with the anti-$k_{\textrm{t}}$ algorithm with a minimum $p_{\textrm{T}}$ of 20 GeV. The input variables to the autoencoder include the $p_{\textrm{T}}$ of these four objects, along with invariant masses of the diphoton ($m_{\gamma\gamma}$) and dijet ($m_{jj}$) subsystems, and the Cartesian $\eta$-$\phi$ distance ($\Delta R$), where $\eta$ is the pseudorapidity variable defined using polar angle $\theta$ and $\phi$ is the azimuthal angle.
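For readers less familiar with collider kinematics, the derived input variables can be computed from the transverse momentum, pseudorapidity, and azimuthal angle of each object. The sketch below uses the standard definitions; the function names are hypothetical, and the invariant mass formula assumes massless objects, which is exact for photons and only an approximation for jets.

```python
import math


def pseudorapidity(theta):
    """Pseudorapidity from the polar angle theta (radians)."""
    return -math.log(math.tan(theta / 2.0))


def delta_phi(phi1, phi2):
    """Azimuthal separation wrapped into the interval (-pi, pi]."""
    dphi = (phi1 - phi2) % (2.0 * math.pi)
    return dphi - 2.0 * math.pi if dphi > math.pi else dphi


def delta_r(eta1, phi1, eta2, phi2):
    """Cartesian distance in the eta-phi plane."""
    return math.hypot(eta1 - eta2, delta_phi(phi1, phi2))


def invariant_mass(pt1, eta1, phi1, pt2, eta2, phi2):
    """Invariant mass of two massless objects, in the units of pT."""
    m2 = 2.0 * pt1 * pt2 * (math.cosh(eta1 - eta2) - math.cos(phi1 - phi2))
    return math.sqrt(max(m2, 0.0))


# Two photons of 5 GeV each, back to back in phi at eta = 0.
print(invariant_mass(5.0, 0.0, 0.0, 5.0, 0.0, math.pi))  # 10.0
print(delta_r(0.0, 0.0, 0.0, math.pi))                   # pi, about 3.14
```

The wrap-around in `delta_phi` matters because $\phi$ is periodic: two objects at $\phi = 0.1$ and $\phi = 2\pi - 0.1$ are separated by 0.2, not by almost $2\pi$.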

The input variable distributions for the full list of eight variables ($p_{\textrm{T}}^{\gamma 1}$, $p_{\textrm{T}}^{\gamma 2}$, $p_{\textrm{T}}^{j1}$, $p_{\textrm{T}}^{j2}$, $\Delta R_{\gamma\gamma}$, $\Delta R_{jj}$, $m_{\gamma\gamma}$, $m_{jj}$) are shown in five plots with white background in Figure 2. The left-most plots show the $p_{\textrm{T}}$ distribution for the jets and photons, along with the cuts imposed in Delphes for object reconstruction. The middle column plots show the $m_{jj}$ and two $\Delta R$ distributions; the $\Delta R_{jj}$ distribution shows a peak at $\pi$ for SM processes, which reveals the back-to-back signature in the azimuthal $\phi$ coordinate of the dijet system with respect to the beam direction. The top-right plot shows the $m_{\gamma\gamma}$ distribution with the pre-selection requirement discussed in the next section; the peak at 10 GeV for $H_{125}$ corresponds to the $a_{10}$ in the intermediate state. The bottom-right plot with the shaded background shows the $m_{\gamma\gamma}$ distribution after a cut on the anomaly score from the autoencoder, which is described in the next section.

Figure 2: Input variable distributions for $H_{125} \rightarrow a_{10}\,a_{70} \rightarrow \gamma\gamma jj$ and SM $\gamma\gamma jj$ showing (top-left) $p_{\textrm{T}}$ for the leading and subleading jet, (top-middle) $m_{jj}$ for the dijet subsystem, (top-right) $m_{\gamma\gamma}$ for the diphoton subsystem, (bottom-left) $p_{\textrm{T}}$ for the leading and subleading photon, and (bottom-middle) $\Delta R$ distance for the dijet and diphoton subsystem. The shaded panel (bottom-right) is the $m_{\gamma\gamma}$ distribution after a cut on the anomaly score of the autoencoder; this plot is normalized relative to the top-right plot before the cut.

Benchmark: Exotic Higgs decays

In order to define and quantify the gain using the autoencoder trigger in the FPGA-based systems over conventional approaches, we consider the threshold-based algorithm typically deployed at the LHC, such as at the ATLAS and CMS experiments. The most recent analysis of the $\gamma\gamma jj$ final state [82] used the diphoton ($\gamma\gamma$) trigger so we take this to be representative of the conventional approach. Moreover, as trigger performance is generally comparable between the ATLAS and CMS experiments, we take the ATLAS results from the Run-2 data taking period (2015–2018) as typical of the situation at the LHC. ATLAS reports a peak event rate of 3 kHz for a diphoton trigger in the FPGA-based first level trigger system in 2018 out of a peak total rate of about 90 kHz [99]. The threshold is $p_{\textrm{T}} > 20$ GeV for each photon at the first level trigger, but the refined threshold is 35 and 25 GeV for the leading and subleading photon, respectively, in the subsequent CPU-based high level trigger [100]. The high level values are more representative of the thresholds for which the first level trigger becomes fully efficient, so we approximate the situation by requiring 25 GeV for each of the two reconstructed photons. We consider this to be the ATLAS-inspired cut-based diphoton trigger.
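The ATLAS-inspired cut-based diphoton requirement described above amounts to a simple threshold test. The sketch below is an illustration only; the function name and the example photon momenta are hypothetical, and the 25 GeV default is the value quoted in the text.

```python
def passes_diphoton_trigger(photon_pts_gev, threshold_gev=25.0):
    """Return True if the two leading photons each exceed the pT threshold."""
    leading_two = sorted(photon_pts_gev, reverse=True)[:2]
    return len(leading_two) == 2 and all(pt > threshold_gev for pt in leading_two)


print(passes_diphoton_trigger([40.0, 30.0]))        # True
print(passes_diphoton_trigger([40.0, 12.0, 8.0]))   # False: subleading photon too soft
print(passes_diphoton_trigger([60.0]))              # False: only one photon
```

An event with fewer than two reconstructed photons fails by construction, and the decision depends only on the subleading photon once the leading one passes.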

The events of interest containing $\gamma\gamma jj$ constitute a subset of all events that pass the diphoton requirement, as $\gamma\gamma$ events accompanied by zero or one jet ($\gamma\gamma$ or $\gamma\gamma j$, respectively) would also pass. However, determining the precise composition of the events passing the diphoton trigger is a nontrivial task. For our comparisons below, we therefore consider the worst case scenario and assume that the $\gamma\gamma jj$ event rate equals the entire event rate of the diphoton trigger. This is the worst case because the more likely case, in which the $\gamma\gamma jj$ rate is less than the $\gamma\gamma$ rate, would give a more favorable result for the autoencoder in comparison.

The overall rate is estimated by comparing the fraction of the $\gamma\gamma jj$ simulated background sample accepted by the autoencoder with that accepted by the diphoton trigger, which has a known event rate. The SM processes that contribute to this trigger rate have been studied using a procedure similar to the one we describe [101]. The study identifies two dominant scenarios that yield two reconstructed photons: (1) the SM process in which $\gamma\gamma$ originate from the interaction vertex and (2) the SM process in which one photon is accompanied by a jet that has photon-like characteristics ($\gamma j$). The study shows that the shape of the $m_{\gamma\gamma}$ distribution is similar for events from the $\gamma\gamma$ and $\gamma j$ processes. Therefore, we conclude that a comparison at equal acceptance using a sample dominated by $\gamma\gamma$ is a conservative approximation for the totality of these SM processes, comprising both $\gamma\gamma$ and $\gamma j$, corresponding to the above-mentioned 3 kHz.

The diphoton trigger performance is approximated by applying the $p_{\textrm{T}}^{\gamma 2}>25$ GeV threshold, as discussed above, to the subleading reconstructed photon in the simulated sample described in the previous section. Compared to the previous results [82], we note that a non-negligible amount of $H_{125}$ passes the diphoton trigger in this study in the $m_{a}<20$ GeV region because we are assuming an offline-like reconstruction after the HL-LHC upgrade of the level-1 trigger systems of the experiments [92, 93]. In the SM sample, 0.31% of events pass this ATLAS-inspired diphoton trigger. For the benchmark Higgs $H_{125}$ decay, 2.2% of the events pass. For the alternate cross-check $H_{70}$ decay, 0.01% pass; the small acceptance is due to the soft photon spectrum from the $a_{5}$ decay.

The autoencoder trigger performance is evaluated after the following pre-selection. In both training and testing, the autoencoder is exposed only to events that (1) have two or more reconstructed photons and two or more reconstructed jets and (2) have two photons that fall within the previously unexamined range $m_{\gamma\gamma}<20$ GeV. Events that do not meet these requirements are discarded. A total of 38% of the SM background sample passes the pre-selection, as do 53% of the $H_{125}$ sample and 29% of the $H_{70}$ sample.
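The pre-selection above can be sketched in a few lines. This is a minimal illustration and not the authors' code: the event layout (a dict with `photons` and `jets` lists of `(pt, eta, phi)` tuples), the function names, and the choice to build $m_{\gamma\gamma}$ from the two leading photons treated as massless are all assumptions made for the example.

```python
import math


def diphoton_mass(p1, p2):
    """Invariant mass (GeV) of two massless particles given as (pt, eta, phi)."""
    pt1, eta1, phi1 = p1
    pt2, eta2, phi2 = p2
    m2 = 2.0 * pt1 * pt2 * (math.cosh(eta1 - eta2) - math.cos(phi1 - phi2))
    return math.sqrt(max(m2, 0.0))


def passes_preselection(event, max_mgg=20.0):
    """Require >= 2 photons, >= 2 jets, and m_gg below max_mgg (assumed layout)."""
    photons = sorted(event["photons"], key=lambda p: p[0], reverse=True)
    if len(photons) < 2 or len(event["jets"]) < 2:
        return False
    # Assumption: the mass is built from the two leading photons.
    return diphoton_mass(photons[0], photons[1]) < max_mgg


# Toy events, not taken from the simulated samples.
collimated = {"photons": [(30.0, 0.0, 0.0), (28.0, 0.1, 0.2)],
              "jets": [(40.0, 1.0, 2.0), (25.0, -0.5, -2.5)]}
back_to_back = {"photons": [(30.0, 0.0, 0.0), (28.0, 0.0, math.pi)],
                "jets": [(40.0, 1.0, 2.0), (25.0, -0.5, -2.5)]}
one_jet = {"photons": collimated["photons"], "jets": [(40.0, 1.0, 2.0)]}
```

In this toy, only the event with two nearby photons and two jets is kept; the back-to-back diphoton pair has a mass of roughly 58 GeV and is discarded.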

The autoencoder is trained using a forest of 30 decision trees at a maximum depth of 6 on the training sample of the SM process. In the training step, measured quantities corresponding to the offline reconstruction of physics objects are used as input variables. The trained autoencoder model is applied to both the testing sample of the SM considered as the background process and the benchmark $H_{125}$ sample as the signal process. In the evaluation step, offline quantities are converted to bitwise values to mimic the firmware [78]. The cross-check $H_{70}$ sample is also considered as an alternate signal process to demonstrate that the autoencoder is effective over a wide kinematic range.

Anomaly scores for each event are calculated and their distributions are shown in the top-left plot of Figure 3. The corresponding ROC curves are shown in the top-right plot of the same figure. The plots in the bottom row are for a different physics scenario, which is discussed in the next section.

Figure 3: Physics performance results. The distributions are given for anomaly scores $\Delta$ (left column) and the ROC curves (right column) for the $H\rightarrow aa^{\prime}\rightarrow\gamma\gamma jj$ scenario (top row) and the LHC physics dataset [83] (bottom row). Along with the ROC curves for the $\gamma\gamma jj$ dataset (top right), the operating points of the $p_{\textrm{T}}^{\gamma 2}>25$ GeV trigger are shown, with numerical values to compare them to the autoencoder's performance. Values shown are fractions of all events in the sample. The autoencoder is trained only on the respective Standard Model (SM$_{\gamma\gamma jj}$ and SM$_{\textrm{cocktail}}$) processes. TPR and FPR represent true and false positive rates, respectively. The plots are software-simulated results using bit integers as done in the firmware.

The autoencoder trigger achieves 6.1% acceptance for the benchmark $H_{125}$ signal at the 3 kHz SM rate, nearly triple the 2.2% value using the diphoton trigger. Similarly, the acceptance of the cross-check $H_{70}$ sample is 1.4%, a drastic increase from the negligible 0.01% value of the diphoton trigger at the same rate.
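The equal-rate comparison can be illustrated as follows: pick the anomaly-score threshold at which the autoencoder accepts the same fraction of all SM events as the reference trigger, then read off the signal acceptance at that threshold. This is a sketch under stated assumptions. The function names and toy scores are hypothetical, and the only numbers taken from the text are the 0.31% SM acceptance and 3 kHz rate quoted for the diphoton trigger.

```python
def threshold_at_background_fraction(bkg_scores, n_bkg_total, target_fraction):
    """Return a score threshold that accepts at most target_fraction of all
    background events (n_bkg_total counts events before any pre-selection)."""
    n_allowed = int(target_fraction * n_bkg_total)
    ranked = sorted(bkg_scores, reverse=True)
    if n_allowed >= len(ranked):
        return float("-inf")  # every scored event may be accepted
    # Events are accepted when score > threshold, so ties at the threshold fail.
    return ranked[n_allowed]


def acceptance(scores, n_total, threshold):
    """Fraction of all n_total events whose anomaly score exceeds threshold."""
    return sum(1 for s in scores if s > threshold) / n_total


def rate_khz(fraction, ref_fraction=0.0031, ref_rate_khz=3.0):
    """Scale an SM acceptance fraction to a rate, using the quoted diphoton
    trigger numbers (0.31% of the SM sample corresponds to 3 kHz)."""
    return fraction / ref_fraction * ref_rate_khz


# Toy numbers, not results from the paper.
bkg_scores = list(range(40))   # 40 pre-selected events out of 80 generated
sig_scores = [25, 35, 45, 50]  # 4 pre-selected events out of 8 generated
cut = threshold_at_background_fraction(bkg_scores, 80, 0.125)
print(cut, acceptance(bkg_scores, 80, cut), acceptance(sig_scores, 8, cut))
```

Dividing by the total number of generated events, rather than by the number passing the pre-selection, mirrors the convention that the quoted values are fractions of all events in the sample.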

For the FPGA cost, the configuration is run on a Xilinx Virtex UltraScale+ FPGA VCU118 Evaluation Kit (with FPGA model xcvu9p) with a clock speed of 200 MHz. Algorithm latency is 6 clock ticks (30 ns) and the interval is 1 clock tick (5 ns). About 7% of available look up tables (LUT) are used; 1% of flip flops (FF) are used; a negligible number of digital signal processors (DSP) is used; no BRAM or URAM is used. The results are summarized in the first column of Table 1.
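As a quick consistency check on the timing figures, a latency quoted in clock ticks converts to nanoseconds through the clock period. The helper names below are illustrative only; the inputs are the 200 MHz clock and the tick counts listed in Table 1.

```python
def clock_period_ns(clock_mhz):
    """Clock period in nanoseconds for a clock frequency given in MHz."""
    return 1000.0 / clock_mhz


def latency_ns(ticks, clock_mhz):
    """Latency in nanoseconds for a given number of clock ticks."""
    return ticks * clock_period_ns(clock_mhz)


print(latency_ns(6, 200))   # benchmark design in Table 1
print(latency_ns(16, 200))  # neural network result listed in Table 1
```

At 200 MHz each tick lasts 5 ns, so a 30 ns latency corresponds to 6 ticks and an 80 ns latency to 16 ticks.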

Table 1: FPGA specifications and cost. The first column describes the design for $\gamma\gamma jj$; see text for details of the signal model on which the design is tested. The second column compares our result for the LHC physics problem given in the third column [77]. For the third column, the result listed is for DNN VAE PTQ 8-bit, the highlighted configuration in Ref. [77]; the indicates that the numbers are converted from the published percentages.
| | This paper | This paper | Govorkova et al. [77] |
|---|---|---|---|
| ML training and setup | | | |
| Framework | fwXmachina | fwXmachina | hls4ml |
| Architecture | Deep decision tree | Deep decision tree | Variational autoencoder |
| Dataset | $\gamma\gamma jj$ | LHC physics [83] | LHC physics [83] |
| Input variables | 8 | 56 | 56 |
| No. of trees $T$ | 30 | 30 | NA for neural networks |
| Max. depth $D$ | 6 | 4 | NA for neural networks |
| Phys. performance | See text | Comparable to [77] | [77] |
| FPGA and firmware setup | | | |
| Chip family | Xilinx Virtex UltraScale+ | Xilinx Virtex UltraScale+ | Xilinx Virtex UltraScale+ |
| Chip model | xcvu9p-flga2104-2L-e | xcvu9p-flga2104-2L-e | xcvu9p-flgb2104-2-e |
| Platform | Vivado 2019.2 | Vitis 2022.2 | Vivado 2020.1 |
| Clock | 200 MHz, 5 ns | 200 MHz, 5 ns | 200 MHz, 5 ns |
| Precision | ap_int$\langle 8\rangle$ | ap_int$\langle 8\rangle$ | ap_fixed$\langle\textrm{varies}\rangle$ |
| FPGA cost | | | |
| Latency | 6 ticks, 30 ns | 6 ticks, 30 ns | 16 ticks, 80 ns |
| Interval | 1 tick, 5 ns | 1 tick, 5 ns | 1 tick, 5 ns |
| FF | 15k, 0.6% | 15k, 0.6% | 12k, 0.5% |
| LUT | 63k, 5.4% | 109k, 9.2% | 35k, 3% |
| DSP | 8, 0.1% | 56, 0.8% | 68, 1% |
| BRAM | 0, 0% | 0, 0% | 13, 0.3% |

Comparison: LHC physics dataset

Our autoencoder is applied to the LHC physics dataset [83] and compared to the results of the neural network implementation [77] that involves discrimination of several different BSM signals from a mixture of SM background. In this dataset, all events contain an electron with momentum transverse to the beam axis $p_{\textrm{T}}>23$ GeV and pseudorapidity $|\eta|<3.0$ or a muon with $p_{\textrm{T}}>23$ GeV and $|\eta|<2.1$. This preselection is designed to limit the data to events that would already pass a real-time single-lepton trigger. We note that this requirement limits the ability of the study to be generalized to events that do not pass an existing real-time algorithm.

The background is composed of a cocktail of Standard Model processes (SM$_{\textrm{cocktail}}$) that would pass the above-mentioned preselection, consisting of $W\rightarrow\ell\nu$, $Z\rightarrow\ell\ell$, $t\bar{t}$, and QCD multijet in proportions similar to those of $pp$ collisions at the LHC. The dataset's features are 56 variables consisting of sets of $(p_{\textrm{T}},\eta,\phi)$ from the 10 leading hadronic jets, 4 leading electrons, and 4 leading muons, along with $E_{\textrm{T}}^{\textrm{miss}}$ and its $\phi$ orientation. A cross-check using only 26 of these training variables is presented later in the section.

In our training, a forest of 30 trees at a maximum depth of 4 is trained on a training set of the SM cocktail and evaluated on both a testing portion of the SM cocktail and each of the BSM samples. As the plots in the bottom row of Figure 3 show, the anomaly detector is able to isolate all signal samples from background. The areas under the ROC curves (AUC) demonstrate comparable performance. For the TPR-FPR convention chosen in Figure 3, the area under the curve in the plot corresponds to $1-\textrm{AUC}$, i.e., an AUC of 1 is an ideal classifier. Our AUC values are listed for the four signal scenarios, with the neural network-based results for DNN VAE PTQ 8-bit, the configuration highlighted in Ref. [77], in parentheses.

  • $\textrm{LQ}_{80}\rightarrow b\tau$: AUC = 0.93 (0.92 [77]),

  • $A_{50}\rightarrow 4\ell$: AUC = 0.93 (0.94 [77]),

  • $h^{0}_{60}\rightarrow\tau\tau$: AUC = 0.85 (0.81 [77]), and

  • $h^{\pm}_{60}\rightarrow\tau\nu$: AUC = 0.94 (0.94 [77]).

For the scenarios, the masses of the resonances are given in the subscript. Like the background, each signal scenario requires at least one electron or muon above the previously mentioned trigger threshold in the final state. The samples with $\tau$ lepton final states are dominated by the leptonic decays because of the trigger selection. Our AUC performance is comparable to the range of previous results [77].
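For readers less familiar with the metric, the AUC can be read as the probability that a randomly chosen signal event receives a higher anomaly score than a randomly chosen background event, with ties counted as one half. The sketch below computes it by direct pair counting; it is an illustrative implementation with made-up scores, not the procedure used to obtain the values listed above.

```python
def roc_auc(signal_scores, background_scores):
    """AUC as the fraction of (signal, background) pairs in which the signal
    event has the higher score; tied pairs count as one half."""
    wins = 0.0
    for s in signal_scores:
        for b in background_scores:
            if s > b:
                wins += 1.0
            elif s == b:
                wins += 0.5
    return wins / (len(signal_scores) * len(background_scores))


# Toy scores, for illustration only.
print(roc_auc([3, 4], [1, 2]))  # perfectly separated samples
print(roc_auc([2, 4], [1, 3]))  # partially overlapping samples
print(roc_auc([1, 2], [1, 2]))  # identical samples
```

Pair counting scales with the product of the sample sizes, so a sort-based or histogram-based computation would be preferable for samples of realistic size.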

For the FPGA cost, the configuration is run on an xcvu9p FPGA with a clock speed of 200 MHz. With similar physics performance compared to previous results [77], our FPGA resource utilization is comparable to the low end of the range of FF and LUT usage, with lower DSP and BRAM usage. Our design yields a lower latency value at six clock ticks (30 ns) and an interval of one clock tick (5 ns), which is the lower bound of the range given. The results are summarized in the second column of Table 1.

As a cross-check of our FPGA cost, we implemented two additional designs. The first cross-check uses only 26 variables on the same xcvu9p FPGA at 200 MHz. Due to the nature of the samples, many of the features are zero-valued, e.g., very few events have more than 3 jets. Therefore, we train with a subset of 26 input variables consisting of the $(p_{\textrm{T}},\eta,\phi)$ for the 4 leading jets, 2 leading electrons, and 2 leading muons, along with $E_{\textrm{T}}^{\textrm{miss}}$ and its $\phi$ orientation. The AUC using only 26 variables agrees with the 56 variable result above to within a percent. The design is executed with a similar latency of seven ticks (35 ns) and the same interval of one tick (5 ns). However, the resource usage is significantly less than the 56 variable configuration at 9k FF, 61k LUT, 26 DSP, and no BRAM.
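The variable counts quoted for the two configurations follow directly from the object multiplicities: three kinematic quantities per object plus two for the missing transverse energy and its azimuthal angle. A small helper (hypothetical name) makes the bookkeeping explicit.

```python
def n_features(n_jets, n_electrons, n_muons, per_object=3, n_met=2):
    """Number of inputs: (pT, eta, phi) per object plus ETmiss and its phi."""
    return per_object * (n_jets + n_electrons + n_muons) + n_met


print(n_features(10, 4, 4))  # full configuration
print(n_features(4, 2, 2))   # reduced cross-check configuration
```

The full configuration gives 3 x 18 + 2 = 56 inputs and the reduced one gives 3 x 8 + 2 = 26.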

The second cross-check uses the 26 variable configuration on a smaller FPGA, the Xilinx Zynq UltraScale+ xczu7ev. The FPGA cost is nearly identical to that reported above. The design is executed with the same latency and interval; the resource usage is within 5% of the above values.

We note that the differences in the FPGA cost with respect to previous results [77] may be due to a number of factors. The factors include differences in the ML architecture as well as details about the FPGA configuration such as model compression methods, the number of bits per input, the type of input representation, such as fixed-point precision, and Xilinx versions.

With respect to the last item in the list, both Vivado HLS and Vitis HLS have been used to synthesize our designs, with the latter being the more recent version of the same platform. Both are platforms that synthesize C code into an RTL implementation. For the benchmark scenario, the Vivado result is given in Table 1. The corresponding result using Vitis produced an increased latency of 4 more ticks at the same clock speed, a 50% increase in flip flops, and a 30% increase in LUT, with no change in DSP or BRAM. We have generally used Vivado to synthesize our designs, but it had difficulty with large designs such as the second configuration in Table 1. Although Vitis yielded a less performant FPGA design compared to Vivado for the benchmark, Vitis was able to synthesize the larger configuration for the comparison.

Signal-contaminated training

A promising use case of the anomaly detector is to use collected data to train the autoencoder itself, rather than to use simulated samples, and to deploy it on subsequent incoming data. In this scenario, while the majority of the training sample would remain background, a fraction would consist of signal since the data would contain the signal that would cause the anomaly. To study the autoencoder’s performance using incoming data, we consider the results from the models trained with various levels of signal-contaminated simulated SM samples.

In Figure 4, we show a family of ROC curves with varying levels of signal contamination in the training sample from 1% to a third of the total number of events. As expected, there is degradation of performance with increasing fraction of the signal contamination in the training dataset. Nevertheless, training the autoencoder with a sample that has 33% contamination still outperforms the ATLAS-inspired diphoton trigger with about a factor of two higher H125subscript𝐻125H_{125}italic_H start_POSTSUBSCRIPT 125 end_POSTSUBSCRIPT acceptance at the same SM rate. Our findings are consistent with the anomaly detection study that reported a similar behavior for percent-level signal contamination [19].
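A contaminated training sample of the kind studied here can be assembled by drawing a chosen fraction of events from the signal sample and the remainder from the SM sample. The sketch below is one possible way to do this; the function name, the fixed seed, and the placeholder events are assumptions for illustration and do not describe the authors' procedure.

```python
import random


def contaminated_sample(background, signal, signal_fraction, n_total, seed=0):
    """Draw n_total training events, of which signal_fraction come from signal.

    Events are sampled without replacement, so each input list must be large
    enough to supply its share; random.sample raises ValueError otherwise.
    """
    rng = random.Random(seed)
    n_sig = round(signal_fraction * n_total)
    n_bkg = n_total - n_sig
    sample = rng.sample(signal, n_sig) + rng.sample(background, n_bkg)
    rng.shuffle(sample)
    return sample


# Placeholder events tagged by origin, standing in for feature vectors.
background = [("sm", i) for i in range(1000)]
signal = [("h125", i) for i in range(500)]
train_1pct = contaminated_sample(background, signal, 0.01, 300)
train_33pct = contaminated_sample(background, signal, 0.33, 300)
```

Training one model per contamination level on such samples, and evaluating each on an uncontaminated test set, would yield a family of ROC curves analogous to those described above.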

Figure 4: ROC curves showing the SM$_{\gamma\gamma jj}$ acceptance vs. $H_{125}$ efficiency for different contaminated mixtures of $H_{125}$ used to train the autoencoder. The legend indicates the percentage of the training sample consisting of $H_{125}$ with the rest consisting of the SM sample, i.e., the uncontaminated autoencoder is trained only on the SM sample. The 3 kHz line and the values for the uncontaminated autoencoder trigger and the $2\gamma$ trigger match those of the top-right plot in Figure 3. The plot shows software-simulated results using bit integers as done in the firmware.

For the benchmark physics process, an approximate upper bound of the signal contamination is estimated to be 1%. This bound considers known SM processes [94] and assumes that all Higgs bosons [102] decay to the $\gamma\gamma jj$ final state. Therefore, the resistance to contamination at the percent level, like that demonstrated in the study above, is promising for the rare BSM signals sought in high energy physics experiments. A possible experimental setup to prepare for varying levels of contaminated data could be to employ a set of autoencoder triggers trained with varying levels of simulated signal contamination. A sketch of the setup is given in the Supplementary Figure 4.

Discussion

An implementation of a decision tree-based autoencoder anomaly detector was presented. The fwXmachina framework is used to implement the algorithm on FPGA with the goal of conducting real-time anomaly detection for physics beyond the Standard Model at real-time trigger systems at high energy physics experiments. The implementation is tested on two problems: detection of exotic Higgs decays to $\gamma\gamma jj$ through pseudoscalar intermediates and an LHC physics anomaly detection dataset [83]. In both problems, the ML is trained only on background processes and evaluated on both signal and background. The anomaly detector shows the promise to identify several different realistic exotic signals that may be seen at a trigger system with comparable physics performance to existing neural network-based anomaly detectors. The efficient firmware implementation and low latency of 30 ns are well suited for the timing constraints of FPGA-based first level triggers at LHC experiments.

A study of classifier performance with signal contamination shows the promise for the possibility to train on the collected data at the LHC. If the collected data already has BSM processes mixed in that we are trying to discover, then this possibility allows one to train the ML with the data anyway then deploy it on future data to detect the BSM signal [103]. These approaches may also be of interest at the HL-LHC, which will increase the rate of proton collisions at the cost of higher background levels.

Existing approaches of the real-time trigger path anomaly detector, including the one in this paper, make assumptions about the availability of the preprocessed objects such as electrons that are reconstructed from more basic inputs such as calorimetric data. The next step would consider such inputs ranging from 1 k to 100 M channels, depending on the experimental setup, which may require a drastic redesign of existing approaches.

An added advantage of using decision tree-based anomaly detectors such as the algorithm presented here is that it allows for interpretability. As Figure 1 and Supplementary Figure 3 demonstrate, it is possible to examine the cuts used to construct the decision trees either by examining the feature space or the constructed trees. This enables visual interpretation of the anomaly detection. The large majority of autoencoders rely on neural networks and other black box models that have resisted easy interpretation [84] of the latent space and intermediate node values. Interpretability may be desirable in understanding trigger behavior in high energy physics when disentangling BSM events from flaws in the apparatus leading to similar anomalous signals. Fields in which black box models are undesirable may also find our tool useful.

A challenging aspect of the analysis of anomalous events, which may affect other methods as well, is that the mapping of the input space to the anomaly score is not necessarily unique due to the Jacobian arising from the coordinate transformation [66]. That is, how rare a given event is depends on the choice of variables. In such cases, the events selected by a threshold on the score can be studied with variables orthogonal to the input space [74] or the latent space of the autoencoder [48]. Adding to the difficulty is what to do with the selected anomalous sample. We list three ideas in the literature that may help identify the BSM events in this sample. The first two methods use variables orthogonal to the input space. First, a bump hunt was conducted using invariant masses in [74]. Second, a control sample could be obtained using a sideband to help identify the BSM events in the sample of anomalous events [70, 69]. Lastly, an analysis of the latent space could help separate BSM from the other events [48]. For any of these methods, the BSM may not populate smoothly across the anomalous score distribution, so the BSM fraction would likely be extracted by a statistical treatment. As is commonly done in high energy physics, e.g., [104], a simultaneous maximum likelihood fit can extract the BSM composition in the various subsamples.
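The dependence on the choice of variables can be made concrete with the standard change-of-variables formula for probability densities. The one-dimensional toy case below is our illustration and is not taken from Ref. [66].

```latex
% Density of y = g(x) for a monotonic map g, with x(y) = g^{-1}(y):
p_Y(y) = p_X\big(x(y)\big)\,\left|\frac{dx}{dy}\right|
% Toy case: x uniform on (0,1) and y = x^2, so that x = \sqrt{y}:
p_X(x) = 1, \qquad
\frac{dx}{dy} = \frac{1}{2\sqrt{y}}, \qquad
p_Y(y) = \frac{1}{2\sqrt{y}} \quad \text{for } 0 < y < 1
```

In the variable $x$ no region is rarer than any other, whereas in the variable $y$ events near $y=1$ are less dense than those near $y=0$, even though both describe the same sample.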

Methods

Details of simulated samples

To test the autoencoder's performance in real-time triggers, we consider so-called Monte Carlo samples, which are produced by a multistage process that simulates the proton collisions producing our final state, followed by a simulation of the detector effects.

We produced a sample of one million simulated proton-proton collision events in the SM composed of all processes that produce the $\gamma\gamma jj$ final state, which we consider the background process during the evaluation of physics performance.

Additionally, two signal samples of one hundred thousand events each that simulate the production and decay of scalar bosons are generated, which we consider the anomaly processes. Scalar bosons produced from the gluon-gluon fusion production mode in proton-proton collisions are decayed as $H_{125}\rightarrow a_{10}a_{70}$ and $H_{70}\rightarrow a_{5}a_{50}$. The lighter $a$ decays to $\gamma\gamma$ and the heavier $a^{\prime}$ decays to $jj$. All samples, both background and anomaly, use the Higgs effective field theory model in MadGraph5_aMC 2.9.5 [94].

The input variables are the reconstructed values calculated by Delphes 3.5.0 [96, 97]. Jets are reconstructed with the anti-$k_{\textrm{t}}$ algorithm with a radius parameter $R=0.4$ and a minimum $p_{\textrm{T}}$ of 20 GeV [107]. Photons are reconstructed with a radius parameter of $R=0.2$ and a minimum $p_{\textrm{T}}$ of 0.5 GeV. All samples are produced with the above-mentioned MadGraph5 and decayed and showered with Pythia8 [95]. Detector simulation and event reconstruction are performed with Delphes, which uses the CMS card to simulate the behavior of the CMS detector [98]. We note that the similarities between the physics capabilities of the CMS and ATLAS detectors allow a generic interpretation of the results presented in the next section. Without mitigation, multiple proton-proton interactions (pileup) impact the number of jets reconstructed in each event. Due to the importance of hadronic jets in the HL-LHC, a variety of algorithms have been proposed for removing pileup contributions in jets [108, 109, 110], and therefore we neglect the effects of pileup. More details can be found with the samples [105]. The input variable distributions are given in Figure 2.

Firmware design

The structure of the firmware is based on fwXmachina [78, 79]. The Autoencoder Processor, whose block diagram is shown in Figure 5, takes in input data and outputs the anomaly score. In the firmware implementation, we approximate $\mathbb{R}$ of the input-output space by $N$-bit integers $\mathbb{Z}_{N}$.
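One simple way to picture the approximation of real-valued inputs by $N$-bit integers is a linear mapping of a chosen range onto the available integer codes, with out-of-range values clamped. The sketch below is only an illustration of that idea: the function name, the unsigned encoding, and the example range are assumptions, and the actual conversion used by fwXmachina is not specified in this section.

```python
def to_n_bit(value, lo, hi, n_bits=8):
    """Map a real value in [lo, hi] to an unsigned n_bits integer code.

    Assumed scheme: linear scaling with clamping of out-of-range values.
    """
    levels = (1 << n_bits) - 1          # largest code, e.g. 255 for 8 bits
    clamped = min(max(value, lo), hi)   # saturate instead of wrapping around
    return round((clamped - lo) / (hi - lo) * levels)


# Hypothetical example: a transverse momentum in GeV mapped from [0, 100].
print(to_n_bit(0.0, 0.0, 100.0), to_n_bit(100.0, 0.0, 100.0))
print(to_n_bit(150.0, 0.0, 100.0), to_n_bit(-5.0, 0.0, 100.0))
```

Saturating rather than wrapping keeps an unusually large input at the top of the range, which preserves its ordering relative to the decision tree cuts.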

Figure 5: Block diagram of the Autoencoder Processor for anomaly detection with a $T$-dimensional latent layer corresponding to a forest of $T$ decision trees. The design uses the Deep Decision Tree Engine [79] as both encoder and decoder, with the bin index shown only schematically, as the latent data is implicit.

In the diagram, input enters from the left and copies are distributed to $T$ deep decision trees, each tree corresponding to one latent dimension. Once the outputs of the engine are available, the distance processor computes the $\Delta$ with respect to the input. The Deep Decision Tree Engine (DDTE) [79] is modified to output a vector of values. The Distance Processor takes the outputs of DDTE and computes the distance for each set of outputs followed by a sum.
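The data flow just described can be mimicked in software: each tree maps the input to the bin that contains it and returns that bin's stored estimate, and the anomaly score sums the input-output distances over all trees. The sketch below is a toy model, not the firmware. The nested-dict tree layout, the cut values, the stored estimates, and the use of an L1 distance are all assumptions made for illustration.

```python
def reconstruct(tree, x):
    """Walk one tree to the leaf (bin) containing x and return that bin's
    stored estimate; the bin index itself is never materialized."""
    node = tree
    while "estimate" not in node:
        node = node["low"] if x[node["var"]] < node["cut"] else node["high"]
    return node["estimate"]


def anomaly_score(forest, x):
    """Sum over trees of the L1 distance between x and its reconstruction."""
    return sum(
        sum(abs(a - b) for a, b in zip(x, reconstruct(tree, x)))
        for tree in forest
    )


# Hypothetical depth-2 tree over two integer inputs.
toy_tree = {
    "var": 0, "cut": 50,
    "low": {"var": 1, "cut": 50,
            "low": {"estimate": (20, 20)}, "high": {"estimate": (25, 75)}},
    "high": {"var": 1, "cut": 50,
             "low": {"estimate": (80, 30)}, "high": {"estimate": (70, 80)}},
}
forest = [toy_tree, toy_tree]
print(anomaly_score(forest, (55, 70)))
```

An input close to a bin's estimate, such as (22, 21) in this toy, yields a small score, while an input far from the estimate of its bin yields a large one. In firmware the trees are evaluated in parallel rather than in a loop.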

We note that further modification of DDTE would allow for efficient transmission of compressed data [111], but is beyond the scope of this paper.

Verification and validation

We validate and verify our design using the benchmark physics scenario.

For validation of our algorithm, first we run $\mathcal{O}(10^{5})$ test vectors through our design using C simulation in Vivado HLS and compare the outputs to the expected firmware outputs simulated in Python. Then co-simulation is done, which creates an RTL model of the design, simulates it, and compares the RTL model against the C design. In all cases, the simulation outputs match the expected outputs.

For the physical verification of our algorithm, we program select configurations onto the xcvu9p at a clock speed of 200 MHz, which is the setup used for the benchmark results in this paper. We test a handful of test vector inputs and use the Xilinx Integrated Logic Analyzer IP core to observe the outputs. In all cases, the outputs match the expected outputs received from software and co-simulation.

Data availability

Two datasets were used in this paper. The $\gamma\gamma jj$ data generated by us for this study have been deposited in Mendeley Datasets under DOI 10.17632/44t976dyrj.1 and are cited as Ref. [105]. The LHC physics dataset was taken from Ref. [83] and is publicly available in Zenodo under DOIs 10.5281/zenodo.3675210, 10.5281/zenodo.3675206, 10.5281/zenodo.3675203, 10.5281/zenodo.3675199, and 10.5281/zenodo.5046388.

Code availability

The repository with the files to evaluate the FPGA performance is publicly available at D-Scholarship@Pitt, which is an institutional repository for the research output of the University of Pittsburgh [106]. More specifically, the IP core design for the benchmark scenario is available along with a testbench and associated test vectors.

General information about fwXmachina can be found at http://fwx.pitt.edu.

References

Acknowledgements

We thank David Shih, Matthew Low, Joseph Boudreau, James Mueller, Elliot Lipeles, Dylan Rankin, and Eli Ullman-Kissel for the physics discussions. We thank Kushal Parekh, Stephen Racz, Brandon Eubanks, Yuvaraj Elangovan, and Kemal Emre Ercikti for the firmware discussions. We thank Santiago Cané for assistance in testing. We thank Gracie Jane Gollinger for computing infrastructure support. TMH was supported by the US Department of Energy [award no. DE-SC0007914]. JS was supported by the US Department of Energy [award no. DE-SC0012704]. BTC was supported by the National Science Foundation [award no. NSF-2209370]. STR was supported by the Emil Sanielevici Undergraduate Research Scholarship.

Author contribution statement

STR, JS, WCO, and TMH designed the ML training algorithm. QB and PS implemented and tested the firmware design. STR, BTC, and TMH created the simulated dataset and performed the physics analysis. STR led the project execution while TMH managed and coordinated the overall effort. TMH and STR drafted and edited the manuscript with significant inputs from BTC. All authors reviewed the manuscript.

Competing interests statement

The authors declare competing interests. TMH, BTC, JS, STR, and QB have filed a patent on the firmware design of the autoencoder with the University of Pittsburgh. It is currently pending as US Patent Application Publication No. US 2024/0054399. Other authors declare no competing interests.

Supplementary Information

Supplementary Figure 1: Illustrative example of $\star$coder as two visual representations of the same decision tree. Deep decision tree (left) rendered as the decision tree grid (center) and implemented by the parallel decision paths (right). The two-depth deep decision tree (DDT) is the encoder (step 1) shown as a conventional binary split diagram; the latent space is the bin number (step 2); the latent space data is decoded using the decision tree grid (DTG) (step 3); and the simultaneous encoding and decoding with the $\star$coder (star-coder) architecture (right) is represented by the parallel decision paths (PDP) of Ref. [79]. The DTG is the visualization as a grid of partitions in $V$-dimensional space. In this example, the input $\mathbf{x}=(55,70)$ yields the output $\hat{\mathbf{x}}=(27,25)$ without needing to explicitly produce the latent layer.
Supplementary Figure 2: Demonstration of the decision tree-based autoencoder for data transmission and anomaly detection using the MNIST dataset, which is a set of images of handwritten numbers converted to $28 \times 28$ pixels, i.e., an input vector of length $V = 784$, with $N = 8$ bits per pixel. The ML training is done on 15k images of handwritten 0 to 4, but not 5 to 9, on one tree, $T = 1$, at a maximum depth of $D = 20$. The output is a 784-length vector with 8 bits per pixel. The data compression-decompression factor, the ratio of input-output bits to the latent space dimensions, $V \cdot N / (T \cdot D) = 784 \cdot 8 / (1 \cdot 20)$, is about 300. The figure shows two input-output pairs as examples. The output of 4 resembles 4 while the output of 6 is garbled. The former yields a smaller input-output distance than the latter. The input data shown here are not part of the training sample.
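The compression-decompression factor quoted in the caption of Supplementary Figure 2 can be checked with a few lines of arithmetic. The sketch below only evaluates the ratio $V \cdot N / (T \cdot D)$ with the values given in the caption; the function name `compression_factor` is a hypothetical helper introduced for illustration.

```python
def compression_factor(V, N, T, D):
    """Ratio of input-output bits (V * N) to latent space dimensions (T * D)."""
    return V * N / (T * D)


# Values from the caption: 784 pixels, 8 bits per pixel, 1 tree, maximum depth 20.
factor = compression_factor(V=784, N=8, T=1, D=20)
print(factor)
```

The result is 6272 / 20, consistent with the "about 300" stated in the caption.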
Supplementary Figure 3: Toy dataset and ML training with varying maximum depth $D$. The top-left plot shows the training sample, where each data point is represented by a 2d coordinate. The top-right plot shows the input-output distance $\Delta$ for various $D$. The anomaly score distribution shows the RMS shrinking with $D$ when evaluated on a sample similar to the training sample. The bottom rows of plots show the result of the ML training. In each partition, a dot ($\bullet$) indicates the estimate $\hat{\mathbf{x}}$, the location of the median in each dimension of the data in that bin, corresponding to the bin that $\mathbf{x}$ resides in. With the median points one can visualize the refinement of the reconstruction of the original dataset with increasing $D$.
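The idea in the caption of Supplementary Figure 3 can be illustrated with a minimal Python sketch: a tree partitions the training sample, each bin stores the per-dimension median as the estimate $\hat{\mathbf{x}}$, and the anomaly score is the input-output distance. Several choices here are assumptions made for illustration and are not taken from the paper: the split rule (cut the dimension with the largest spread at its median), the L1 distance, the Gaussian toy data, and all function names.

```python
import math
import random
from statistics import median


def build_tree(points, depth):
    """Recursively partition points; each leaf stores the per-dimension median."""
    dims = len(points[0])
    leaf = {"leaf": tuple(median(p[k] for p in points) for k in range(dims))}
    if depth == 0 or len(points) < 2:
        return leaf

    def spread(k):
        return max(p[k] for p in points) - min(p[k] for p in points)

    # Assumed split rule (not from the paper): widest dimension, cut at its median.
    dim = max(range(dims), key=spread)
    cut = median(p[dim] for p in points)
    left = [p for p in points if p[dim] < cut]
    right = [p for p in points if p[dim] >= cut]
    if not left or not right:
        return leaf
    return {"dim": dim, "cut": cut,
            "left": build_tree(left, depth - 1),
            "right": build_tree(right, depth - 1)}


def reconstruct(tree, x):
    """Encode x to its bin and decode to that bin's stored median point."""
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["dim"]] < tree["cut"] else tree["right"]
    return tree["leaf"]


def anomaly_score(tree, x):
    """Input-output distance; the L1 norm is an assumption of this sketch."""
    return sum(abs(a - b) for a, b in zip(x, reconstruct(tree, x)))


rng = random.Random(0)


def sample(n):
    """Hypothetical 2d toy data: one Gaussian cluster."""
    return [(rng.gauss(50, 10), rng.gauss(50, 10)) for _ in range(n)]


train, test = sample(2000), sample(500)
rms_by_depth = {}
for depth in (1, 2, 4, 6):
    tree = build_tree(train, depth)
    scores = [anomaly_score(tree, x) for x in test]
    # Root mean square of the scores on a sample similar to the training sample.
    rms_by_depth[depth] = math.sqrt(sum(s * s for s in scores) / len(scores))
    print(depth, round(rms_by_depth[depth], 2))

# A point far from the training sample, scored with the deepest tree.
outlier_score = anomaly_score(tree, (150.0, 150.0))
print("outlier", round(outlier_score, 2))
```

In this toy setup the score on the test sample shrinks as the maximum depth grows, while a point far from the training data keeps a large score because no bin median lies near it.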
Supplementary Figure 4: Illustration of the ML training with varying levels of signal contamination (top) and the real-time inference (bottom). This setup can help prepare for the scenario where the autoencoder is trained using the incoming data itself.