License: arXiv.org perpetual non-exclusive license
arXiv:2312.06645v1 [cs.CV] 11 Dec 2023

Beyond Classification: Definition and Density-based Estimation of Calibration in Object Detection

Teodora Popordanoska
ESAT-PSI, KU Leuven

Aleksei Tiulpin
HST Research Unit, University of Oulu

Matthew B. Blaschko
ESAT-PSI, KU Leuven

Abstract

Despite their impressive predictive performance in various computer vision tasks, deep neural networks (DNNs) tend to make overly confident predictions, which hinders their widespread use in safety-critical applications. While there have been recent attempts to calibrate DNNs, most of these efforts have primarily been focused on classification tasks, thus neglecting DNN-based object detectors. Although several recent works addressed calibration for object detection and proposed differentiable penalties, none of them are consistent estimators of established concepts in calibration. In this work, we tackle the challenge of defining and estimating calibration error specifically for this task. In particular, we adapt the definition of classification calibration error to handle the nuances associated with object detection, and predictions in structured output spaces more generally. Furthermore, we propose a consistent and differentiable estimator of the detection calibration error, utilizing kernel density estimation. Our experiments demonstrate the effectiveness of our estimator against competing train-time and post-hoc calibration methods, while maintaining similar detection performance.

1 Introduction

Calibration is a property of a model, which directly translates to the ability to estimate its own predictive uncertainty, thereby facilitating safe and responsible deployment. Intuitively, a well-calibrated model produces confidence scores that accurately reflect the uncertainty associated with its predictions. In deep neural networks (DNNs), which are driving most of the current progress in machine learning and computer vision, this has become an emerging concern, as they can yield wrong predictions with high confidence [8, 28]. Unreliable uncertainty estimates produced by such models can lead to erroneous decision-making, which is especially risky in fields like autonomous driving [39] and medicine [6]. The field of model calibration typically studies three sub-problems: estimating calibration error, regularization during training, and post-hoc calibration. While in recent years there have been multiple advances in all of these domains, most of the existing literature studies calibration in the context of classification. Calibration in structured prediction problems, such as object detection (OD), has received substantially less attention, despite the increasing integration of such models in many safety-critical applications. While some attempts have been made to define, measure, enforce, and adjust calibration [15, 24, 27, 25], we argue that the field of OD lacks a solid mathematical foundation in both the definition of calibration and the estimation of calibration errors.

[Plot: calibration error × 100 for Base, TS, TCD, and Ours.]
Figure 1: Calibration error of RetinaNet on Pascal VOC. The model without calibration (Base) is compared with post-hoc (temperature scaling; TS) and train-time calibration methods (TCD and Ours). Our KDE-based estimator effectively reduces the calibration error. The error bars represent 95% CI.

In spite of its importance, defining calibration for object detection is a challenging problem due to the variability of DNN-based detectors. In particular, there are ambiguities related to how many bounding boxes are returned, how to select the detection confidence threshold below which detections are rejected, which intersection over union (IoU) threshold to use for considering a detection “correct”, and so forth. There have been a few recent attempts to define calibration by either replacing accuracy with precision in the definition of calibration for classification [15, 29, 25], or by requiring that both classification and localization performances jointly match the confidence score [27].

In this paper, we propose a general framework that unifies existing definitions of calibration in OD, and is flexible enough to parametrize the notion of a “correct” detection. Beyond defining calibration, we derive a consistent and differentiable estimator of calibration error in OD, relying on a kernel density estimator (KDE) recently proposed for classification [30], and recommended among the best practices [21] for assessing calibration of classifiers. Due to its consistency and differentiability, our estimator of calibration error can be used not only as a reliable tool to assess calibration, but also in calibration-regularized training of popular detectors. We perform extensive experiments on MS COCO [18], Cityscapes [4], and PASCAL VOC [7], and demonstrate that our estimator, when incorporated as an auxiliary loss during training, consistently reduces the calibration error across several object detectors. Furthermore, we show that finetuning with our loss is an effective way to improve calibration, while maintaining comparable average precision (AP), and without the need to re-train the models from scratch.

In summary, our contributions are:

  1. We propose a general unifying definition of calibration in OD, which addresses the nuances related to assessing the “correctness” of a detection.

  2. We develop a consistent and differentiable estimator of calibration error based on KDE [30] for OD.

  3. We perform rigorous empirical analysis of calibration of popular object detectors on several datasets.

  4. We demonstrate that our estimator can be used as an auxiliary train-time loss, which effectively reduces the calibration error, while maintaining similar average precision (AP).

2 Related work

2.1 Calibration in classification

Most of the prior work on developing techniques for improving calibration in the vision domain targets the image classification task. Post-hoc methods [8, 20, 41, 40] re-scale the output of a trained model using parameters learned on a validation set. The most successful method of this group is temperature scaling [8], where a single parameter $T$ is learned, usually by minimizing the negative log-likelihood (NLL), to scale the logits before applying the softmax function. Train-time calibration methods [30, 14, 10, 22, 23, 17] typically incorporate a calibration-related auxiliary loss directly into the training process of a neural network. Even though these methods may introduce computational overhead, they do not require a hold-out validation set and have demonstrated superior performance in handling domain shift [12], compared to post-hoc methods.
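As a rough illustration of temperature scaling, the sketch below divides the logits by a scalar $T$ before the softmax and picks $T$ by minimizing NLL on held-out data. The grid search, the function names, and the toy validation set are assumptions made for this example; implementations typically use a gradient-based optimizer instead.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logits_batch, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(z, temperature)[y])
              for z, y in zip(logits_batch, labels)]
    return sum(losses) / len(losses)

def fit_temperature(logits_batch, labels, grid=None):
    """Pick T from a grid by minimizing validation NLL (grid search is an assumption)."""
    if grid is None:
        grid = [0.5 + 0.05 * i for i in range(91)]  # candidate values from 0.5 to 5.0
    return min(grid, key=lambda t: nll(logits_batch, labels, t))

# Hypothetical validation set: an overconfident classifier that is wrong on one of four samples.
val_logits = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
val_labels = [0, 0, 1, 1]

T = fit_temperature(val_logits, val_labels)
top_confidence = max(softmax(val_logits[0], T))  # confidence after scaling
```

In this toy setting the fitted temperature is larger than 1, which softens the scores so that the top confidence moves toward the observed accuracy of 75%.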

2.2 Calibration in object detection

Recent works have shown that miscalibration is a concern not only in classification tasks, but also in the domain of object detection [15, 29]. Pathiraja et al. [29] showed that existing calibration methods for classification are not as effective for calibrating object detectors. As a result, there have been a few efforts to establish the definition of calibration [15, 27] and to devise techniques to mitigate the issue. Oksuz et al. [27] propose a binning-based metric called Localisation-aware Expected Calibration Error ($\operatorname{LaECE}$), and use it as part of existing post-hoc calibration approaches like linear regression, histogram binning [41] and isotonic regression [40]. In the realm of train-time calibration methods, several auxiliary loss terms have been proposed to be used in addition to the detection-specific loss. For instance, TCD [24] and MCCL [29] were proposed to jointly calibrate the class-wise confidences and localization performance. Another recently introduced auxiliary loss is BPC [25], which is based on a heuristic that maximizes confidence scores for accurate predictions while minimizing scores for inaccurate predictions. Harakeh and Waslander [9] propose an energy score to train probabilistic detectors, which is empirically shown to improve calibration.

However, none of the proposed binning-based metrics and trainable auxiliary losses fulfill both requirements of being consistent and differentiable estimators of calibration error in the context of object detection. In this paper, we derive a novel estimator for calibration error in object detection with those desired properties.

3 Methods

3.1 Defining calibration

Several notions of different strength exist for multi-class calibration, including top-label, class-wise, and canonical calibration [8, 35, 36]. However, following recent trends of training the classification branch of object detectors in a multi-label setting, i.e. by using $K$ independent binary classifiers [19, 34, 31], we also favor the approach of defining calibration by considering $K$ binary object detectors, as the mutual-exclusion principle of multi-class classification does not hold (e.g. the area of an image occupied by a person may overlap with that of a bicycle).

Intuitively, calibration requires that the confidence score of a prediction should be aligned with some notion of correctness. In classification tasks, a correct prediction is determined by the match between the predicted and ground truth labels. However, in object detection, assessing correctness is more nuanced, since it involves evaluating the degree of overlap between two sets of pixels (e.g. bounding boxes). To account for this, we will introduce a similarity measure (e.g. $\operatorname{IoU}$) and a so-called link function, so that we can define a family of notions of calibration that also consider the degree of correctness of a given prediction. The formal definitions are given below.

Binary classification

Let $X\in\mathcal{X}$ and $Y\in\mathcal{Y}=\{0,1\}$ be random variables denoting the input and target, given by a joint distribution $P$. Let $f$ be a neural network with $f(X)=\hat{S}$, where $\hat{S}\in[0,1]$ denotes the model’s confidence that the label is 1. Then $f$ is said to satisfy binary calibration [2, 40, 13] if:

$$\mathbb{P}\left(Y=1\mid\hat{S}=s\right)=s,\quad\forall s\in[0,1]. \tag{1}$$

Object detection

Given a set of images $\{X_{i}\}_{1\leq i\leq n}$, the goal of an object detector is to predict bounding boxes, class labels and confidence scores for the objects in $\{X_{i}\}$. Let $g_{k}$ be a binary object detector for class $k$ with $g_{k}(X_{i})=\{(\hat{S}_{ij},\hat{B}_{ij})\}_{1\leq j\leq m_{ik}}$, where $m_{ik}$ boxes are predicted for image $X_{i}$ and class $k$, $\hat{B}_{ij}\in\mathcal{B}=\mathbb{R}^{4}$ denotes a predicted bounding box,¹ and $\hat{S}_{ij}\in[0,1]$ its corresponding score.

¹ We note that in general this can be any set of pixels, not necessarily limited to the ones defined by a bounding box. We adopt this notation for convenience, but it does not change the mathematical principles involved.

Let $L:\mathcal{B}\times\mathcal{B}\rightarrow[0,1]$ denote a similarity measure, for example $\operatorname{IoU}$, Dice score, Hamming similarity, etc. Let $\psi:[0,1]\rightarrow[0,1]$ denote a monotonic function, which we will refer to as a link function. Some choices for the link function in the context of object detectors may include an identity function $\psi(L)=L$, or a piece-wise linear ramp function parametrized with $\alpha\leq\beta\in[0,1]$ as:

$$\psi(L)=\begin{cases}0&L\leq\alpha\\ \frac{L-\alpha}{\beta-\alpha}&\alpha<L<\beta\\ 1&L\geq\beta\end{cases}. \tag{2}$$

Two special cases of this function that could be of interest are a step (threshold) function ($0<\alpha=\beta\leq 1$), and a piece-wise linear (hinge) function ($\alpha=0.5,\beta=1$).
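The ramp link of Equation (2) and its two special cases can be sketched in a few lines of Python. The function names are illustrative, and the handling of the tie at $L=\alpha=\beta$ (mapped to 1, in line with the threshold reading $L\geq\beta$) is an assumption of this sketch.

```python
def ramp_link(l, alpha, beta):
    """Piece-wise linear ramp link of Equation (2), for 0 <= alpha <= beta <= 1."""
    if not 0.0 <= alpha <= beta <= 1.0:
        raise ValueError("expected 0 <= alpha <= beta <= 1")
    if l >= beta:
        # Checked first so that alpha == beta yields psi(beta) = 1 (assumption).
        return 1.0
    if l <= alpha:
        return 0.0
    return (l - alpha) / (beta - alpha)

def identity_link(l):
    """Identity link psi(L) = L."""
    return l

def step_link(l, threshold=0.5):
    """Threshold link: the ramp with alpha == beta == threshold."""
    return ramp_link(l, threshold, threshold)

def hinge_link(l):
    """Hinge link: the ramp with alpha = 0.5 and beta = 1."""
    return ramp_link(l, 0.5, 1.0)

similarities = [0.3, 0.5, 0.75, 1.0]
step_values = [step_link(l) for l in similarities]
hinge_values = [hinge_link(l) for l in similarities]
```

All three links are monotonic on $[0,1]$, so they only differ in how much partial overlap counts toward correctness.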

Let $Z:=\psi(L(\hat{B},B^{*}))$ be a random variable that denotes the “correctness” of a detection, where $B^{*}$ is the ground-truth box that $\hat{B}$ is matched with. If $\psi$ is a threshold function, then $Z\in\{0,1\}$ denotes whether the predicted bounding box matched a ground-truth box with $L\geq\beta$, and if $\psi$ is an identity function, then $Z\in[0,1]$ represents the degree of “correctness” of the detection, e.g. the IoU between the predicted and ground-truth boxes. Then we define calibration for object detection as:

$$\mathbb{P}\left(Z\mid\hat{S}=s\right)=s,\quad\forall s\in[0,1]. \tag{3}$$

The introduction of a link function $\psi$ provides a concise way to relate the notion of calibration to existing literature. For example, by letting $\psi$ be a threshold function and taking $L=\operatorname{IoU}$, we recover the definition of calibration used in [15, 24, 25, 29], i.e., $\mathbb{P}(Z=1\mid\hat{S}=s)=s,\ \forall s\in[0,1]$, with $Z=1$ denoting an accurate detection in the sense that $\operatorname{IoU}(\hat{B},B^{*})\geq\beta$. Similarly, by using an identity function for $\psi$, we obtain the definition proposed in [27, Equation (3)], i.e., $\mathbb{P}(Z=\operatorname{IoU}(\hat{B},B^{*})\mid\hat{S}=s)=s,\ \forall s\in[0,1]$. Note that in both cases the requirement that the class prediction matches the ground truth class is simplified since we are considering a binary detector and all predictions share the same class as the ground truth.
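To make the correctness variable $Z$ concrete, the following sketch computes the IoU of two axis-aligned boxes and applies a link function to it. The `(x1, y1, x2, y2)` box format, the function names, and the convention that an unmatched detection has similarity 0 are assumptions of this example, not details taken from the paper.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def correctness(pred_box, gt_box, link):
    """Z = psi(L(B_hat, B*)) with L = IoU.

    gt_box=None marks an unmatched detection, assumed here to have similarity 0.
    """
    similarity = 0.0 if gt_box is None else iou(pred_box, gt_box)
    return link(similarity)

threshold_link = lambda l: 1.0 if l >= 0.5 else 0.0  # step link with beta = 0.5
identity_link = lambda l: l

pred = (0.0, 0.0, 2.0, 2.0)
gt = (1.0, 0.0, 3.0, 2.0)  # overlaps half of the predicted box

z_threshold = correctness(pred, gt, threshold_link)  # binary notion of correctness
z_identity = correctness(pred, gt, identity_link)    # graded notion (the IoU itself)
```

Here the two boxes share an area of 2 out of a union of 6, so the same detection counts as incorrect under the threshold link but receives partial credit under the identity link.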

3.2 Measuring calibration

A predictive function trained with risk minimization on a continuous domain will be exactly calibrated with probability 0. We therefore need to measure a notion of calibration error. Analogous to the definition of calibration error for classification [8], we define the $L_{1}$ calibration error for a binary object detector $g_{k}$ for the $k$-th class as:

$$\operatorname{CE}(g_{k})=\mathbb{E}\left[\left|\mathbb{E}[Z\mid\hat{S}=s]-s\right|\right]. \tag{4}$$

Estimating calibration error

In practice, we rely on finite samples to estimate the quantities of interest. One of the desired properties of such estimators is consistency, i.e., the estimator $\widehat{\operatorname{CE}}(g_{k})$ should converge in probability to the true value $\operatorname{CE}(g_{k})$ [11]: $\underset{n\to\infty}{\operatorname{plim}}\,\widehat{\operatorname{CE}}(g_{k})=\operatorname{CE}(g_{k})$. Another desired property for calibration error estimators is differentiability, as it allows the estimator to be directly optimized alongside the task-specific loss function. Before presenting our estimator, which possesses both properties, we will introduce existing estimators and discuss their shortcomings.

In classification, one of the most widely used estimators of calibration error is the $L_{1}$ binned estimator, commonly referred to as ECE [8]. Recent studies have proposed extensions of this estimator to the detection task, called detection ECE ($\operatorname{D-ECE}$) [15] and Localisation-aware ECE ($\operatorname{LaECE}$) [27], to be used as finite sample estimators of the notion of calibration as defined by a threshold link and an identity link, respectively. Analogous to ECE, both $\operatorname{D-ECE}$ and $\operatorname{LaECE}$ partition the predictions into $M$ equally-spaced bins. $\operatorname{D-ECE}$ is computed by taking a weighted average of the absolute difference between precision and confidence in each bin:

$$\operatorname{D-ECE}=\sum_{m=1}^{M}\frac{|\mathcal{D}_{m}|}{|\mathcal{D}|}\left|\mathrm{prec}(m)-\mathrm{conf}(m)\right|, \tag{5}$$

where $\mathcal{D}_{m}$ is the set of detections in the $m$-th bin, $|\mathcal{D}|$ is the total number of detections, $\mathrm{prec}(m)$ denotes the precision and $\mathrm{conf}(m)$ the average confidence over samples in the $m$-th bin. $\operatorname{LaECE}$ is computed as an average of $K$ per-class calibration errors obtained as:

$$\operatorname{LaECE}^{k}=\sum_{m=1}^{M}\frac{|\mathcal{D}_{m}^{k}|}{|\mathcal{D}^{k}|}\left|\mathrm{prec}^{k}(m)\times\mathrm{IoU}^{k}(m)-\mathrm{conf}^{k}(m)\right|. \tag{6}$$

Several works have discussed the numerous flaws of binning estimators [35, 36, 5, 1], such as sensitivity to the binning scheme, asymptotic inconsistency in many cases, and curse of dimensionality. Moreover, these estimators are not differentiable and require additional approximations to be integrated as part of the optimization process.
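For concreteness, the binned estimator of Equation (5) can be sketched as below. The function name, the bin-assignment rule for a confidence of exactly 1, and the toy data are assumptions of this sketch. Evaluating it with two different numbers of bins on the same detections also illustrates the sensitivity to the binning scheme mentioned above.

```python
def d_ece(confidences, correct, num_bins=10):
    """Binned D-ECE of Equation (5) with equally spaced confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for s, z in zip(confidences, correct):
        m = min(int(s * num_bins), num_bins - 1)  # a score of 1.0 goes to the last bin
        bins[m].append((s, z))
    error = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        conf = sum(s for s, _ in members) / len(members)  # average confidence in the bin
        prec = sum(z for _, z in members) / len(members)  # fraction of correct detections
        error += len(members) / n * abs(prec - conf)
    return error

# Hypothetical detections: confidence scores and binary correctness (threshold link).
scores = [0.95, 0.95, 0.95, 0.95, 0.35, 0.35]
hits = [1, 1, 0, 0, 0, 1]

ece_10_bins = d_ece(scores, hits, num_bins=10)
ece_1_bin = d_ece(scores, hits, num_bins=1)
```

With ten bins the over-confident and under-confident groups are measured separately, whereas a single bin lets their errors partially cancel, so the two estimates differ on identical data.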

As a remedy, Zhang et al. [42] and Popordanoska et al. [30] recently proposed kernel density-based [33] estimators of calibration error, which are consistent, differentiable, and have more favorable statistical and computational properties. Therefore, we extend the estimator of [30] to the detection task. That is, we wish to find an estimator:

$$\widehat{\operatorname{CE}(g_{k})}=\frac{1}{w}\sum_{v=1}^{w}\left|\widehat{\mathbb{E}[Z\mid\hat{S}=s_{v}]}-s_{v}\right|, \tag{7}$$

where $w$ denotes the number of detections across images.

We will focus on deriving an estimator of the conditional expectation, depending on the definition of calibration resulting from using two distinct link functions $\psi$ in Equation (3). In case $\psi$ is a threshold function, $Z$ is a discrete random variable, and the estimator obtained from this link function is identical to the one proposed in [30]. However, in case $\psi$ is a continuous function such as the identity, $Z$ is a continuous random variable, for which we derive the estimator below.

Let $Z$ and $\hat{S}$ be continuous random variables with joint density $p_{Z,\hat{S}}(z,s)$. We wish to find an estimator for the conditional expectation:

\begin{align}
\mathbb{E}[Z\mid\hat{S}=s] &= \int_{0}^{1}z\,p_{Z|\hat{S}}(z\mid s)\,dz \tag{8}\\
&= \frac{1}{p_{\hat{S}}(s)}\int_{0}^{1}z\,p_{Z,\hat{S}}(z,s)\,dz. \tag{9}
\end{align}

We derive an estimator for $p_{Z,\hat{S}}(z,s)$ using KDE:

$$p_{Z,\hat{S}}(z,s)\approx\frac{1}{w}\sum_{u=1}^{w}\underbrace{k_{s}(\hat{S},s_{u})\,k_{z}(Z,z_{u})}_{k(Z,\hat{S},z_{u},s_{u})}, \tag{10}$$

where $k_{s}$ and $k_{z}$ denote any consistent kernels over their respective domains. The resulting density estimate remains consistent [37, Theorem 6]. Therefore, for the integral in Equation (9), we have:

\begin{align}
\int_{0}^{1}z\,p_{Z,\hat{S}}(z,s)\,dz &\approx \int_{0}^{1}z\,\frac{1}{w}\sum_{u=1}^{w}k_{s}(\hat{S},s_{u})\,k_{z}(Z,z_{u})\,dz \tag{11}\\
&= \frac{1}{w}\sum_{u=1}^{w}k_{s}(\hat{S},s_{u})\int_{0}^{1}z\,k_{z}(Z,z_{u})\,dz. \tag{12}
\end{align}

The final integral $\int_{0}^{1}z\,k_{z}(Z,z_{u})\,dz$ converges to $z_{u}$ as the bandwidth of $k_{z}$ approaches zero: the density estimate approaches the true density, and the kernel $k_{z}$ approaches a Dirac delta function centered at $z_{u}$. Finally, the resulting estimator can be written as:

$$\widehat{\mathbb{E}[Z\mid\hat{S}]}=\frac{\sum_{u=1}^{w}k(\hat{S},s_{u})\,z_{u}}{\sum_{u=1}^{w}k(\hat{S},s_{u})}. \tag{13}$$
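The limiting argument behind Equation (13) can be checked numerically. The sketch below assumes a Gaussian kernel for $k_{z}$ (the derivation allows any consistent kernel) and approximates the integral of $z\,k_{z}(z,z_{u})$ over $[0,1]$ with a midpoint rule for decreasing bandwidths. The function names, the value of `z_u`, and the bandwidths are chosen for illustration only.

```python
import math

def gaussian_kernel(z, center, bandwidth):
    """Gaussian density with mean `center` and standard deviation `bandwidth`."""
    u = (z - center) / bandwidth
    return math.exp(-0.5 * u * u) / (bandwidth * math.sqrt(2.0 * math.pi))

def first_moment_on_unit_interval(center, bandwidth, steps=20000):
    """Midpoint-rule approximation of the integral of z * k_z(z, center) over [0, 1]."""
    dz = 1.0 / steps
    total = 0.0
    for i in range(steps):
        z = (i + 0.5) * dz
        total += z * gaussian_kernel(z, center, bandwidth) * dz
    return total

z_u = 0.7
bandwidths = (0.5, 0.1, 0.02)
# Absolute gap between the integral and z_u for each bandwidth.
gaps = [abs(first_moment_on_unit_interval(z_u, h) - z_u) for h in bandwidths]
```

For a wide kernel, a noticeable part of the mass falls outside $[0,1]$ and the integral is far from $z_{u}$; as the bandwidth shrinks, the gap shrinks with it, which is the behavior the derivation relies on.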

Plugging Equation (13) back into Equation (7), and generalizing $z_{u}$ with a link function, we obtain the following estimator of the $L_{1}$ calibration error for a binary object detector $g_{k}$, a given link function $\psi$ and a similarity measure $L$:

$$\widehat{\operatorname{CE}(g_{k})}=\frac{1}{w}\sum_{v=1}^{w}\left|\frac{\sum_{u\neq v}k(s_{v},s_{u})\,\psi(L(b_{u},b_{u}^{*}))}{\sum_{u\neq v}k(s_{v},s_{u})}-s_{v}\right|. \tag{14}$$

The estimator takes values in $[0,1]$; it is consistent, asymptotically unbiased [30], and differentiable almost everywhere, as desired.
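To make Equation (14) concrete, the following is a minimal pure-Python sketch of the leave-one-out ratio estimator. It is an illustration under stated assumptions rather than the released implementation: the names `beta_kernel` and `kde_calibration_error`, the fixed bandwidth `h=0.05`, and the specific Beta-kernel parameterization (a Beta density with shape parameters $s_v/h+1$ and $(1-s_v)/h+1$, evaluated at $s_u$) are choices made here for illustration only.

```python
import math


def beta_kernel(center, point, h):
    """Beta density with shapes (center/h + 1, (1 - center)/h + 1) at `point`.

    Assumed kernel form for this sketch; `h` is the bandwidth.
    """
    a = center / h + 1.0
    b = (1.0 - center) / h + 1.0
    x = min(max(point, 1e-12), 1.0 - 1e-12)  # keep the logarithms finite at 0 and 1
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x))


def kde_calibration_error(scores, targets, h=0.05):
    """Leave-one-out estimate in the spirit of Equation (14).

    scores  : confidence scores s_v in [0, 1]
    targets : psi(L(b_u, b_u*)) per detection, e.g. 0/1 for a threshold link
              or the overlap itself for an identity link
    """
    w = len(scores)
    total, used = 0.0, 0
    for v in range(w):
        num = den = 0.0
        for u in range(w):
            if u == v:  # leave the v-th detection out of its own estimate
                continue
            k = beta_kernel(scores[v], scores[u], h)
            num += k * targets[u]
            den += k
        if den > 0.0:  # skip points whose kernel weights all underflow to zero
            total += abs(num / den - scores[v])
            used += 1
    return total / used if used else 0.0
```

The double loop costs $O(w^2)$ kernel evaluations, which is acceptable for a sketch; a practical version would vectorize the kernel matrix and use an automatic-differentiation framework so that the estimate can serve as a loss.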

3.3 Train-time calibration

A common approach to designing calibration methods is to add a calibration-related auxiliary loss to the task-specific loss that is minimized during training [24, 25, 29]. In this section, we will show that existing auxiliary losses are not consistent estimators of calibration error. Instead, we propose to use our estimator derived in the previous section.

Munir et al. [24] proposed an auxiliary loss $\operatorname{TCD}=\frac{1}{2}(d_{cls}+d_{det})$. Adapting the notation from [24, Equations (8) and (9)], we have that $d_{cls}=\mathbb{E}[|\hat{S}-Y|]$ and $d_{det}=\mathbb{E}[|\operatorname{IoU}-\hat{S}|]$. We will derive a relationship between the classification term $d_{cls}$ and the classification calibration error, as well as between the localization term $d_{det}$ and the detection calibration error. The calibration error for binary classification is given by $\operatorname{CE}_{cls}:=\mathbb{E}[|\mathbb{E}[Y=1\mid\hat{S}]-\hat{S}|]$, whereas $\operatorname{CE}_{det}:=\mathbb{E}[|\mathbb{E}[\operatorname{IoU}\mid\hat{S}]-\hat{S}|]$ defines the calibration error for detection when $\psi$ is the identity function.

Proposition 3.1.

The classification calibration error upper bounds $d_{cls}$: $\operatorname{CE}_{cls}\geq d_{cls}$.

Proposition 3.2.

The detection calibration error is upper bounded by $d_{det}$: $\operatorname{CE}_{det}\leq d_{det}$.
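As an illustration of the argument behind Proposition 3.2 (a sketch only, not a replacement for the full proof), the bound follows from the tower property together with the conditional form of Jensen's inequality applied to the convex function $|\cdot|$, using the definitions of $\operatorname{CE}_{det}$ and $d_{det}$ stated above:

```latex
\begin{align*}
\operatorname{CE}_{det}
  &= \mathbb{E}\bigl[\,\bigl|\mathbb{E}[\operatorname{IoU}\mid\hat{S}]-\hat{S}\bigr|\,\bigr]
   = \mathbb{E}\bigl[\,\bigl|\mathbb{E}[\operatorname{IoU}-\hat{S}\mid\hat{S}]\bigr|\,\bigr] \\
  &\leq \mathbb{E}\bigl[\,\mathbb{E}\bigl[\,|\operatorname{IoU}-\hat{S}|\mid\hat{S}\,\bigr]\bigr]
   = \mathbb{E}\bigl[\,|\operatorname{IoU}-\hat{S}|\,\bigr]
   = d_{det}.
\end{align*}
```

Equality requires $\operatorname{IoU}-\hat{S}$ to keep a constant sign given $\hat{S}$ almost surely. In particular, for a perfectly calibrated detector $\operatorname{CE}_{det}=0$, while $d_{det}$ vanishes only if $\operatorname{IoU}=\hat{S}$ almost surely.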

The proofs are given in the Appendix. Thus, through Jensen's inequality, we show that neither of these terms is a consistent estimator of any established notion of calibration error. We note that the auxiliary loss components defined in [29, Equations (8) and (9)] have the same functional form as $d_{cls}$ and $d_{det}$. BPC [25, Equation (9)] is based on a heuristic that maximizes the confidence scores of accurate predictions and minimizes the scores of inaccurate ones. Therefore, none of the existing auxiliary losses is based on a consistent estimator of calibration error. Consequently, a principled way to minimize calibration error during training is to combine our KDE-based estimator $\widehat{\operatorname{CE}}$, the only consistent and differentiable estimator of calibration error for object detection, with the task-specific loss $\mathcal{L}_{det}$ as:

$$\mathcal{L}=\mathcal{L}_{det}+\lambda\,\widehat{\operatorname{CE}}, \tag{15}$$

where $\lambda$ is a regularization parameter set by cross-validation.

4 Experiments

In this section, we investigate the performance of our estimator, both as a metric to evaluate calibration error, and as an auxiliary loss function used for train-time calibration.

4.1 Setup

Datasets

We use the following three datasets:

  1. MS COCO [18] is a widely used object detection dataset that consists of more than 330000 object instances belonging to 80 object categories. It contains 118K training images and 5K validation images.

  2. Cityscapes [4] is an urban driving scene dataset, consisting of 2975 training and 500 validation images split into 8 object categories.

  3. PASCAL VOC [7] has 20 classes, which are a subset of those in MS COCO. Following other works, for training we combine the VOC 2007 and VOC 2012 trainval sets, resulting in a total of 16551 images and more than 40000 objects.

We create separate validation sets with three different seeds by splitting the original train set into new train and validation sets with a 90:10 ratio. As the labels for the test sets of COCO and Cityscapes are not publicly available, the original val set is used for reporting the results; for VOC we evaluate on the VOC 2007 test set.
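The seeded 90:10 split described above can be sketched as follows. This is a minimal illustration, not the authors' data pipeline: the function name `split_train_val`, the seed values `0, 1, 2`, and the use of a uniform random shuffle are assumptions; only the 90:10 ratio, the use of three seeds, and the 16551-image VOC training set come from the text.

```python
import random


def split_train_val(image_ids, seed, val_fraction=0.1):
    """Shuffle with a fixed seed and hold out `val_fraction` of the ids for validation."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # local RNG, so the global random state is untouched
    n_val = int(round(len(ids) * val_fraction))
    return ids[n_val:], ids[:n_val]  # (train ids, validation ids)


# Three (hypothetical) seeds give three different 90:10 splits of the same training set.
splits = {seed: split_train_val(range(16551), seed) for seed in (0, 1, 2)}
```

Because the generator is seeded locally, each split is reproducible, and results can be averaged over the three splits to report a mean and a spread.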

Object detectors

We use three popular object detectors with different architectures, which have achieved strong performance in this task:

  1. Faster R-CNN (F-RCNN) [32] is a popular two-stage detector with a softmax classifier. The two-stage approach first generates a sparse set of object proposals, then refines them to accurately locate and classify the objects in the image. Two-stage detectors typically perform better than one-stage detectors, but they are slower and require more computational resources.

  2. RetinaNet [19] is a one-stage detector with sigmoid classifiers. Using this approach, objects are detected with a single pass through the network. RetinaNet applies a set of predefined anchor boxes to each feature level and predicts the objectness score and class probabilities for each anchor. In spite of their higher computational efficiency, one-stage detectors may have lower accuracy and struggle with small objects.

  3. FCOS [34] is another common one-stage detector with sigmoid classifiers. Compared to RetinaNet, FCOS predicts the object bounding boxes and their class probabilities without requiring an anchor-based mechanism. Both FCOS and RetinaNet use focal loss, which aims to address the class imbalance problem and is also known to help with calibration.

Baselines

Every object detector we consider is trained with a combination of classification and regression losses, which we refer to as the task-specific loss; training with this loss alone is the first baseline we compare to. Due to its strong performance, we include temperature scaling (TS) as a representative of the post-hoc calibration methods. As a train-time baseline, we compare with the recently proposed auxiliary loss function TCD [24], which is added to the task-specific loss.

Metrics

For reporting detection performance we use the COCO-defined metrics. Namely, average precision (AP) denotes an average across categories and over 10 IoU overlap thresholds (.50:.05:.95) used for matching. In some experiments, we also report AP@0.5, AP@0.75, and AP for small, medium and large objects, as defined in the COCO challenge [18]. For measuring calibration error, we use our KDE-based estimator $\widehat{\operatorname{CE}}$ (Equation (14)) with different link functions throughout the experiments. We use the same breakdown across object sizes and IoU overlaps as the one used for AP. In addition, we report the D-ECE metric (Equation (5)) used in [24, 25, 29], both at IoU@0.5 and averaged over 10 thresholds (.50:.05:.95). In our synthetic experiment, we also compare with LaECE [27].
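The ingredients of these metrics can be sketched in a few lines. The snippet below is an illustration under assumptions, not the evaluation code used for the reported numbers: it takes the similarity measure to be the IoU of axis-aligned boxes in `(x1, y1, x2, y2)` format, and the names `box_iou`, `threshold_link`, `identity_link` and `IOU_THRESHOLDS` are hypothetical.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def threshold_link(iou, beta=0.5):
    """psi(L) = 1 if the overlap reaches the threshold beta, else 0."""
    return 1.0 if iou >= beta else 0.0


def identity_link(iou):
    """psi(L) = L: the calibration target is the overlap itself."""
    return iou


# The ten overlap thresholds .50:.05:.95 used for averaging.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]
```

With the threshold link, evaluating at a single `beta` corresponds to a metric such as the one reported at IoU@0.5, while averaging over `IOU_THRESHOLDS` mirrors the .50:.05:.95 protocol.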

Implementation

Traditionally, detections are obtained in two steps: generating predictions and post-processing. The latter step filters out redundant or low-confidence detections through procedures like Non-Maximum Suppression (NMS) and top-$k$ selection, where typically $k=100$ for COCO. Choosing the optimal value for $k$, or a threshold $\gamma$ below which detections are rejected, is an open area of research [26, 27]. For simplicity, in our experiments we set the score threshold $\gamma=0.5$ for our main results, and show an ablation study of the effect of this parameter on the reported metrics. Our implementation for evaluating CE on a test set is based on the official API for assessing detection performance on COCO, i.e., we utilize the same information about matching predicted and ground truth boxes and IoU overlaps as for evaluating AP. We use a Beta kernel [3] and set the bandwidth parameter with a leave-one-out maximum likelihood procedure. For D-ECE we adopt the implementation from Küppers et al. [15]. We rely on Detectron2 [38] for the implementation and training configuration of the detectors. The code and trained models are available at https://github.com/tpopordanoska/calibration-object-detection.
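The score-threshold and top-$k$ part of the post-processing step can be sketched as below. This is a simplified illustration, not Detectron2's implementation: the function name `postprocess` and the `(score, box)` pair representation are assumptions, and the input is assumed to have already passed through NMS.

```python
def postprocess(detections, score_threshold=0.5, top_k=100):
    """Keep detections scoring at least `score_threshold`, then the `top_k` best.

    `detections` is a list of (score, box) pairs for one image, assumed to be
    the output of NMS already.
    """
    kept = [d for d in detections if d[0] >= score_threshold]
    kept.sort(key=lambda d: d[0], reverse=True)  # highest confidence first
    return kept[:top_k]
```

For example, `postprocess([(0.9, "a"), (0.4, "b"), (0.7, "c")])` drops the detection scored 0.4 and returns the remaining two in descending order of confidence. Both parameters change which detections enter the calibration estimate, which is why their effect is worth an ablation.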

4.2 Results

Here we empirically showcase the utility of our proposed estimator and compare it with related work. The main purpose of an estimator of calibration error is to assess the extent to which a notion of calibration is violated. Therefore, we start our experiments by comparing the different variants of our KDE-based estimator (obtained with different choices for the link function) with the binning-based estimators D-ECE and LaECE. Subsequently, we provide a comprehensive evaluation of $\widehat{\operatorname{CE}}$ and the standard AP metric, at different IoU thresholds and object sizes. Finally, we demonstrate the differentiability of our estimator by incorporating it as an auxiliary loss function during training, both from scratch and for fine-tuning. We note that the estimator could also be integrated as part of a post-hoc method.

Comparison of metrics

D-ECE measures the notion of calibration obtained by letting $\psi$ be a threshold function, whereas LaECE is used for assessing calibration as obtained through an identity function. We compare these binning-based estimators to the corresponding KDE-based estimators on a synthetic binary problem, where the ground truth calibration error is known. Similar to [30], we apply temperature scaling with $t_{1}=0.6$ to a uniform sample of numbers in $[0,1]$. The purpose of this step is to ensure that the confidence scores are concentrated around $0$ and $1$, as in a realistic scenario. Then, we sample the labels, denoting $\mathbbm{1}[\operatorname{IoU}(\hat{B},B^{*})\geq\beta]$, according to that distribution, and therefore obtain perfect calibration. In order to simulate miscalibration, we perform another temperature scaling with $t_{2}=0.6$. D-ECE is calculated using 20 equally-spaced bins, as in [15]. For LaECE we use 25 bins, following [27], and compute the weighted difference between $t_{1}$ and $t_{2}$ scaled scores in each bin. As shown in Figure 2, D-ECE and $\widehat{\operatorname{CE}}$ have similar performance and both yield good CE estimates with a few thousand points. The comparison with LaECE can be found in the Appendix.

[Figure 2 plot: x-axis "Number of samples" (0 to 10000); y-axis "Calibration error" (0.060 to 0.140).]
Figure 2: Comparison of D-ECE vs. $\widehat{\operatorname{CE}}$ (threshold link) as a function of the number of points used for the estimation.
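The synthetic setup described above can be sketched in pure Python. Several details are assumptions made for illustration: "temperature scaling" of a number in $[0,1]$ is interpreted as dividing its logit by the temperature, the binned estimator is a generic expected-calibration-error computation with equally spaced bins rather than the implementation from [15], and the names `temperature_scale`, `binned_ece` and `synthetic_experiment` are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def temperature_scale(p, t):
    """Rescale a probability by dividing its logit by the temperature t."""
    p = min(max(p, 1e-9), 1.0 - 1e-9)  # avoid infinite logits at 0 and 1
    return sigmoid(math.log(p / (1.0 - p)) / t)


def binned_ece(scores, labels, n_bins=20):
    """Weighted |mean label - mean score| over equally spaced confidence bins."""
    sums = [[0, 0.0, 0.0] for _ in range(n_bins)]  # count, sum of scores, sum of labels
    for s, y in zip(scores, labels):
        b = min(int(s * n_bins), n_bins - 1)
        sums[b][0] += 1
        sums[b][1] += s
        sums[b][2] += y
    n = len(scores)
    # (count / n) * |mean label - mean score| simplifies to |sum labels - sum scores| / n
    return sum(abs(sy - ss) / n for c, ss, sy in sums if c > 0)


def synthetic_experiment(n, t1=0.6, t2=0.6, seed=0):
    """Return (binned estimate, ground-truth L1 calibration error)."""
    rng = random.Random(seed)
    # Step 1: push uniform draws towards 0 and 1; these are the true probabilities.
    probs = [temperature_scale(rng.random(), t1) for _ in range(n)]
    # Step 2: labels drawn from `probs` are perfectly calibrated with respect to them.
    labels = [1.0 if rng.random() < p else 0.0 for p in probs]
    # Step 3: a second scaling produces the miscalibrated scores that are evaluated.
    scores = [temperature_scale(p, t2) for p in probs]
    true_ce = sum(abs(p - s) for p, s in zip(probs, scores)) / n
    return binned_ece(scores, labels), true_ce
```

Calling `synthetic_experiment(n)` for increasing `n` gives the kind of convergence curve that Figure 2 summarizes: because the second scaling is monotone, the true probability behind each score is known, so the ground-truth error is available for comparison.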

Evaluating a model zoo

We conduct a rigorous analysis to determine the calibration of F-RCNN, RetinaNet and FCOS on popular benchmark datasets like COCO, Cityscapes and Pascal VOC. Analogous to the evaluation in the COCO challenge, we estimate calibration error across different object sizes, and for different thresholds on the IoU overlap required to match a predicted box to a ground truth box. For this evaluation, we use a threshold link function. The results are summarized in Tables 1 and 2. The first thing to note is that modern object detectors are intrinsically miscalibrated. Comparing the architectures, RetinaNet and FCOS have lower calibration errors than F-RCNN. This may be attributed to the use of focal loss for the classification branch of RetinaNet and FCOS, which has been empirically shown to improve calibration. Moreover, different metrics reveal complementary information about the calibration of the detector. For instance, CE$_{50}$ on Pascal VOC shows similar performance of F-RCNN and FCOS, whereas CE$_{75}$ reveals that RetinaNet and FCOS have much better performance than F-RCNN for a calibration error that requires an IoU overlap of 0.75 for a correct detection. Similarly, both RetinaNet and FCOS demonstrate similar calibration errors when detecting small objects. However, FCOS exhibits lower calibration error for medium-sized objects, despite both detectors achieving comparable AP$_{M}$.

Table 1: Detection performance of object detectors on three popular datasets.

| Dataset | Model | AP | AP$_{50}$ | AP$_{75}$ | AP$_{S}$ | AP$_{M}$ | AP$_{L}$ |
|---|---|---|---|---|---|---|---|
| COCO | F-RCNN | $36.11_{\pm 0.10}$ | $53.35_{\pm 0.15}$ | $40.04_{\pm 0.12}$ | $18.88_{\pm 0.10}$ | $39.40_{\pm 0.11}$ | $47.98_{\pm 0.08}$ |
| COCO | RetinaNet | $30.83_{\pm 0.12}$ | $43.33_{\pm 0.22}$ | $34.03_{\pm 0.07}$ | $13.59_{\pm 0.28}$ | $34.13_{\pm 0.13}$ | $43.16_{\pm 0.24}$ |
| COCO | FCOS | $34.02_{\pm 0.06}$ | $48.84_{\pm 0.04}$ | $37.37_{\pm 0.10}$ | $17.29_{\pm 0.17}$ | $37.84_{\pm 0.17}$ | $45.15_{\pm 0.17}$ |
| Cityscapes | F-RCNN | $35.36_{\pm 0.65}$ | $54.46_{\pm 0.82}$ | $37.90_{\pm 0.79}$ | $11.54_{\pm 0.31}$ | $35.34_{\pm 0.51}$ | $56.62_{\pm 0.48}$ |
| Cityscapes | RetinaNet | $34.60_{\pm 0.23}$ | $52.60_{\pm 0.81}$ | $36.41_{\pm 0.35}$ | $10.39_{\pm 0.15}$ | $35.56_{\pm 0.15}$ | $57.47_{\pm 0.39}$ |
| Cityscapes | FCOS | $34.81_{\pm 0.08}$ | $52.31_{\pm 0.29}$ | $36.62_{\pm 0.23}$ | $10.66_{\pm 0.60}$ | $32.84_{\pm 0.44}$ | $56.90_{\pm 0.81}$ |
| Pascal VOC | F-RCNN | $52.96_{\pm 0.05}$ | $75.88_{\pm 0.08}$ | $59.67_{\pm 0.05}$ | $18.48_{\pm 0.39}$ | $40.90_{\pm 0.11}$ | $61.43_{\pm 0.08}$ |
| Pascal VOC | RetinaNet | $53.20_{\pm 0.02}$ | $73.36_{\pm 0.02}$ | $58.75_{\pm 0.11}$ | $16.39_{\pm 0.42}$ | $40.19_{\pm 0.26}$ | $62.06_{\pm 0.02}$ |
| Pascal VOC | FCOS | $52.13_{\pm 0.02}$ | $73.50_{\pm 0.07}$ | $57.63_{\pm 0.04}$ | $19.23_{\pm 0.23}$ | $40.52_{\pm 0.12}$ | $60.28_{\pm 0.06}$ |
Table 2: Calibration performance (CE $\times$ 100) of object detectors on three popular datasets. The calibration error is broken down into variants depending on the IoU threshold and the size of the objects, same as for AP.

| Dataset | Model | $\widehat{\operatorname{CE}}$ | $\widehat{\operatorname{CE}}_{50}$ | $\widehat{\operatorname{CE}}_{75}$ | $\widehat{\operatorname{CE}}_{S}$ | $\widehat{\operatorname{CE}}_{M}$ | $\widehat{\operatorname{CE}}_{L}$ |
|---|---|---|---|---|---|---|---|
| COCO | F-RCNN | $37.33_{\pm 0.09}$ | $20.48_{\pm 0.10}$ | $32.56_{\pm 0.15}$ | $35.72_{\pm 0.51}$ | $38.94_{\pm 0.10}$ | $40.81_{\pm 0.12}$ |
| COCO | RetinaNet | $21.89_{\pm 0.36}$ | $12.03_{\pm 0.19}$ | $14.70_{\pm 0.41}$ | $26.73_{\pm 0.52}$ | $26.04_{\pm 0.13}$ | $27.58_{\pm 0.17}$ |
| COCO | FCOS | $24.40_{\pm 0.13}$ | $16.55_{\pm 0.21}$ | $19.78_{\pm 0.17}$ | $26.34_{\pm 0.39}$ | $26.33_{\pm 0.24}$ | $29.68_{\pm 0.03}$ |
| Cityscapes | F-RCNN | $37.45_{\pm 0.46}$ | $17.00_{\pm 0.46}$ | $32.55_{\pm 0.48}$ | $40.85_{\pm 5.08}$ | $41.15_{\pm 1.57}$ | $38.22_{\pm 0.78}$ |
| Cityscapes | RetinaNet | $32.22_{\pm 0.29}$ | $12.91_{\pm 0.95}$ | $28.38_{\pm 0.74}$ | $29.99_{\pm 0.57}$ | $38.16_{\pm 1.27}$ | $35.91_{\pm 0.46}$ |
| Cityscapes | FCOS | $26.29_{\pm 0.98}$ | $13.91_{\pm 1.28}$ | $21.09_{\pm 1.57}$ | $28.44_{\pm 1.66}$ | $32.71_{\pm 0.78}$ | $28.13_{\pm 0.93}$ |
| Pascal VOC | F-RCNN | $34.88_{\pm 0.10}$ | $17.03_{\pm 0.16}$ | $28.75_{\pm 0.17}$ | $47.67_{\pm 0.91}$ | $41.99_{\pm 0.33}$ | $30.84_{\pm 0.07}$ |
| Pascal VOC | RetinaNet | $24.11_{\pm 0.09}$ | $7.95_{\pm 0.19}$ | $17.89_{\pm 0.14}$ | $36.31_{\pm 1.22}$ | $31.53_{\pm 0.06}$ | $21.73_{\pm 0.11}$ |
| Pascal VOC | FCOS | $22.32_{\pm 0.05}$ | $15.77_{\pm 0.10}$ | $15.55_{\pm 0.11}$ | $36.84_{\pm 0.65}$ | $27.12_{\pm 0.04}$ | $20.97_{\pm 0.02}$ |

Calibration-regularized training

Another notion of calibration that might be of interest is when the link $\psi$ is the identity function. Intuitively, this type of calibration requires that the predicted score corresponds to the IoU overlap with a ground truth box. Using this setup for $\widehat{\operatorname{CE}}$, we performed extensive experiments to compare the performance of our estimator with a post-hoc method (TS), and with a train-time loss (TCD). The temperature is chosen on a validation set by minimizing NLL. Table 3 reports the AP, $\widehat{\operatorname{CE}}$, and D-ECE with IoU overlap of 0.5, for RetinaNet and FCOS, trained on Cityscapes.

Table 3: Comparison of detection (AP) and calibration performance (CE $\times$ 100) of RetinaNet and FCOS models trained on Cityscapes. Models with no calibration, with post-hoc (TS) and train-time (TCD) methods are compared with our approach.

| Model | AP $\uparrow$ | $\widehat{\operatorname{CE}}$ $\downarrow$ | D-ECE$_{50}$ $\downarrow$ |
|---|---|---|---|
| RetinaNet | $34.60_{\pm 0.23}$ | $23.25_{\pm 0.39}$ | $12.54_{\pm 0.66}$ |
| RetinaNet + TS | $34.60_{\pm 0.23}$ | $18.67_{\pm 2.07}$ | $11.05_{\pm 1.25}$ |
| RetinaNet + TCD | $33.94_{\pm 0.81}$ | $25.01_{\pm 1.13}$ | $13.04_{\pm 0.47}$ |
| RetinaNet + $\widehat{\operatorname{CE}}$ | $32.59_{\pm 0.71}$ | $17.33_{\pm 3.15}$ | $11.07_{\pm 0.22}$ |
| FCOS | $34.81_{\pm 0.08}$ | $14.43_{\pm 1.31}$ | $13.23_{\pm 1.18}$ |
| FCOS + TS | $33.62_{\pm 0.44}$ | $13.33_{\pm 2.16}$ | $11.96_{\pm 0.90}$ |
| FCOS + TCD | $35.57_{\pm 0.23}$ | $16.85_{\pm 1.27}$ | $13.68_{\pm 0.86}$ |
| FCOS + $\widehat{\operatorname{CE}}$ | $33.73_{\pm 0.57}$ | $13.07_{\pm 0.83}$ | $17.45_{\pm 0.61}$ |

Fine-tuning

Since a variety of object detectors pre-trained on COCO already exist, we demonstrate that adding our estimator as an auxiliary loss also benefits the fine-tuning of pre-trained models, even for only a few epochs. Table 4 shows the reduction in CE achieved by fine-tuning for three epochs, minimizing the loss defined in Equation (15) on the COCO dataset.

Table 4: Performance (AP and CE $\times$ 100) of models without calibration and fine-tuned models using our auxiliary loss on COCO.
Model | AP ↑ | $\widehat{\operatorname{CE}}$ ↓
F-RCNN | 36.11 ± 0.10 | 31.76 ± 0.05
F-RCNN + $\widehat{\operatorname{CE}}$ | 34.72 ± 0.09 | 26.91 ± 0.14
RetinaNet | 30.83 ± 0.12 | 21.89 ± 0.36
RetinaNet + $\widehat{\operatorname{CE}}$ | 30.10 ± 0.06 | 9.72 ± 0.18
FCOS | 34.02 ± 0.06 | 24.40 ± 0.13
FCOS + $\widehat{\operatorname{CE}}$ | 33.41 ± 0.12 | 15.32 ± 0.17
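
As a rough worked check of the reductions reported in Table 4, the relative change in AP and CE can be computed from the reported means. This is a minimal sketch: the dictionary layout and the `relative_change` helper are illustrative names that do not come from the paper, and the standard deviations are ignored.

```python
# Mean AP and CE (x100) copied from Table 4 as (baseline, fine-tuned) pairs.
# The standard deviations reported in the table are deliberately left out.
table4 = {
    "F-RCNN": {"AP": (36.11, 34.72), "CE": (31.76, 26.91)},
    "RetinaNet": {"AP": (30.83, 30.10), "CE": (21.89, 9.72)},
    "FCOS": {"AP": (34.02, 33.41), "CE": (24.40, 15.32)},
}


def relative_change(before, after):
    """Signed relative change in percent; a negative value is a decrease."""
    return 100.0 * (after - before) / before


summary = {
    model: {metric: relative_change(*pair) for metric, pair in metrics.items()}
    for model, metrics in table4.items()
}

for model, changes in summary.items():
    print(f"{model}: AP {changes['AP']:+.2f}%, CE {changes['CE']:+.2f}%")
```

With these numbers, the largest relative drop in CE is for RetinaNet (about 56%), while AP changes by less than 4% for all three detectors.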

Ablation study

We investigate the effect of the regularization parameter $\lambda$, the test score threshold $\gamma$, and the number of fine-tuning epochs on the reported metrics. Figure 3 shows F-RCNN detectors fine-tuned on Cityscapes for different values of $\lambda$. Increasing the weight of the calibration regularization leads to a noticeable reduction in calibration error. Table 5 reports performance on the test set for different thresholds $\gamma$. From Table 6 we can observe that fine-tuning for even a few epochs can improve calibration error without sacrificing AP. Further experiments on the effect of $\lambda$, for both fine-tuned models and models trained from scratch on the three datasets, can be found in the Appendix.

[Figure 3: scatter plot with axes AP (ticks from 33 to 36) and $\widehat{\operatorname{CE}}$ (ticks from 20 to 28); points are labeled 0 to 5.]
Figure 3: Effect of $\lambda$ on $\operatorname{AP}$ and $\widehat{\operatorname{CE}}$. The points represent Faster-RCNN detectors fine-tuned on Cityscapes for three epochs. The number next to each point denotes the value of $\lambda$.

Table 5: Effect of test score threshold $\gamma$ on detection performance and calibration of Faster-RCNN trained on Cityscapes.
Model | AP | $\widehat{\operatorname{CE}}$
F-RCNN ($\gamma$ = 0.1) | 38.34 ± 0.18 | 22.61 ± 1.11
F-RCNN ($\gamma$ = 0.2) | 37.53 ± 0.26 | 26.26 ± 0.78
F-RCNN ($\gamma$ = 0.3) | 36.92 ± 0.31 | 27.86 ± 0.80
F-RCNN ($\gamma$ = 0.4) | 36.15 ± 0.43 | 28.36 ± 0.83
F-RCNN ($\gamma$ = 0.5) | 35.36 ± 0.65 | 28.48 ± 0.68
F-RCNN ($\gamma$ = 0.6) | 34.31 ± 0.75 | 28.19 ± 0.90
F-RCNN ($\gamma$ = 0.7) | 33.26 ± 0.96 | 27.48 ± 0.99

Table 6: Effect of number of epochs for fine-tuning a Faster-RCNN network on Cityscapes, with $\lambda = 1$ and $\gamma = 0.5$.
Model | AP | $\widehat{\operatorname{CE}}$
F-RCNN | 35.36 ± 0.65 | 28.48 ± 0.68
F-RCNN + $\widehat{\operatorname{CE}}$ (1x) | 34.39 ± 0.19 | 25.88 ± 1.40
F-RCNN + $\widehat{\operatorname{CE}}$ (2x) | 34.45 ± 0.22 | 25.48 ± 0.60
F-RCNN + $\widehat{\operatorname{CE}}$ (3x) | 35.06 ± 0.51 | 25.76 ± 0.85
F-RCNN + $\widehat{\operatorname{CE}}$ (4x) | 35.67 ± 0.41 | 26.72 ± 0.56
F-RCNN + $\widehat{\operatorname{CE}}$ (5x) | 35.80 ± 0.40 | 26.19 ± 1.46

5 Discussion

In this paper, we tackled the challenge of defining calibration and estimating calibration error for object detection. Specifically, calibration error for OD is a quantity that depends on the choice of a measure of “correctness” of a prediction, which we have decomposed into a similarity measure $L$ and a link function $\psi$, which together determine the notion of calibration that is relevant for a particular scenario. Beyond the definition, we also proposed a consistent and differentiable estimator of calibration error for OD, which can be used with common one-stage, two-stage, anchor-based and anchor-free DNN-based detectors.
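
To make the decomposition into a similarity measure and a link function concrete, the sketch below composes one possible choice of each. It assumes intersection over union (IoU) as $L$ and either a threshold or the identity as $\psi$; these choices, the box format `(x1, y1, x2, y2)`, and all function names are illustrative assumptions rather than the paper's implementation, and class-label matching is ignored.

```python
def iou(box_a, box_b):
    """Similarity measure L: intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def threshold_link(tau):
    """Link function psi: binary correctness, 1 if the similarity reaches tau."""
    return lambda similarity: 1.0 if similarity >= tau else 0.0


def identity_link(similarity):
    """Link function psi: use the similarity itself as a graded correctness."""
    return similarity


def correctness(pred_box, gt_box, similarity=iou, link=identity_link):
    """Correctness target psi(L(prediction, ground truth)) for one detection."""
    return link(similarity(pred_box, gt_box))


pred, gt = (0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 5.0)
graded = correctness(pred, gt)                             # IoU itself
loose = correctness(pred, gt, link=threshold_link(0.5))    # counted as correct
strict = correctness(pred, gt, link=threshold_link(0.75))  # counted as incorrect
```

The same detection is therefore "correct" or "incorrect" depending on the link function, which is why the resulting calibration error is tied to the chosen pair rather than being a single universal quantity.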

The empirical results showed that our estimator estimates calibration error as well as existing binning-based estimators, while offering superior statistical properties. Moreover, due to its differentiability, it can be directly included as part of both post-hoc and train-time calibration setups. In particular, when integrated into the task-specific loss during training, our estimator achieves substantial improvements in calibration error, while maintaining similar levels of AP. We have also shown that a reduction in calibration error can be achieved by fine-tuning with our auxiliary loss for a few epochs, thus eliminating the need to train object detectors from scratch.

This work also has some limitations. The main limitation is the intrinsic $\mathcal{O}(n^{2})$ [30] complexity of KDE. When using the estimator of calibration error as an auxiliary loss for one-stage object detectors, which generate a considerable number of candidate detections during training, the complexity becomes particularly challenging. Nevertheless, our synthetic experiments have revealed that the calibration error can be accurately estimated using only a few thousand data points. As a result, during training with RetinaNet and FCOS, we opt to randomly sample a subset of detections and evaluate the calibration error based on this reduced set of data points. Our empirical results verify that this is an effective strategy to conduct calibration-regularized training. Another limitation of this paper is that we did not include a transformer-based model, such as DINO [16], in our experiments. However, the presented mathematical principles are model-agnostic and our estimator can be trivially extended to this family of detectors as well. Finally, it is worth noting that in certain scenarios, calibration-regularized training results in a slight reduction of the AP metric. Exploring the relationship between CE, AP, and the test score threshold, along with investigating the advantage that calibration brings to downstream tasks that utilize object detectors, is an interesting direction for future work.
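
The subsampling strategy described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's estimator: a Gaussian kernel stands in for whichever kernel the paper uses, the conditional correctness is estimated with a leave-one-out kernel-weighted average, and `bandwidth`, `max_points`, and the function names are hypothetical. The pairwise kernel matrix is what makes the cost quadratic in the number of detections.

```python
import numpy as np


def kde_calibration_error(scores, correct, bandwidth=0.05, p=1):
    """Kernel-based estimate of E[|E[correct | score] - score|^p].

    Builds an n x n kernel matrix, hence O(n^2) time and memory.
    """
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=float)
    diffs = (scores[:, None] - scores[None, :]) / bandwidth
    kernel = np.exp(-0.5 * diffs ** 2)   # Gaussian kernel (an assumption)
    np.fill_diagonal(kernel, 0.0)        # leave-one-out: drop self-contribution
    denom = np.maximum(kernel.sum(axis=1), 1e-12)
    cond_mean = kernel @ correct / denom  # estimate of E[correct | score_j]
    return float(np.mean(np.abs(cond_mean - scores) ** p))


def subsampled_calibration_error(scores, correct, max_points=2000, seed=0,
                                 bandwidth=0.05, p=1):
    """Estimate the error on a random subset of at most max_points detections."""
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = scores.shape[0]
    if n > max_points:
        idx = np.random.default_rng(seed).choice(n, size=max_points, replace=False)
        scores, correct = scores[idx], correct[idx]
    return kde_calibration_error(scores, correct, bandwidth=bandwidth, p=p)


# Synthetic check: 20,000 detections, evaluated on 2,000 sampled points.
rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 1.0, size=20_000)
calibrated = (rng.uniform(size=scores.size) < scores).astype(float)
overconfident = (rng.uniform(size=scores.size) < scores ** 2).astype(float)

ce_calibrated = subsampled_calibration_error(scores, calibrated)
ce_overconfident = subsampled_calibration_error(scores, overconfident)
```

On this synthetic data the subsampled estimate is clearly larger for the overconfident scores than for the calibrated ones, while the kernel matrix stays at 2,000 by 2,000 entries instead of 20,000 by 20,000.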

To conclude, our study puts forward a novel and principled view on calibration in OD, and proposes a new mathematical framework, which enables both estimation of calibration error, and calibration regularized training for object detection. Considering the critical role of calibration in enhancing overall system robustness, our method is highly relevant across diverse computer vision applications, and especially in risk-sensitive scenarios.

Acknowledgements

This research received funding from the Flemish Government (AI Research Program) and the Research Foundation - Flanders (FWO) through project number G0G2921N. The publication was also supported by funding from the Academy of Finland (Profi6 336449 funding program). The authors wish to acknowledge CSC – IT Center for Science, Finland, for generous computational resources.

References

  • Ashukha et al. [2020] Arsenii Ashukha, Alexander Lyzhov, Dmitry Molchanov, and Dmitry Vetrov. Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. In International Conference on Learning Representations, 2020.
  • Bröcker [2009] Jochen Bröcker. Reliability, sufficiency, and the decomposition of proper scores. Quarterly Journal of the Royal Meteorological Society, 135(643):1512–1519, Jul 2009.
  • Chen [1999] Song Xi Chen. Beta kernel estimators for density functions. Computational Statistics & Data Analysis, 31:131–145, 1999.
  • Cordts et al. [2016] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • Ding et al. [2020] Yukun Ding, Jinglan Liu, Jinjun Xiong, and Yiyu Shi. Revisiting the evaluation of uncertainty estimation and its application to explore model complexity-uncertainty trade-off. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 22–31, 2020.
  • Dusenberry et al. [2020] Michael W Dusenberry, Dustin Tran, Edward Choi, Jonas Kemp, Jeremy Nixon, Ghassen Jerfel, Katherine Heller, and Andrew M Dai. Analyzing the role of model uncertainty for electronic health records. In Proceedings of the ACM Conference on Health, Inference, and Learning, pages 204–213, 2020.
  • Everingham et al. [2012] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html, 2012.
  • Guo et al. [2017] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321–1330. PMLR, 2017.
  • Harakeh and Waslander [2021] Ali Harakeh and Steven L. Waslander. Estimating and evaluating regression predictive uncertainty in deep object detectors. In International Conference on Learning Representations, 2021.
  • Hebbalaguppe et al. [2022] Ramya Hebbalaguppe, Jatin Prakash, Neelabh Madan, and Chetan Arora. A stitch in time saves nine: A train-time regularizing loss for improved neural network calibration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16081–16090, 2022.
  • Ibragimov and Has’ Minskii [2013] Ildar Abdulovich Ibragimov and Rafail Zalmanovich Has’ Minskii. Statistical estimation: asymptotic theory, volume 16. Springer Science & Business Media, 2013.
  • Karandikar et al. [2021] Archit Karandikar, Nicholas Cain, Dustin Tran, Balaji Lakshminarayanan, Jonathon Shlens, Michael Curtis Mozer, and Rebecca Roelofs. Soft calibration objectives for neural networks. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021.
  • Kumar et al. [2019] Ananya Kumar, Percy S Liang, and Tengyu Ma. Verified uncertainty calibration. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • Kumar et al. [2018] Aviral Kumar, Sunita Sarawagi, and Ujjwal Jain. Trainable calibration measures for neural networks from kernel mean embeddings. In ICML, 2018.
  • Küppers et al. [2020] Fabian Küppers, Jan Kronenberger, Amirhossein Shantia, and Anselm Haselhoff. Multivariate confidence calibration for object detection. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2020.
  • Li et al. [2023] Feng Li, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M Ni, and Heung-Yeung Shum. Mask dino: Towards a unified transformer-based framework for object detection and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3041–3050, 2023.
  • Liang et al. [2020] Gongbo Liang, Yu Zhang, Xiaoqin Wang, and Nathan Jacobs. Improved trainable calibration method for neural networks on medical imaging classification. In British Machine Vision Conference, 2020.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, 2014.
  • Lin et al. [2017] Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42:318–327, 2017.
  • Ma and Blaschko [2021] Xingchen Ma and Matthew B. Blaschko. Meta-cal: Well-controlled post-hoc calibration by ranking. In International Conference on Machine Learning, 2021.
  • Maier-Hein et al. [2022] Lena Maier-Hein, Bjoern Menze, et al. Metrics reloaded: Pitfalls and recommendations for image analysis validation. arXiv.org, 2022.
  • Mukhoti et al. [2020] Jishnu Mukhoti, Viveka Kulharia, Amartya Sanyal, Stuart Golodetz, Philip Torr, and Puneet Dokania. Calibrating deep neural networks using focal loss. Advances in Neural Information Processing Systems, 33:15288–15299, 2020.
  • Müller et al. [2019] Rafael Müller, Simon Kornblith, and Geoffrey E Hinton. When does label smoothing help? Advances in neural information processing systems, 32, 2019.
  • Munir et al. [2022] Muhammad Akhtar Munir, Muhammad Haris Khan, M Sarfraz, and Mohsen Ali. Towards improving calibration in object detection under domain shift. Advances in Neural Information Processing Systems, 35:38706–38718, 2022.
  • Munir et al. [2023] Muhammad Akhtar Munir, Muhammad Haris Khan, Salman Khan, and Fahad Shahbaz Khan. Bridging precision and confidence: A train-time loss for calibrating object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11474–11483, 2023.
  • Oksuz et al. [2018] Kemal Oksuz, Baris Can Cam, Emre Akbas, and Sinan Kalkan. Localization recall precision (lrp): A new performance metric for object detection. In Proceedings of the European conference on computer vision (ECCV), pages 504–519, 2018.
  • Oksuz et al. [2023] Kemal Oksuz, Tom Joy, and Puneet K. Dokania. Towards building self-aware object detectors via reliable uncertainty quantification and calibration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9263–9274, June 2023.
  • Ovadia et al. [2019] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, David Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. Advances in neural information processing systems, 32, 2019.
  • Pathiraja et al. [2023] Bimsara Pathiraja, Malitha Gunawardhana, and Muhammad Haris Khan. Multiclass confidence and localization calibration for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19734–19743, June 2023.
  • Popordanoska et al. [2022] Teodora Popordanoska, Raphael Sayer, and Matthew B. Blaschko. A consistent and differentiable lp canonical calibration error estimator. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
  • Redmon and Farhadi [2018] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
  • Ren et al. [2015] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39:1137–1149, 2015.
  • Silverman [1986] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall, 1986.
  • Tian et al. [2019] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos: Fully convolutional one-stage object detection. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9626–9635, 2019.
  • Vaicenavicius et al. [2019] Juozas Vaicenavicius, David Widmann, Carl Andersson, Fredrik Lindsten, Jacob Roll, and Thomas Schön. Evaluating model calibration in classification. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3459–3467. PMLR, 2019.
  • Widmann et al. [2019] David Widmann, Fredrik Lindsten, and Dave Zachariah. Calibration tests in multi-class classification: A unifying framework. Advances in Neural Information Processing Systems, 32, 2019.
  • Wied and Weißbach [2010] Dominik Wied and Rafael Weißbach. Consistency of the kernel density estimator - a survey. Statistical Papers, 53(1):1–21, 2010.
  • Wu et al. [2019] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2, 2019.
  • Yurtsever et al. [2020] Ekim Yurtsever, Jacob Lambert, Alexander Carballo, and Kazuya Takeda. A survey of autonomous driving: Common practices and emerging technologies. IEEE access, 8:58443–58469, 2020.
  • Zadrozny and Elkan [2002] B. Zadrozny and C. Elkan. Transforming classifier scores into accurate multiclass probability estimates. Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, 2002.
  • Zadrozny and Elkan [2001] Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. ICML, 1, 05 2001.
  • Zhang et al. [2020] Jize Zhang, Bhavya Kailkhura, and T. Yong-Jin Han. Mix-n-match: Ensemble and compositional methods for uncertainty calibration in deep learning. In International Conference on Machine Learning, 2020.