EF-Calib: Spatiotemporal Calibration of
Event- and Frame-Based Cameras
Using Continuous-Time Trajectories

Shaoan Wang, Zhanhua Xin, Yaoqing Hu, Dongyue Li, Mingzhu Zhu, and Junzhi Yu

This work was supported in part by the National Natural Science Foundation of China under Grant T2121002 and Grant 62233001, and in part by the Beijing Natural Science Foundation under Grant 2022MQ05. (Corresponding author: Junzhi Yu.) Shaoan Wang, Zhanhua Xin, Yaoqing Hu, Dongyue Li, and Junzhi Yu are with the State Key Laboratory for Turbulence and Complex Systems, Department of Advanced Manufacturing and Robotics, College of Engineering, Peking University, Beijing 100871, China. Mingzhu Zhu is with the Department of Mechanical Engineering, Fuzhou University, Fuzhou 350000, China.

Abstract

The event camera, a bio-inspired asynchronously triggered camera, offers promising prospects for fusion with frame-based cameras owing to its low latency and high dynamic range. However, calibrating stereo vision systems that incorporate both event- and frame-based cameras remains a significant challenge. In this letter, we present EF-Calib, a spatiotemporal calibration framework for event- and frame-based cameras using continuous-time trajectories. A novel calibration pattern applicable to both camera types is proposed, together with a corresponding event-based recognition algorithm. Leveraging the asynchronous nature of events, a differentiable piecewise B-spline is introduced to represent the camera pose continuously, enabling calibration of intrinsic parameters, extrinsic parameters, and time offset, with analytical Jacobians provided. Various experiments are carried out to evaluate the calibration performance of EF-Calib, covering intrinsic parameters, extrinsic parameters, and time offset. Experimental results show that EF-Calib estimates intrinsic parameters more accurately than current state-of-the-art methods, achieves extrinsic-parameter accuracy close to that of frame-based calibration, and estimates the time offset accurately. EF-Calib provides a convenient and accurate toolbox for calibrating systems that fuse events and frames. The code of this paper will also be open-sourced at: https://github.com/wsakobe/EF-Calib.

Index Terms:
Event camera, spatiotemporal calibration, continuous-time trajectory, time offset estimation.

I Introduction

In recent years, there has been a growing interest among researchers in a novel bio-inspired camera called the event camera [1]. Abandoning the frame-triggered concept of conventional cameras, each pixel of the event camera responds independently and asynchronously to changes in illumination, resulting in an extremely low-latency and high-dynamic-range response pattern. These advantages offer competitive prospects for event cameras in areas such as robotics [2], autonomous driving [3], VR/AR [4], and camera imaging [5].

Figure 1: Overview diagram of EF-Calib. (a) The novel calibration pattern consists of the concentric circle and crosspoint. (b) The stereo vision system consists of an event camera and a frame-based camera. (c) The calibration process of EF-Calib.

However, due to the imaging principle of event cameras, they can only react to changes in illumination, making it difficult to capture absolute amounts of illumination and RGB values as frame-based cameras do. This limitation weakens the ability of event cameras to perceive and understand the environment. Therefore, many recent studies have attempted to fuse events with images in order to fully utilize the unique advantages of both modalities, as illustrated in Fig. 1. Some novel SLAM systems achieve more robust localization under fast motion by fusing events and frames [2, 8, 7, 6]. In addition, several studies are exploring how to fuse events and frames for object detection in challenging environments [3, 9]. In recent years, some event-centric datasets with multiple sensors, including frame-based cameras, have also been widely proposed [12, 11, 10].

It is important to note that calibrating the intrinsic and extrinsic parameters of each camera is an indispensable step in the context of multi-camera fusion. Classical camera-to-camera calibration schemes typically require time synchronization, followed by the acquisition of each camera’s parameters through the synchronous acquisition of images of the calibration board with different poses [13]. However, due to the asynchronous nature of the event camera, it is difficult to combine multiple events into single “frames” and time-synchronize them with an image. In addition, events are only generated if there is relative motion, so the event camera cannot capture a stationary calibration board. In summary, a new calibration framework must be designed for the system including event- and frame-based cameras.

To address the aforementioned issues, this letter proposes a novel spatiotemporal calibration framework for event- and frame-based cameras. To the best of our knowledge, EF-Calib is the first calibration framework to achieve joint calibration of event- and frame-based cameras without requiring any time synchronization. The main contributions of this paper are as follows:

  1. A novel spatiotemporal calibration framework for event- and frame-based cameras is proposed. This framework can jointly obtain the intrinsic and extrinsic parameters, as well as the time offset, without requiring any hardware synchronization.

  2. Leveraging the asynchronous and low-latency properties of the event camera, the framework introduces a continuous-time trajectory to optimize its motion trajectory, facilitating arbitrary timestamp alignment with the frame-based camera.

  3. Extensive experiments are conducted in diverse scenarios to validate the proposed calibration framework. The results indicate that the framework achieves accuracy close to that of frame-based camera calibration methods and consistently calibrates the time offset between the cameras.

The rest of the letter is organized as follows. Sec. II summarizes the related works. Sec. III presents the preliminaries of event-based vision and continuous-time trajectory. Sec. IV introduces the calibration framework. Sec. V evaluates the calibration performance from different aspects. Finally, Sec. VI concludes this letter.

II Related Works

For geometric vision, camera calibration is particularly crucial as it serves as the initial step in processing the input image signal, with the quality of calibration often dictating the performance of subsequent tasks. Traditional camera calibration has undergone significant evolution. The most prevalent calibration method today involves capturing images of a calibration pattern with a known size, such as a checkerboard, from various viewpoints to identify corresponding feature points [13]. Subsequently, the intrinsic and distortion parameters of each camera, as well as the extrinsic parameters between cameras, are automatically calculated.

Nevertheless, applying this static and discrete calibration method to event cameras, which are triggered by changes in illumination or relative motion, presents challenges. Initially, many open-source event camera calibration toolkits utilized a synchronized blinking LED calibration board with a known size or a blinking checkerboard pattern generated by an LED screen [14, 15, 16]. This allowed the event camera to identify features similar to a conventional camera and utilize traditional calibration methods. However, these toolkits require complex device preparation and are unsuitable for calibrating extrinsic parameters between event- and frame-based cameras.

In recent years, there has been increased focus on designing new calibration frameworks to facilitate event camera calibration using existing calibration boards. Muglikar et al. [17] utilize deep learning-based image reconstruction networks, such as E2VID [18], to reconstruct images from the events generated by moving the calibration board and then apply the reconstructed images to classical calibration methods. However, these methods rely heavily on the quality of image reconstruction and face challenges in achieving time synchronization with conventional cameras. Another approach involves directly utilizing events generated during camera motion for camera calibration. Huang et al. [19] proposed a calibration framework based on a circular calibration board and employed B-splines to optimize the movement trajectory, which is the most similar method to the one proposed in this letter. However, they directly use clustered asynchronous events as features for optimization, which compromises sub-pixel accuracy and makes the method highly sensitive to noise. Additionally, their focus is solely on the intrinsic calibration of event cameras, without addressing extrinsic calibration between event cameras and frame-based cameras. Salah et al. [20] also utilize circular calibration boards and introduce eRWLS to fit circular features with sub-pixel accuracy. However, they compress events over a period into a fixed timestamp to obtain a reference “frame”, making this method challenging to synchronize with a conventional camera. Furthermore, it does not account for the deformation of circular features at different viewing angles, leading to reduced sub-pixel localization accuracy. The calibration framework proposed in this letter continues this line of work and addresses the problems of these methods.

III Preliminaries

III-A Event-Based Vision

Unlike conventional cameras, each pixel of the event camera is independently triggered and responds to changes in the logarithmic illumination signal $L(\mathbf{u}_k,t_k)$. An event $(\mathbf{u}_k,t_k,p_k)$ is triggered when the change in logarithmic illumination received by a pixel $\mathbf{u}_k=(x_k,y_k)$ exceeds a threshold value $C$, i.e.

$$\Delta L(\mathbf{u}_k,t_k)\doteq L(\mathbf{u}_k,t_k)-L(\mathbf{u}_k,t_k-\Delta t)=p_k C \tag{1}$$

where $\Delta t$ is the time since the last triggered event by the same pixel, and $p_k\in\{-1,+1\}$ is the polarity of the event.
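As a rough illustration of Eq. (1), the following sketch simulates a single pixel of an idealized, noise-free event camera. The function name `generate_events`, the sampled intensity sequence, and the threshold value `C = 0.2` are illustrative assumptions and are not part of EF-Calib.

```python
import math

def generate_events(times, intensities, C=0.2):
    """Emit (timestamp, polarity) events for one pixel following Eq. (1).

    Assumption: the intensity is sampled at discrete times and the sensor
    is noise-free, with the same threshold C for both polarities.
    """
    events = []
    ref = math.log(intensities[0])  # log illumination at the last event
    for t, intensity in zip(times[1:], intensities[1:]):
        delta = math.log(intensity) - ref
        # A large change may cross the threshold several times.
        while abs(delta) >= C:
            p = 1 if delta > 0 else -1
            events.append((t, p))
            ref += p * C
            delta = math.log(intensity) - ref
    return events

times = [0.0, 1.0, 2.0, 3.0]
intensities = [1.0, math.exp(0.5), math.exp(0.5), math.exp(0.05)]
events = generate_events(times, intensities, C=0.2)
```

In this toy sequence the brightening step produces two positive events, the constant interval produces none, and the darkening step produces one negative event, which matches the observation that a static scene generates no events.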

III-B Continuous-Time Trajectory Representation

Continuous-time trajectories are often represented as a weighted combination of temporal basis functions [21], such as polynomial functions, FFTs, and Bézier curves. In this letter, the uniform B-spline is introduced as the representation of the continuous-time trajectory. B-splines offer smoothness, local support, and analytic derivatives, which makes them well suited to representing the 6-DoF pose of the event camera [23, 22, 24]. Following the formulation of the cumulative $k$th-degree B-spline $\mathcal{L}$, the event camera pose $\mathbf{T}^{w}_{e}(\tau)\in\mathbb{SE}(3)$ at any time $\tau\in[t_i,t_{i+1})$ can be represented by $N$ control points $\mathbf{T}_i\in\mathbb{SE}(3)$, $i\in[0,1,\ldots,N-1]$:

$$\mathcal{L}:\ \mathbf{T}_e^w(\tau)=\mathbf{T}_i\cdot\prod_{j=1}^{k}\mathrm{Exp}\left(\tilde{\mathbf{B}}_j(\tau)\cdot\mathrm{Log}\left(\mathbf{T}_{i+j-1}^{-1}\mathbf{T}_{i+j}\right)\right) \tag{2}$$

where $\tilde{\mathbf{B}}_j(\tau)$ is the cumulative basis function, i.e., the $j$th entry (counting from zero) of the vector

$$\tilde{\mathbf{B}}(\tau)=\tilde{\mathbf{M}}^{(k)}\mathbf{u} \tag{3}$$

$$\mathbf{u}=\begin{bmatrix}1&u&\cdots&u^{k}\end{bmatrix}^{T},\quad u=\frac{\tau-t_i}{t_{i+1}-t_i} \tag{4}$$

where $\tilde{\mathbf{M}}^{(k)}\in\mathbb{R}^{(k+1)\times(k+1)}$ is the cumulative blending matrix of B-splines. Since the control points of the B-splines are uniformly distributed on the time scale, the cumulative blending matrix $\tilde{\mathbf{M}}^{(k)}$ is constant. In this letter, balancing continuity against complexity, we use cubic B-splines to represent the camera pose, i.e., $k=3$. The corresponding cumulative blending matrix $\tilde{\mathbf{M}}^{(3)}$ [25] is

$$\tilde{\mathbf{M}}^{(3)}=\dfrac{1}{6}\begin{bmatrix}6&0&0&0\\5&3&-3&1\\1&3&3&-2\\0&0&0&1\end{bmatrix} \tag{5}$$
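To make Eqs. (3)-(5) concrete, the sketch below evaluates the four cumulative basis values of a cubic uniform B-spline at a normalized time $u$. The function names `normalized_time` and `cumulative_basis` are illustrative choices, not identifiers from the EF-Calib code.

```python
# Cumulative blending matrix of Eq. (5), stored without the 1/6 factor.
M_TILDE_3 = [
    [6, 0, 0, 0],
    [5, 3, -3, 1],
    [1, 3, 3, -2],
    [0, 0, 0, 1],
]

def normalized_time(tau, t_i, dt):
    """Eq. (4): map tau in [t_i, t_i + dt) to u in [0, 1)."""
    return (tau - t_i) / dt

def cumulative_basis(u):
    """Eq. (3) for k = 3: return [B~_0, B~_1, B~_2, B~_3] at u."""
    powers = [1.0, u, u * u, u ** 3]
    return [sum(m * p for m, p in zip(row, powers)) / 6.0 for row in M_TILDE_3]

basis_start = cumulative_basis(0.0)  # [1, 5/6, 1/6, 0]
basis_end = cumulative_basis(1.0)    # [1, 1, 5/6, 1/6]
```

The values at $u=1$ are the values at $u=0$ shifted by one index, which is what keeps the spline continuous when the active segment advances from $i$ to $i+1$.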

Commonly, to simplify the computation, some works decouple the rotation $\mathbf{R}_e^w(\tau)\in\mathbb{SO}(3)$ and translation $\mathbf{p}_e^w(\tau)\in\mathbb{R}^{3}$ into two independent cubic B-splines, and the same approach is adopted in this letter. Hence, the continuous-time trajectory of the camera pose can be finally formulated as

$$\mathbf{R}_e^w(\tau)=\mathbf{R}_i\cdot\prod_{j=1}^{3}\mathrm{Exp}\left(\tilde{\mathbf{B}}_j(\tau)\cdot\mathrm{Log}\left(\mathbf{R}_{i+j-1}^{-1}\mathbf{R}_{i+j}\right)\right) \tag{6}$$

$$\mathbf{p}_e^w(\tau)=\mathbf{p}_i+\sum_{j=1}^{3}\tilde{\mathbf{B}}_j(\tau)\cdot(\mathbf{p}_{i+j}-\mathbf{p}_{i+j-1}) \tag{7}$$
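A minimal sketch of how the rotation spline of Eq. (6) could be evaluated is given below, using plain 3x3 nested lists. All helper names (`so3_exp`, `so3_log`, `rotation_spline`, `rot_z`) are assumptions made for illustration; a real implementation would more likely rely on a Lie-group library.

```python
import math

M_TILDE_3 = [[6, 0, 0, 0], [5, 3, -3, 1], [1, 3, 3, -2], [0, 0, 0, 1]]

def cumulative_basis(u):
    powers = [1.0, u, u * u, u ** 3]
    return [sum(m * p for m, p in zip(row, powers)) / 6.0 for row in M_TILDE_3]

def matmul(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]

def transpose(A):
    return [[A[c][r] for c in range(3)] for r in range(3)]

def so3_exp(w):
    """Rodrigues formula: rotation vector -> rotation matrix."""
    theta = math.sqrt(sum(x * x for x in w))
    K = [[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]]
    if theta < 1e-12:
        a, b = 1.0, 0.5
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
    K2 = matmul(K, K)
    return [[(1.0 if r == c else 0.0) + a * K[r][c] + b * K2[r][c]
             for c in range(3)] for r in range(3)]

def so3_log(R):
    """Rotation matrix -> rotation vector (valid for angles below pi)."""
    cos_theta = max(-1.0, min(1.0, (R[0][0] + R[1][1] + R[2][2] - 1.0) / 2.0))
    theta = math.acos(cos_theta)
    v = [R[2][1] - R[1][2], R[0][2] - R[2][0], R[1][0] - R[0][1]]
    scale = 0.5 if theta < 1e-12 else theta / (2.0 * math.sin(theta))
    return [scale * x for x in v]

def rotation_spline(R_ctrl, u):
    """Eq. (6): R_ctrl holds the four control rotations R_i .. R_{i+3}."""
    B = cumulative_basis(u)
    R = R_ctrl[0]
    for j in range(1, 4):
        d = so3_log(matmul(transpose(R_ctrl[j - 1]), R_ctrl[j]))
        R = matmul(R, so3_exp([B[j] * x for x in d]))
    return R

def rot_z(a):
    return [[math.cos(a), -math.sin(a), 0.0],
            [math.sin(a), math.cos(a), 0.0],
            [0.0, 0.0, 1.0]]

# Control rotations about a single axis, so the result is easy to verify.
ctrl = [rot_z(a) for a in (0.0, 0.1, 0.3, 0.4)]
R_mid = rotation_spline(ctrl, 0.5)
angle_mid = math.atan2(R_mid[1][0], R_mid[0][0])
```

For rotations about a common axis the increments commute, so the interpolated angle reduces to the scalar form of Eq. (7) applied to the control angles; at $u=0.5$ this gives 0.2 rad for the control angles above.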

After decoupling the pose into two cubic B-splines, the corresponding analytic derivatives [26] can also be derived:

$$\begin{aligned}\dot{\mathbf{R}}_e^w(\tau)&=\mathbf{R}_e^w(\tau)\cdot\left(\boldsymbol{\omega}^{(3)}(\tau)\right)_{\wedge}\\&=\mathbf{R}_i\left(\dot{\mathbf{A}}_1\mathbf{A}_2\mathbf{A}_3+\mathbf{A}_1\dot{\mathbf{A}}_2\mathbf{A}_3+\mathbf{A}_1\mathbf{A}_2\dot{\mathbf{A}}_3\right)\end{aligned} \tag{8}$$

$$\mathbf{v}_e^w(\tau)=\dot{\mathbf{p}}_e^w(\tau)=\sum_{j=1}^{3}\dot{\tilde{\mathbf{B}}}_j(\tau)\cdot(\mathbf{p}_{i+j}-\mathbf{p}_{i+j-1}) \tag{9}$$

where

$$\mathbf{A}_j=\mathrm{Exp}\left(\tilde{\mathbf{B}}_j(\tau)\cdot\mathrm{Log}\left(\mathbf{R}_{i+j-1}^{-1}\mathbf{R}_{i+j}\right)\right) \tag{10}$$

$$\dot{\mathbf{A}}_j=\mathbf{A}_j\,\dot{\tilde{\mathbf{B}}}_j(\tau)\left(\mathrm{Log}\left(\mathbf{R}_{i+j-1}^{-1}\mathbf{R}_{i+j}\right)\right)_{\wedge} \tag{11}$$

$$\dot{\tilde{\mathbf{B}}}(\tau)=\dfrac{1}{\Delta t}\tilde{\mathbf{M}}^{(3)}\begin{bmatrix}0\\1\\2u\\3u^{2}\end{bmatrix} \tag{12}$$

Here $\dot{\tilde{\mathbf{B}}}_j(\tau)$ is the $j$th entry (counting from zero) of $\dot{\tilde{\mathbf{B}}}(\tau)$, and $(\cdot)_{\wedge}$ maps a vector in $\mathbb{R}^{3}$ to the corresponding skew-symmetric matrix. Since this letter utilizes the uniform B-spline, $\Delta t$ is equal to the time interval between any two consecutive knots, i.e. $\Delta t=t_{i+1}-t_i$.
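The translational part of these derivatives can be checked numerically. Differentiating Eq. (7) term by term, the constant $\mathbf{p}_i$ drops out and only the weighted control-point differences remain. The sketch below evaluates position and velocity for one spline segment and compares the analytic velocity with a central finite difference; the function names, control points, and knot spacing `dt = 0.1` are made-up illustration values.

```python
M_TILDE_3 = [[6, 0, 0, 0], [5, 3, -3, 1], [1, 3, 3, -2], [0, 0, 0, 1]]

def _blend(powers):
    return [sum(m * p for m, p in zip(row, powers)) / 6.0 for row in M_TILDE_3]

def cumulative_basis(u):
    return _blend([1.0, u, u * u, u ** 3])

def cumulative_basis_dot(u, dt):
    """Eq. (12): time derivative of the cumulative basis, knot spacing dt."""
    return [b / dt for b in _blend([0.0, 1.0, 2.0 * u, 3.0 * u * u])]

def position(p_ctrl, u):
    """Eq. (7) with control points p_i .. p_{i+3}."""
    B = cumulative_basis(u)
    return [p_ctrl[0][a] + sum(B[j] * (p_ctrl[j][a] - p_ctrl[j - 1][a])
                               for j in range(1, 4)) for a in range(3)]

def velocity(p_ctrl, u, dt):
    """Time derivative of Eq. (7); the constant p_i term vanishes."""
    Bd = cumulative_basis_dot(u, dt)
    return [sum(Bd[j] * (p_ctrl[j][a] - p_ctrl[j - 1][a])
                for j in range(1, 4)) for a in range(3)]

p_ctrl = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 1.0, 0.0], [3.0, 3.0, 0.0]]
dt, u, h = 0.1, 0.3, 1e-5
v = velocity(p_ctrl, u, dt)
# Central difference in u, converted to a derivative in time via du/dtau = 1/dt.
v_fd = [(a - b) / (2 * h * dt)
        for a, b in zip(position(p_ctrl, u + h), position(p_ctrl, u - h))]
```

Because the x-coordinates of the control points are equally spaced, the x-velocity is constant and equal to one control-point step per knot interval, i.e. `1 / dt`.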

Figure 2: Flowchart of the proposed calibration framework

IV Methodology

IV-A Calibration Framework

For camera calibration, it is of utmost importance to identify features on the calibration pattern accurately and robustly. However, the widely used checkerboard pattern [27] is difficult to apply to event cameras, because an edge that moves parallel to its own direction triggers no events. Consequently, calibration patterns for event cameras often employ circular features, but these blur in frames captured during motion. To balance the characteristics of event cameras and frame-based cameras, we design a new calibration pattern that combines isotropic circles and checkerboard crosspoints, as shown in Fig. 1(a). The center of each circle in this pattern coincides with the center of the inner crosspoint. This hybrid pattern significantly enhances the recognition efficiency and accuracy of the event camera while maintaining compatibility with frame-based cameras.

Fig. 2 presents the flowchart of the proposed calibration framework. In this letter, we divide the entire calibration process into two stages. The first stage focuses on feature extraction and refinement of the calibration pattern. The second stage is dedicated to optimizing the camera trajectory using piecewise B-splines to achieve accurate calibration results. These two stages will be elaborated in the following subsections.

IV-B Event-Based Feature Recognizer

Unlike frame-based cameras, event cameras only output asynchronous events during relative motion, which poses a challenge for robust calibration pattern recognition. To address this, we propose an event-based calibration pattern feature recognizer, illustrated in Fig. 3. First, we accumulate events over a short period of time $\Delta t$ according to their polarity to obtain “accumulation frames” that resemble traditional images. These “accumulation frames” allow the calibration board to be recognized using classical image processing algorithms. The following subsections describe the recognition algorithm based on “accumulation frames” in detail.
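One simple way to build such an “accumulation frame” is sketched below. The event tuple layout `(x, y, t, p)`, the function name `accumulate_events`, and the choice of summing signed polarities per pixel are assumptions made for illustration; the text only states that events are accumulated according to their polarity.

```python
def accumulate_events(events, t_start, dt, width, height):
    """Sum event polarities per pixel over the window [t_start, t_start + dt).

    events: iterable of (x, y, t, p) with integer pixel coordinates and
    polarity p in {-1, +1}. Returns a height x width nested list.
    """
    frame = [[0] * width for _ in range(height)]
    for x, y, t, p in events:
        inside_window = t_start <= t < t_start + dt
        inside_image = 0 <= x < width and 0 <= y < height
        if inside_window and inside_image:
            frame[y][x] += p
    return frame

events = [
    (1, 0, 0.001, +1),
    (1, 0, 0.002, +1),
    (2, 1, 0.003, -1),
    (0, 0, 0.020, +1),  # outside the 10 ms window, ignored
]
frame = accumulate_events(events, t_start=0.0, dt=0.01, width=3, height=2)
```

Positive and negative pixels of the resulting frame can then be treated as two polarity masks for the connected-region analysis described next.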

Figure 3: The pipeline of event-based feature recognizer.

IV-B1 Noise Suppression

Event cameras often generate a large number of events from the static background when they are in motion. These noisy events can adversely affect the recognition of calibration boards, significantly reducing the operation speed and accuracy of the recognizer. Therefore, after obtaining the “accumulation frames”, a noise suppression module is designed to filter out most of the events that are not related to the calibration board.

For circular features, the triggered events typically form two semicircular arc-shaped connected regions of opposite polarity. In contrast, many structures in the background have straight edges, so their connected regions tend to resemble straight lines. To leverage this property, we adopt BBDT, a fast and accurate connected component labeling (CCL) algorithm proposed by Grana et al. [28]. This algorithm merges neighboring events with the same polarity to obtain all connected regions. Then, the magnitudes of the two principal components of each connected region are calculated using PCA. For background-triggered connected regions, the magnitude of the second principal component $\|\mathbf{PC}_2\|$ should be much smaller than the magnitude of the first principal component $\|\mathbf{PC}_1\|$. A large number of noisy regions can therefore be suppressed using the principal component magnitude ratio $\beta_{PC}=\|\mathbf{PC}_1\|/\|\mathbf{PC}_2\|$: a region is retained only if $\beta_{PC}<T_{PC}$, where $T_{PC}$ is a threshold, and any region with $\beta_{PC}$ higher than $T_{PC}$ is suppressed and not involved in subsequent operations.
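The ratio test can be sketched as follows for a single connected region. Here the principal component magnitudes are taken as the square roots of the eigenvalues of the 2x2 pixel covariance matrix, and the threshold `T_PC = 5.0` is an arbitrary illustration value; both are assumptions, since the text does not fix these details.

```python
import math

def pc_magnitude_ratio(pixels):
    """Return ||PC1|| / ||PC2|| for the pixel coordinates of one region."""
    n = len(pixels)
    mean_x = sum(x for x, _ in pixels) / n
    mean_y = sum(y for _, y in pixels) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pixels) / n
    syy = sum((y - mean_y) ** 2 for _, y in pixels) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pixels) / n
    # Closed-form eigenvalues of the symmetric 2x2 covariance matrix.
    half_trace = 0.5 * (sxx + syy)
    root = math.sqrt((0.5 * (sxx - syy)) ** 2 + sxy ** 2)
    lam1 = half_trace + root
    lam2 = max(half_trace - root, 0.0)
    if lam2 == 0.0:
        return math.inf  # perfectly straight region
    return math.sqrt(lam1 / lam2)

def keep_region(pixels, T_PC=5.0):
    """A region survives noise suppression only if beta_PC < T_PC."""
    return pc_magnitude_ratio(pixels) < T_PC

# A straight edge (background-like) and a semicircular arc (feature-like).
line = [(float(i), 0.0) for i in range(30)]
arc = [(10 * math.cos(math.pi * k / 30), 10 * math.sin(math.pi * k / 30))
       for k in range(31)]
```

With these toy inputs the straight edge is rejected, while the semicircular arc has a ratio of roughly 2.2 and is kept.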

IV-B2 Feature Extraction

Following noise suppression, we proceed to extract potential circular features from the remaining regions. Specifically, we pair candidate regions of opposite polarity based on their distance. Subsequently, we fit an ellipse equation to all pixels contained within each pair of regions, exploiting the fact that circular features follow an elliptical model under a projective transform. The fitting error $e_{fit}$ is then calculated, and candidate pairs whose fitting error exceeds the fitting threshold $T_{fit}$ are excluded.
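The fitting step can be illustrated with a plain least-squares conic fit. This is a sketch under stated assumptions rather than the fitting method used by EF-Calib: the conic is normalized so that its constant term is fixed (which requires that the curve does not pass through the origin), and $e_{fit}$ is taken to be the mean absolute algebraic residual. All function names are hypothetical.

```python
import math

def solve_linear(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_conic(points):
    """Least-squares fit of a*x^2 + b*x*y + c*y^2 + d*x + e*y = 1."""
    rows = [[x * x, x * y, y * y, x, y] for x, y in points]
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(5)] for i in range(5)]
    Atb = [sum(r[i] for r in rows) for i in range(5)]
    return solve_linear(AtA, Atb)

def fitting_error(coeffs, points):
    """Mean absolute algebraic residual, one possible choice of e_fit."""
    a, b, c, d, e = coeffs
    return sum(abs(a * x * x + b * x * y + c * y * y + d * x + e * y - 1.0)
               for x, y in points) / len(points)

def conic_center(coeffs):
    """Center of the fitted conic, where its gradient vanishes."""
    a, b, c, d, e = coeffs
    det = 4 * a * c - b * b
    return ((b * e - 2 * c * d) / det, (b * d - 2 * a * e) / det)

# Synthetic pixels on an ellipse centered at (5, 3) with semi-axes 4 and 2.
points = [(5 + 4 * math.cos(2 * math.pi * k / 40),
           3 + 2 * math.sin(2 * math.pi * k / 40)) for k in range(40)]
coeffs = fit_conic(points)
e_fit = fitting_error(coeffs, points)
center = conic_center(coeffs)
```

A candidate pair would then be rejected when `e_fit` exceeds a chosen threshold `T_fit`, whose value has to be tuned for the sensor resolution and noise level.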

Furthermore, additional geometric constraints are needed to eliminate the remaining false positive regions. Specifically, the two candidate regions constituting the same ellipse should have similar PCA magnitudes; pairs with a notable discrepancy in PCA magnitude are rejected. Additionally, the two candidate regions should contribute similar portions of the ellipse circumference. In other words, the angular range $\theta_r$ of each of the two candidate regions with respect to the center of the ellipse should be close to 180°.
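The angular-range constraint can be sketched as follows: the angular range of a region is measured as the full circle minus the largest angular gap between its pixels, as seen from the ellipse center. The function names and the tolerance `tol_deg = 20.0` are illustrative assumptions.

```python
import math

def angular_range_deg(pixels, center):
    """Angle (in degrees) covered by a region as seen from the center."""
    cx, cy = center
    angles = sorted(math.degrees(math.atan2(y - cy, x - cx)) % 360.0
                    for x, y in pixels)
    gaps = [b - a for a, b in zip(angles, angles[1:])]
    gaps.append(angles[0] + 360.0 - angles[-1])  # wrap-around gap
    return 360.0 - max(gaps)

def is_half_ellipse(pixels, center, tol_deg=20.0):
    """Check that the region covers about half of the circumference."""
    return abs(angular_range_deg(pixels, center) - 180.0) < tol_deg

half_arc = [(10 * math.cos(math.pi * k / 30), 10 * math.sin(math.pi * k / 30))
            for k in range(31)]
quarter_arc = [(10 * math.cos(math.pi * k / 30), 10 * math.sin(math.pi * k / 30))
               for k in range(16)]
range_half = angular_range_deg(half_arc, (0.0, 0.0))
range_quarter = angular_range_deg(quarter_arc, (0.0, 0.0))
```

A pair of opposite-polarity regions would pass this check only if both regions satisfy `is_half_ellipse` with respect to the center of their jointly fitted ellipse.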

Regions that successfully meet these geometric constraints are recognized as accurately representing the elliptical features within the calibration pattern. Consequently, depending on the distribution of these elliptical features, their relative positions on the calibration pattern can be decoded to correspond with the results identified in the frame-based camera.

Figure 4: Schematic of the moving ellipse model. The set of events belonging to the same elliptical feature can be considered as a three-dimensional oblique elliptical cylinder.

IV-B3 Feature Refinement

Neglecting the timestamps of events during elliptical fitting inevitably introduces errors, thereby affecting the camera calibration precision. However, by regarding the timestamps t𝑡titalic_t as a third dimension alongside the pixel coordinates [x,y]Tsuperscript𝑥𝑦𝑇[x,y]^{T}[ italic_x , italic_y ] start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT, events can be conceptualized as points within a three-dimensional space. Each “accumulation frames”, corresponding to a brief time interval, allows for the assumption of solely translational movement with speed [vx,vy]Tsuperscriptsubscript𝑣𝑥subscript𝑣𝑦𝑇[v_{x},v_{y}]^{T}[ italic_v start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ] start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT, parallel to the pixel plane, for each elliptical feature at any given moment within this interval, as demonstrated in Fig. 4. Consequently, the moving ellipse model \mathcal{F}caligraphic_F is described by the following representation:

$$\mathcal{F}: \alpha x(t)^2 + \beta x(t) y(t) + \gamma y(t)^2 + \delta x(t) + \epsilon y(t) + \zeta = 0 \tag{13}$$

In matrix form,

$$\begin{cases} \mathcal{F}: \mathbf{P}(t)^T \mathbf{Q} \mathbf{P}(t) = 0 \\ \mathbf{P}(t) = \mathbf{P}(t_0) - \mathbf{V} \cdot (t - t_0) \\ \mathbf{Q} = \begin{pmatrix} \alpha & \beta/2 & \delta/2 \\ \beta/2 & \gamma & \epsilon/2 \\ \delta/2 & \epsilon/2 & \zeta \end{pmatrix} \end{cases} \tag{14}$$

where $\mathbf{P}(t) = [x(t), y(t), 1]^T$, $\mathbf{V} = [v_x, v_y]^T$, $t \in [t_0, t_0 + 2\delta_t]$, and $t_0$ represents the starting time of the current “accumulation frame”. The cost function for model optimization is defined as

$$\arg\min_{\mathcal{F}, \mathbf{V}} \sum_{i \in \{e\}} \left( \left\| \mathbf{P}_i^T \mathbf{Q} \mathbf{P}_i \right\|^{1} \right) \tag{15}$$

Substituting the events into this model, we solve for $\mathcal{F}$ and $\mathbf{V}$ with the Levenberg-Marquardt algorithm, thereby refining the elliptical features.
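To make the cost in (15) concrete, the following minimal sketch evaluates it for a given conic and velocity. It does not perform the Levenberg-Marquardt optimization itself. It assumes a sign convention in which the feature moves with velocity $+\mathbf{V}$, so that each event is warped back to $t_0$ before being substituted into the conic; the function names and the synthetic events are hypothetical.

```python
def conic_matrix(alpha, beta, gamma, delta, eps, zeta):
    """Symmetric matrix Q of the conic, laid out as in (14)."""
    return [[alpha, beta / 2.0, delta / 2.0],
            [beta / 2.0, gamma, eps / 2.0],
            [delta / 2.0, eps / 2.0, zeta]]

def algebraic_residual(Q, event, velocity, t0):
    """P^T Q P for one event (x, y, t) after warping it back to time t0."""
    x, y, t = event
    vx, vy = velocity
    # Assumed convention: the feature moves with +V, so subtract V * (t - t0).
    p = [x - vx * (t - t0), y - vy * (t - t0), 1.0]
    return sum(p[i] * Q[i][j] * p[j] for i in range(3) for j in range(3))

def moving_ellipse_cost(Q, events, velocity, t0):
    """Sum of absolute algebraic residuals over all events, following (15)."""
    return sum(abs(algebraic_residual(Q, e, velocity, t0)) for e in events)

# Synthetic example: ellipse (x-10)^2/16 + (y-20)^2/4 = 1 at t0 = 0,
# i.e. x^2 + 4y^2 - 20x - 160y + 1684 = 0, moving at 50 px/s in x and -30 px/s in y.
Q_true = conic_matrix(1.0, 0.0, 4.0, -20.0, -160.0, 1684.0)
V_true = (50.0, -30.0)
on_ellipse = [(14.0, 20.0), (6.0, 20.0), (10.0, 22.0), (10.0, 18.0)]
times = [0.0, 0.004, 0.008, 0.01]
events = [(x + V_true[0] * t, y + V_true[1] * t, t)
          for (x, y), t in zip(on_ellipse, times)]

cost_with_motion = moving_ellipse_cost(Q_true, events, V_true, 0.0)
cost_without_motion = moving_ellipse_cost(Q_true, events, (0.0, 0.0), 0.0)
```

In this example the cost is numerically zero when the true velocity is used and clearly larger when the motion is ignored, which is the effect the refinement step is meant to remove.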

IV-C Trajectory Optimization

Figure 5: Schematic of a piece-wise B-spline trajectory. The number of event features inside the red box is insufficient; therefore, this segment of the trajectory is omitted.

The refinement process converts the previously discrete elliptical features into densely populated patterns within the corresponding time period of associated events. This densely populated feature facilitates the optimization of event camera poses within a continuous-time trajectory representation. Nevertheless, maintaining continuous visibility of the entire calibration plate during the calibration process poses a challenge. Incomplete calibration patterns in a continuous event stream often result in unsuccessful recognition. To mitigate this issue, the continuous-time trajectory is divided into multiple segments based on the output from the recognizer. Consequently, a piece-wise B-spline-based optimizer for event camera pose trajectories is proposed.

First, based on the predefined knot interval $\Delta t$ and the results of the recognizer, features whose timestamps differ from those of all other features by more than $\Delta t$ are eliminated. In addition, segments containing too few features are excluded to ensure optimization accuracy, so that only well-supported segments are preserved. Fig. 5 illustrates the segmentation of the trajectory. The entire calibration process $\Lambda_{\mathcal{L}}$ is divided into $M$ trajectory segments $\mathcal{L}$:

$$\Lambda_{\mathcal{L}} = \sum_m \{\mathcal{L}_m\}_{a_m}^{b_m}, \quad m = 1, \ldots, M \tag{16}$$

where $a_m$ and $b_m$ are the starting and ending times of the $m$th B-spline segment, respectively. Each B-spline segment is optimized using only the features whose timestamps fall within its time period.
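The segmentation rule can be illustrated with a short sketch. It is a simplified reading of the procedure above, not the authors' code: feature timestamps are split wherever two consecutive features are more than the knot interval apart, and segments with fewer than `min_features` features (a hypothetical threshold) are dropped, which also removes isolated features.

```python
def split_into_segments(timestamps, knot_interval, min_features):
    """Return a list of (a_m, b_m, features) tuples for the retained segments."""
    segments, current = [], []
    for t in sorted(timestamps):
        # Start a new segment when the gap to the previous feature exceeds the knot interval.
        if current and t - current[-1] > knot_interval:
            segments.append(current)
            current = []
        current.append(t)
    if current:
        segments.append(current)
    # Keep only segments with enough features to constrain the spline.
    return [(seg[0], seg[-1], seg) for seg in segments if len(seg) >= min_features]

# Hypothetical feature timestamps in seconds.
stamps = [0.00, 0.01, 0.02, 0.03, 0.50, 1.00, 1.01, 1.02]
segments = split_into_segments(stamps, knot_interval=0.05, min_features=3)
```

Here the isolated feature at 0.50 s is discarded, and two segments remain with $(a_m, b_m)$ equal to (0.00, 0.03) and (1.00, 1.02).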

Figure 6: Schematic of sampling on a continuously moving ellipse model.

For $\mathcal{L}_m$, the corresponding state vector $\mathcal{X}_e$ is:

$$\mathcal{X}_e = [\xi_m^1,\ \xi_m^2,\ \cdots,\ \xi_m^{N_m},\ K_e,\ D_e] \tag{17}$$

where $\xi_m^i$ is the $i$th control point of $\mathcal{L}_m$, $N_m$ is the number of control points, and $K_e$ and $D_e$ are the intrinsic and distortion parameters of the event camera, respectively. The visual residual for the $i$th feature, based on the reprojection error, is defined as:

$$\begin{aligned} \mathbf{r}_e(\mathcal{X}_e) = \sum_{k \in \mathcal{K}} \; & \pi_e\left(\mathbf{R}_w^e(t_i + k \cdot \delta_{t_i}) P_i^w + \mathbf{t}_w^e(t_i + k \cdot \delta_t)\right) \\ & - \left(\begin{bmatrix} u_i^e \\ v_i^e \end{bmatrix} + k \cdot \delta_t \mathbf{V}_i\right) \end{aligned} \tag{18}$$

where $\mathbf{R}_w^e(\cdot)$ and $\mathbf{t}_w^e(\cdot)$ are derived from the B-spline trajectory using (6) and (7), respectively. Since the feature refinement yields a continuous moving ellipse model, the residuals can be constructed by sampling the model at any time. Here, $\mathcal{K}$ denotes the partition of $\delta_{t_i}$, which defines the sampling interval of the feature, as shown in Fig. 6. The function $\pi_e(\cdot)$ projects the spatial point $P_i^w$ onto the “accumulation frame”.

The intrinsic and distortion parameters of the event camera, along with the control points of the splines, are jointly optimized by minimizing the following cost function:

$$\arg\min_{\mathcal{X}_e} \left\{ \sum \rho\left(\|\mathbf{r}_e(\mathcal{X}_e)\|^2\right) \right\} \tag{19}$$

where $\rho(\cdot)$ is the Huber loss function.
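For reference, the robustified cost in (19) can be sketched in a few lines. The sketch follows the convention used by Ceres, in which the loss is applied to the squared residual norm $s = \|\mathbf{r}\|^2$; the scale `delta` is an assumed value, since the text does not state the one used in EF-Calib.

```python
import math

def huber(s, delta=1.0):
    """Huber loss on a squared residual norm s (Ceres-style convention)."""
    if s <= delta * delta:
        return s                                         # quadratic region: cost equals s
    return 2.0 * delta * math.sqrt(s) - delta * delta    # linear growth for outliers

def robust_cost(residuals, delta=1.0):
    """Sum of Huber-robustified squared norms of 2-D reprojection residuals."""
    return sum(huber(rx * rx + ry * ry, delta) for rx, ry in residuals)

# Hypothetical residuals in pixels: one inlier (norm 0.5) and one outlier (norm 5).
example_cost = robust_cost([(0.3, 0.4), (3.0, 4.0)])
```

The inlier contributes its squared norm (0.25), while the outlier contributes $2 \cdot 5 - 1 = 9$ instead of 25, so a single badly recognized feature has a limited influence on the optimized trajectory.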

IV-D Spatiotemporal Calibration

The final step jointly optimizes the event camera and the frame-based camera, utilizing the previously optimized trajectories, to determine the extrinsic parameters and the time offset. The corresponding state vector $\mathcal{X}_f$ is:

$$\mathcal{X}_f = [K_f,\ D_f,\ \mathbf{T}_e^f,\ t_d] \tag{20}$$

where $\mathbf{T}_e^f$ is the transformation matrix between the two cameras and $t_d$ is the difference between the real timestamps of the two cameras, i.e., the time offset.

Similarly, define the visual residuals based on reprojection errors in spatiotemporal calibration as:

$$\mathbf{r}_f(\mathcal{X}_f) = \pi_f\left(\mathbf{R}_e^f\left(\mathbf{R}_w^e(t_i + t_d) P_i^w + \mathbf{t}_w^e(t_i + t_d)\right) + \mathbf{t}_e^f\right) - \begin{bmatrix} u_i^f \\ v_i^f \end{bmatrix} \tag{21}$$

where $\pi_f(\cdot)$ projects the spatial points onto the image plane of the frame-based camera.
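As an illustration of the projection functions $\pi_e(\cdot)$ and $\pi_f(\cdot)$, the sketch below assumes a pinhole model with two radial distortion coefficients, which matches the parameters reported in Table I ($f_x$, $f_y$, $c_x$, $c_y$, $k_1$, $k_2$). The exact distortion model used in EF-Calib is not stated in this section, so this choice, the function names, and the numerical values are assumptions.

```python
def project(point_cam, fx, fy, cx, cy, k1, k2):
    """Project a 3-D point in camera coordinates to pixels (pinhole + radial distortion)."""
    X, Y, Z = point_cam
    x, y = X / Z, Y / Z                        # normalized image coordinates
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 * r2      # assumed two-coefficient radial model
    return (fx * x * radial + cx, fy * y * radial + cy)

def reprojection_residual(point_cam, observed_uv, intrinsics):
    """Predicted pixel minus observed pixel, as in the residuals (18) and (21)."""
    u, v = project(point_cam, *intrinsics)
    return (u - observed_uv[0], v - observed_uv[1])

# Hypothetical intrinsics (fx, fy, cx, cy, k1, k2), not the calibrated values of the paper.
intrinsics = (400.0, 400.0, 160.0, 130.0, -0.4, 0.3)
center_pixel = project((0.0, 0.0, 2.0), *intrinsics)
off_axis_pixel = project((0.1, 0.0, 1.0), *intrinsics)
```

A point on the optical axis projects to the principal point, and the negative $k_1$ pulls off-axis points slightly toward the image center, which is the typical barrel-distortion behavior.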

From (21), the Jacobian $J_{t_d}$ of $\mathbf{r}_f$ w.r.t. $t_d$ can be obtained by the chain rule:

$$J_{t_d} = \frac{\partial \mathbf{r}_f}{\partial P_i^f} \frac{\partial P_i^f}{\partial t_d} = \frac{\partial \mathbf{r}_f}{\partial P_i^f} \left( \frac{\partial P_i^f}{\partial \mathbf{R}_w^e} \frac{\partial \mathbf{R}_w^e}{\partial t_d} + \frac{\partial P_i^f}{\partial \mathbf{t}_w^e} \frac{\partial \mathbf{t}_w^e}{\partial t_d} \right) \tag{22}$$

where $P_i^f = \mathbf{R}_e^f\left(\mathbf{R}_w^e(t_i + t_d) P_i^w + \mathbf{t}_w^e(t_i + t_d)\right) + \mathbf{t}_e^f$.

Based on (8), (9), and (22), $\partial P_i^f / \partial t_d$ can be derived straightforwardly:

$$\frac{\partial P_i^f}{\partial t_d} = \frac{\partial P_i^f}{\partial \mathbf{R}_w^e} \frac{\partial \mathbf{R}_w^e}{\partial t_d} + \frac{\partial P_i^f}{\partial \mathbf{t}_w^e} \frac{\partial \mathbf{t}_w^e}{\partial t_d} = \mathbf{R}_e^f \dot{\mathbf{R}}_w^e(t_i + t_d) P_i^w + \mathbf{R}_e^f \mathbf{v}_w^e(t_i + t_d) \tag{23}$$
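The structure of this derivative can be checked numerically on a toy trajectory. The sketch below replaces the B-spline with a hypothetical closed-form motion (constant-rate rotation about the z-axis plus constant linear velocity) and uses made-up extrinsics, so it only verifies the form $\mathbf{R}_e^f(\dot{\mathbf{R}}_w^e P_i^w + \mathbf{v}_w^e)$ against central finite differences; it is not the analytic spline Jacobian of the paper.

```python
import math

def rot_z(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rot_z_rate(angle, omega):
    """Time derivative of rot_z(omega * t), evaluated at angle = omega * t."""
    c, s = math.cos(angle), math.sin(angle)
    return [[-s * omega, -c * omega, 0.0], [c * omega, -s * omega, 0.0], [0.0, 0.0, 0.0]]

def rot_x(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

OMEGA = 0.8                      # assumed angular rate of the toy trajectory (rad/s)
VELOCITY = [0.2, -0.1, 0.05]     # assumed linear velocity v_w^e (m/s)
R_EF = rot_x(0.1)                # assumed extrinsic rotation R_e^f
T_EF = [0.1, 0.0, 0.0]           # assumed extrinsic translation t_e^f

def point_in_frame_camera(time, p_w):
    """P_i^f for the toy trajectory, with time = t_i + t_d."""
    p_e = add(matvec(rot_z(OMEGA * time), p_w), [v * time for v in VELOCITY])
    return add(matvec(R_EF, p_e), T_EF)

def analytic_derivative(time, p_w):
    """R_e^f (R_dot P_w + v), the structure of the derivative w.r.t. t_d."""
    rotated_rate = matvec(rot_z_rate(OMEGA * time, OMEGA), p_w)
    return matvec(R_EF, add(rotated_rate, VELOCITY))

def numeric_derivative(time, p_w, h=1e-6):
    """Central finite difference of P_i^f with respect to time."""
    hi = point_in_frame_camera(time + h, p_w)
    lo = point_in_frame_camera(time - h, p_w)
    return [(a - b) / (2.0 * h) for a, b in zip(hi, lo)]

p_world = [0.3, -0.2, 1.5]
analytic = analytic_derivative(0.4, p_world)
numeric = numeric_derivative(0.4, p_world)
```

For this toy motion the analytic and finite-difference derivatives agree to well below $10^{-6}$, which is the kind of check that is useful before trusting an analytic Jacobian inside a solver.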

To jointly optimize the intrinsic and extrinsic parameters of the frame-based camera, as well as the time offset, the following cost function is minimized:

$$\arg\min_{\mathcal{X}_f} \left\{ \sum \rho\left(\|\mathbf{r}_f(\mathcal{X}_f)\|^2\right) \right\} \tag{24}$$

Notably, the optimization problems in (15), (19), and (24) are solved using Google Ceres (https://ceres-solver.org/).

V Experiments

In this section, several experiments are conducted to evaluate the performance of EF-Calib, comprising an intrinsic calibration test, an extrinsic calibration test, and a time offset calibration test. Additionally, an ablation study is conducted to evaluate the contribution of several key modules within EF-Calib.

V-A System Setup

A real-world stereo vision system was built, as shown in Fig. 1(b). It contains an event camera and a frame-based camera. The two cameras are mounted on a slide, on which the baseline and viewing angle can be freely adjusted, so that the calibration performance of EF-Calib can be tested comprehensively in different configurations. The event camera used in this letter is the Inivation DAVIS 346, featuring a resolution of 346 × 260 and a maximum temporal resolution of 1 µs. This camera can also generate regular frames at 30 Hz under standard illumination conditions, which makes it straightforward to compare EF-Calib with a high-quality frame-based calibration pipeline, such as the OpenCV calibration toolbox [29]. The frame-based camera is the HikVision MV-CE013-80UM industrial camera with a global shutter and a resolution of 1280 × 1024 pixels. Note that no hardware synchronization was used in the stereo vision system. This deliberate choice provides a more rigorous evaluation of the calibration capability of EF-Calib under real-world conditions.

V-B Calibration Experiments

The calibration performance of a stereo vision system is usually affected by the camera baseline and viewing angle. To fully evaluate EF-Calib, we conducted calibration experiments in three settings and analyzed the results separately. In the first setting (Trial 1), the cameras are configured with a regular baseline. In the second setting (Trial 2), the cameras are configured with a wide baseline. In the third setting (Trial 3), the cameras are configured with a narrow baseline. In each trial, the cameras are adjusted to a reasonable viewing angle that ensures sufficient overlap of the two fields of view, and images and event data are recorded simultaneously for long enough (about 40 s) for the calibration results to converge.

V-B1 Intrinsic Calibration Test

Using only the event stream data from the above three trials, we performed the intrinsic calibration of the event camera separately for each trial. Beforehand, we performed the intrinsic calibration using the OpenCV toolbox [29] with the frames provided by the DAVIS 346 and took this result as the ground truth. In addition, EF-Calib is compared with two state-of-the-art event camera intrinsic calibration methods [17, 20]. Note that each compared method uses the calibration pattern it originally employed: [17] uses a checkerboard pattern, while [20] uses an asymmetric circular pattern.

Table I shows the intrinsic calibration results of each method. The intrinsic parameters obtained by our method are the closest to the ground truth (GT), and the results of the three trials are very stable. In addition, Fig. 7 plots the intrinsic parameters over time. EF-Calib converges in less than 20 s, demonstrating the ease of use of our method.

Figure 7: Results of the intrinsic calibration test.
TABLE I: Comparative Results of Intrinsic Calibration Test

| Methods | $f_x$ | $f_y$ | $c_x$ | $c_y$ | $k_1$ | $k_2$ | RPE |
|---|---|---|---|---|---|---|---|
| Frame-based (GT) | 413.84 | 413.80 | 157.42 | 132.25 | -0.38 | 0.31 | 0.13 |
| E2Calib [17] | 417.56 | 417.24 | 159.86 | 132.35 | -0.36 | 0.09 | 0.41 |
| E-Calib [20] | 404.66 | 403.99 | 159.81 | 132.69 | -0.37 | 0.32 | 0.33 |
| EF-Calib (Trial 1) | 414.06 | 413.28 | 158.03 | 132.43 | -0.38 | 0.31 | 0.10 |
| EF-Calib (Trial 2) | 414.79 | 413.85 | 158.74 | 131.90 | -0.38 | 0.34 | 0.12 |
| EF-Calib (Trial 3) | 414.87 | 413.94 | 159.16 | 133.65 | -0.37 | 0.31 | 0.13 |

V-B2 Extrinsic Calibration Test

To evaluate the extrinsic calibration performance of the EF-Calib, we calculated the errors corresponding to rotation and translation separately for each trial, i.e.

$$\begin{aligned} e_t &= \frac{1}{N} \sum_i^N \left\| \mathbf{t}_w^e(t_i + t_d) - \mathbf{T}_f^e {\mathbf{t}_w^f}_i \right\|_2 \\ e_r &= \frac{1}{N} \sum_i^N \left\| \boldsymbol{\theta}\left(\mathbf{R}_w^e(t_i + t_d)\right) - \boldsymbol{\theta}\left(\mathbf{R}_f^e {\mathbf{R}_w^f}_i\right) \right\|_2 \end{aligned} \tag{25}$$

where $N$ is the number of frames and $\boldsymbol{\theta}(\cdot)$ denotes the Euler angles corresponding to a rotation matrix. As in the intrinsic calibration test, we also acquired 30 pairs of images of a checkerboard calibration board from the two cameras simultaneously in different poses. These frames were calibrated using the OpenCV toolbox to obtain the corresponding extrinsic parameters of the two cameras. The EF-Calib calibration errors were then compared with the corresponding errors of the frame-based extrinsic calibration. Table II shows that EF-Calib achieves the same level of error as the frame-based extrinsic calibration, verifying its effectiveness in extrinsic calibration.

TABLE II: Comparative Results of Extrinsic Calibration Test

| Trial | Method | $e_t$ (mm) | $e_r$ (°) | Frames |
|---|---|---|---|---|
| Trial 1 | Frame-based | 0.3499 | 0.0918 | 30 |
| Trial 1 | EF-Calib | 0.5336 | 0.1984 | 250 |
| Trial 2 | Frame-based | 0.4199 | 0.1854 | 30 |
| Trial 2 | EF-Calib | 0.6572 | 0.3127 | 207 |
| Trial 3 | Frame-based | 0.2541 | 0.0954 | 30 |
| Trial 3 | EF-Calib | 0.3638 | 0.2911 | 184 |
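The error metrics in (25) can be sketched as follows. This is a minimal illustration under stated assumptions: the frame-camera poses are taken to be already transformed into the event-camera frame with the estimated extrinsics, the Euler angles follow a ZYX (yaw-pitch-roll) convention, and angle wrap-around is ignored. None of these choices is specified in the text, and all names and pose values are hypothetical.

```python
import math

def rot_z_deg(angle_deg):
    """Rotation about the z-axis, used only to build the demo poses."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def euler_zyx_deg(R):
    """Roll, pitch, yaw in degrees for R = Rz(yaw) Ry(pitch) Rx(roll) (assumed convention)."""
    yaw = math.atan2(R[1][0], R[0][0])
    pitch = -math.asin(max(-1.0, min(1.0, R[2][0])))
    roll = math.atan2(R[2][1], R[2][2])
    return [math.degrees(a) for a in (roll, pitch, yaw)]

def mean_l2(vectors_a, vectors_b):
    """Mean Euclidean distance between paired vectors."""
    total = 0.0
    for a, b in zip(vectors_a, vectors_b):
        total += math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return total / len(vectors_a)

def extrinsic_errors(t_event, t_frame_in_event, R_event, R_frame_in_event):
    """Return (e_t, e_r) as in (25); units follow the inputs (e.g. mm and degrees)."""
    e_t = mean_l2(t_event, t_frame_in_event)
    e_r = mean_l2([euler_zyx_deg(R) for R in R_event],
                  [euler_zyx_deg(R) for R in R_frame_in_event])
    return e_t, e_r

# Two hypothetical frames: translations in mm, rotations about z in degrees.
e_t, e_r = extrinsic_errors(
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    [[0.3, 0.4, 0.0], [1.0, 0.0, 0.0]],
    [rot_z_deg(10.0), rot_z_deg(20.0)],
    [rot_z_deg(10.2), rot_z_deg(20.0)],
)
```

In this toy example the first frame differs by 0.5 mm and 0.2° and the second frame matches exactly, so the averages are $e_t = 0.25$ mm and $e_r = 0.1°$.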

V-B3 Time Offset Calibration Test

Time offset estimation is crucial for multi-camera systems that lack hardware triggering. In this test, we evaluate EF-Calib’s capability to calibrate time offsets by manually adjusting the timestamp of each image frame. Specifically, we compare the time offset estimated from the modified timestamps with the original estimate. To this end, the timestamps are uniformly delayed or advanced by a fixed time interval $\Delta t_d$. The timestamps were shifted by ±2.5 ms and ±5 ms for each trial, and the difference between the time offset estimated after the shift and the original estimate was compared with the applied $\Delta t_d$. Fig. 8 illustrates the results of the time offset calibration test, showing that EF-Calib accurately calibrates the time offset in different scenarios.

Figure 8: Results of the time offset calibration test.
TABLE III: Ablation Study of EF-Calib

| Piece-wise trajectory | Temporal calibration | Feature refinement | Trial 1 $e_t$ | Trial 1 $e_r$ | Trial 2 $e_t$ | Trial 2 $e_r$ | Trial 3 $e_t$ | Trial 3 $e_r$ |
|---|---|---|---|---|---|---|---|---|
|  |  |  | 8.1633 | 2.8632 | 4.6642 | 2.6372 | 1.2161 | 1.4511 |
|  |  |  | 8.0019 | 2.8187 | 3.8794 | 2.7859 | 0.4974 | 1.4322 |
|  |  |  | 0.9141 | 5.6910 | 0.6322 | 0.3239 | 0.3707 | 0.3065 |
|  |  |  | 0.5336 | 0.1984 | 0.6572 | 0.3127 | 0.3638 | 0.2911 |

V-C Ablation Study

To thoroughly analyze and validate the performance and functionality of each module within EF-Calib, we conducted an ablation study. Specifically, we scrutinized three modules: piece-wise trajectories, temporal calibration, and feature refinement. Table III demonstrates the impact of the introduction of these three modules on the calibration error of the EF-Calib extrinsic parameters. As can be seen from Table III, the introduction of all three modules can greatly improve the extrinsic parameter calibration accuracy.

VI Conclusion

In this letter, we propose EF-Calib, a spatiotemporal calibration framework that jointly calibrates the intrinsic parameters, extrinsic parameters, and time offset of event- and frame-based cameras. First, we design a novel calibration pattern that accommodates the heterogeneous nature of event- and frame-based representations, incorporating both circles and crosspoints so that both camera types can recognize it simultaneously. A corresponding event-based recognition algorithm is developed to ensure robust and accurate feature recovery from this pattern. Additionally, to handle the asynchronous nature of the event stream, we introduce a piece-wise B-spline that continuously represents the pose trajectory of the event camera. Finally, we provide the analytical Jacobian of the error term and implement the joint calibration of intrinsic parameters, extrinsic parameters, and time offset for both camera types. Experimental results demonstrate that EF-Calib outperforms current state-of-the-art methods in intrinsic parameter estimation while also achieving high accuracy in extrinsic parameter and time offset estimation. These results confirm the spatiotemporal calibration capabilities of EF-Calib and lay a solid foundation for the fusion of events and frames.

In the future, we aim to explore markerless online calibration based on EF-Calib. Additionally, we plan to utilize EF-Calib to create novel visual perception frameworks that fuse events and frames.

References

  • [1] G. Gallego, T. Delbruck, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. Davison, J. Conradt, and K. Daniilidis, “Event-based vision: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 1, pp. 154–180, Jan. 2020.
  • [2] S. Zhu, Z. Tang, M. Yang, E. Learned-Miller, and D. Kim, “Event camera-based visual odometry for dynamic motion tracking of a legged robot using adaptive time surface,” in Proc. IEEE/RSJ Int. Conf. Intell. Rob. Syst., Detroit, MI, USA, 2023, pp. 3475–3482.
  • [3] Z. Zhou, Z. Wu, R. Boutteau, F. Yang, C. Demonceaux, and D. Ginhac, “RGB-Event fusion for moving object detection in autonomous driving,” in Proc. IEEE Int. Conf. Robot. Autom., London, United Kingdom, Jul. 2023, pp. 7808–7815.
  • [4] J. Jiang, J. Li, B. Zhang, X. Deng, and B. Shi, “EvHandPose: Event-based 3D hand pose estimation with sparse supervision,” IEEE Trans. Pattern Anal. Mach. Intell., early access, doi: 10.1109/TPAMI.2024.3380648, 2024.
  • [5] J. Han, Y. Yang, P. Duan, C. Zhou, L. Ma, C. Xu, T. Huang, I. Sato, B. Shi, “Hybrid high dynamic range imaging fusing neuromorphic and conventional images,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 7, pp. 8553–8565, Jul. 2023.
  • [6] A. R. Vidal, H. Rebecq, T. Horstschaefer, and D. Scaramuzza, “Ultimate SLAM? Combining events, images, and IMU for robust visual SLAM in HDR and high-speed scenarios,” IEEE Robot. Automat. Lett., vol. 3, no. 2, pp. 994–1001, Apr. 2018.
  • [7] M. S. Lee, J. H. Jung, Y. J. Kim, and C. G. Park, “Event-and frame-based visual-inertial odometry with adaptive filtering based on 8-DOF war** uncertainty,” IEEE Robot. Automat. Lett., vol. 9, no. 2, pp. 1003–1010, Feb. 2024,
  • [8] W. Guan, P. Chen, Y. Xie, and P. Lu, “PL-EVIO: Robust monocular event-based visual inertial odometry with point and line features,” IEEE Trans. Automat. Sci. Eng., early access, doi: 10.1109/TASE.2023.3324365, 2023.
  • [9] C. Luo, J. Wu, S. Sun, and P. Ren, “TransCODNet: Underwater transparently camouflaged object detection via RGB and event frames collaboration,” IEEE Robot. Automat. Lett., vol. 9, no. 2, pp. 1444–1451, Feb. 2024.
  • [10] P. Chen, W. Guan, F. Huang, Y. Zhong, W. Wen, L. Hsu, P. Lu, “ECMD: An event-centric multisensory driving dataset for SLAM,” IEEE Trans. Intell. Veh., vol. 9, no. 1, pp. 407–416, Jan. 2024.
  • [11] C. Creß, W. Zimmer, N. Purschke, B. N. Doan, S. Kirchner, V. Lakshminarasimhan, L. Strand, and A. Knoll, “TUMTraf event: Calibration and fusion resulting in a dataset for roadside event-based and RGB cameras,” IEEE Trans. Intell. Veh., early access, doi: 10.1109/TIV.2024.3393749, 2024.
  • [12] L. Gao, Y. Liang, J. Yang, S. Wu, C. Wang, J. Chen, and L. Kneip, “VECtor: A versatile event-centric benchmark for multi-sensor SLAM,” IEEE Robot. Automat. Lett., vol. 7, no. 3, pp. 8217–8224, Jul. 2022.
  • [13] Z. Zhang, “A flexible new technique for camera calibration,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 11, pp. 1330–1334, Nov. 2000.
  • [14] “Calibration toolbox by RPG, University of Zurich,” https://github.com/uzh-rpg/rpg_dvs_ros/tree/master/dvs_calibration.
  • [15] G. Orchard, “Calibration toolbox by G. Orchard,” https://github.com/gorchard/DVScalibration.
  • [16] “Calibration toolbox by VLOGroup at TU Graz,” https://github.com/VLOGroup/dvs-calibration.
  • [17] M. Muglikar, M. Gehrig, D. Gehrig, and D. Scaramuzza, “How to calibrate your event camera,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops, Nashville, TN, USA, Jun. 2021, pp. 1403–1409.
  • [18] H. Rebecq, R. Ranftl, V. Koltun, and D. Scaramuzza, “High speed and high dynamic range video with an event camera,” IEEE Trans. Pattern Anal. Mach. Intell. vol. 43, no. 6, pp. 1964–1980, Jun. 2021.
  • [19] K. Huang, Y. Wang, and L. Kneip, “Dynamic event camera calibration,” in Proc. IEEE/RSJ Int. Conf. Intell. Rob. Syst., Prague, Czech Republic, Sep. 2021, pp. 7021–7028.
  • [20] M. Salah, A. Abdulla, H. Muhammad, G. Daniel, A. Abdelqader, S. Lakmal, S. Davide, and Z. Yahya, “E-Calib: A fast, robust and accurate calibration toolbox for event cameras,” Jun. 2023, arXiv:2306.09078.
  • [21] J. Rehder, R. Siegwart, and P. Furgale, “A general approach to spatiotemporal calibration in multisensor systems,” IEEE Trans. Robot., vol. 32, no. 2, pp. 383–398, Apr. 2016.
  • [22] J. Huai, Y. Zhuang, Y. Lin, G. Jozkow, Q. Yuan, and D. Chen, “Continuous-time spatiotemporal calibration of a rolling shutter camera-IMU system,” IEEE Sensors J., vol. 22, no. 8, pp. 7920–7930, Apr. 2022.
  • [23] E. Mueggler, G. Gallego, H. Rebecq, and D. Scaramuzza, “Continuous-time visual-inertial odometry for event cameras,” IEEE Trans. Robot., vol. 34, no. 6, pp. 1425–1440, Dec. 2018.
  • [24] A. Patron-Perez, S. Lovegrove, and G. Sibley, “A spline-based trajectory representation for sensor fusion and rolling shutter cameras,” Int. J. Comput. Vis., vol. 113, no. 3, pp. 208–219, Feb. 2015.
  • [25] K. Qin, “General matrix representations for B-splines,” Visual Comput., vol. 16, no. 3, pp. 177-186, 2000.
  • [26] C. Sommer, V. Usenko, D. Schubert, N. Demmel, and D. Cremers, “Efficient derivative computation for cumulative B-splines on lie groups,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Seattle, WA, USA, Jun. 2020, pp. 11145–11153
  • [27] S. Wang, M. Zhu, Y. Hu, D. Li, F. Yuan, and J. Yu, “Accurate detection and localization of curved checkerboard-like marker based on quadratic form,” IEEE Trans. Instrum. Meas., vol. 71, pp. 1–11, Jul. 2022.
  • [28] C. Grana, D. Borghesani, and R. Cucchiara, “Optimized block-based connected components labeling with decision trees,” IEEE Trans. Imag. Process., vol. 19, no. 6, pp. 1596–1609, Jun. 2010.
  • [29] G. Bradski and A. Kaehler, Learning OpenCV: Computer vision with the OpenCV library. Sebastopol, CA, USA: O’Reilly, 2008, pp. 370–396.