EF-Calib: Spatiotemporal Calibration of
Event- and Frame-Based Cameras
Using Continuous-Time Trajectories

Shaoan Wang, Zhanhua Xin, Yaoqing Hu, Dongyue Li, Mingzhu Zhu, and Junzhi Yu

This work was supported in part by the National Natural Science Foundation of China under Grant T2121002 and Grant 62233001, and in part by the Beijing Natural Science Foundation under Grant 2022MQ05. (Corresponding author: Junzhi Yu.)

Shaoan Wang, Zhanhua Xin, Yaoqing Hu, Dongyue Li, and Junzhi Yu are with the State Key Laboratory for Turbulence and Complex Systems, Department of Advanced Manufacturing and Robotics, College of Engineering, Peking University, Beijing 100871, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

Mingzhu Zhu is with the Department of Mechanical Engineering, Fuzhou University, Fuzhou 350000, China (e-mail: [email protected]).

Abstract

The event camera, a bio-inspired asynchronously triggered camera, offers promising prospects for fusion with frame-based cameras owing to its low latency and high dynamic range. However, calibrating stereo vision systems that incorporate both event- and frame-based cameras remains a significant challenge. In this letter, we present EF-Calib, a spatiotemporal calibration framework for event- and frame-based cameras using continuous-time trajectories. A novel calibration pattern applicable to both camera types and a corresponding event-based recognition algorithm are proposed. Leveraging the asynchronous nature of events, a differentiable piece-wise B-spline is introduced to represent the camera pose continuously, enabling calibration of the intrinsic parameters, extrinsic parameters, and time offset, with analytical Jacobians provided. Experiments evaluate the calibration performance of EF-Calib on intrinsic parameters, extrinsic parameters, and time offset. The results show that EF-Calib yields more accurate intrinsic parameters than current state-of-the-art methods, extrinsic accuracy close to that of frame-based calibration, and accurate time offset estimation. EF-Calib provides a convenient and accurate toolbox for calibrating systems that fuse events and frames. The code of this paper will also be open-sourced at: https://github.com/wsakobe/EF-Calib.

Index Terms:

Event camera, spatiotemporal calibration, continuous-time trajectory, time offset estimation.

I Introduction

In recent years, there has been growing interest among researchers in a novel bio-inspired camera called the event camera [1]. Abandoning the frame-triggered concept of conventional cameras, each pixel of the event camera responds independently and asynchronously to changes in illumination, resulting in an extremely low-latency and high-dynamic-range response pattern. These advantages offer competitive prospects for event cameras in areas such as robotics [2], autonomous driving [3], VR/AR [4], and camera imaging [5].

Figure 1: Overview diagram of EF-Calib. (a) The novel calibration pattern consists of the concentric circle and crosspoint. (b) The stereo vision system consists of an event camera and a frame-based camera. (c) The calibration process of EF-Calib.

However, due to the imaging principle of event cameras, they can only react to changes in illumination, making it difficult to capture absolute amounts of illumination and RGB values as frame-based cameras do. This limitation weakens the ability of event cameras to perceive and understand the environment. Therefore, many recent studies have attempted to fuse events with images in order to fully utilize the unique advantages of both modalities, as illustrated in Fig. 1. Some novel SLAM systems achieve more robust localization under fast motion by fusing events and frames [2, 8, 7, 6]. In addition, several studies are exploring how to fuse events and frames for object detection in challenging environments [3, 9]. In recent years, some event-centric datasets with multiple sensors, including frame-based cameras, have also been widely proposed [12, 11, 10].

It is important to note that calibrating the intrinsic and extrinsic parameters of each camera is an indispensable step in the context of multi-camera fusion. Classical camera-to-camera calibration schemes typically require time synchronization, followed by the acquisition of each camera’s parameters through the synchronous acquisition of images of the calibration board with different poses [13]. However, due to the asynchronous nature of the event camera, it is difficult to combine multiple events into single “frames” and time-synchronize them with an image. In addition, events are only generated if there is relative motion, so the event camera cannot capture a stationary calibration board. In summary, a new calibration framework must be designed for the system including event- and frame-based cameras.

To address the aforementioned issues, this letter proposes a novel spatiotemporal calibration framework for event- and frame-based cameras. To the best of our knowledge, EF-Calib is the first calibration framework to achieve joint calibration of event- and frame-based cameras without requiring any time synchronization. The main contributions of this paper are as follows:

1. A novel spatiotemporal calibration framework for event- and frame-based cameras is proposed. This framework can jointly obtain the intrinsic and extrinsic parameters, as well as the time offset, without requiring any hardware synchronization.

2. Leveraging the asynchronous and low-latency properties of the event camera, the framework introduces a continuous-time trajectory to optimize its motion trajectory, facilitating arbitrary timestamp alignment with the frame-based camera.

3. Extensive experiments are conducted in diverse scenarios to validate the proposed calibration framework. The results indicate that the framework achieves accuracy close to that of frame-based camera calibration methods and consistently calibrates the time offset between the cameras.

The rest of the letter is organized as follows. Sec. II summarizes the related works. Sec. III presents the preliminaries of event-based vision and continuous-time trajectories. Sec. IV introduces the calibration framework. Sec. V evaluates the calibration performance from different aspects. Finally, Sec. VI concludes this letter.

II Related Works

For geometric vision, camera calibration is particularly crucial as it serves as the initial step in processing the input image signal, with the quality of calibration often dictating the performance of subsequent tasks. Traditional camera calibration has undergone significant evolution. The most prevalent calibration method today involves capturing images of a calibration pattern with a known size, such as a checkerboard, from various viewpoints to identify corresponding feature points [13]. Subsequently, the intrinsic and distortion parameters of each camera, as well as the extrinsic parameters between cameras, are automatically calculated.

Nevertheless, applying this static and discrete calibration method to event cameras, which are triggered by changes in illumination or relative motion, presents challenges. Initially, many open-source event camera calibration toolkits utilized a synchronized blinking LED calibration board with a known size or a blinking checkerboard pattern generated by an LED screen [14, 15, 16]. This allowed the event camera to identify features similar to a conventional camera and utilize traditional calibration methods. However, these toolkits require complex device preparation and are unsuitable for calibrating extrinsic parameters between event- and frame-based cameras.

In recent years, there has been increased focus on designing new calibration frameworks to facilitate event camera calibration using existing calibration boards. Muglikar et al. [17] record the events generated by moving the calibration board, reconstruct images from them with deep learning-based image reconstruction networks such as E2VID [18], and then apply classical calibration methods to the reconstructed images. However, such methods rely heavily on the quality of image reconstruction and face challenges in achieving time synchronization with conventional cameras. Another approach directly utilizes the events generated during camera motion for camera calibration. Huang et al. [19] proposed a calibration framework based on a circular calibration board and employed B-splines to optimize the movement trajectory, which is the method most similar to the one proposed in this letter. However, they directly use clustered asynchronous events as features for optimization, which compromises sub-pixel accuracy and is highly sensitive to noise. Additionally, their focus is solely on the intrinsic calibration of event cameras, without addressing extrinsic calibration between event cameras and frame-based cameras. Salah et al. [20] also utilize circular calibration boards and introduce eRWLS to fit circular features with sub-pixel accuracy. However, they compress events over a period into a fixed timestamp to obtain a reference “frame”, making this method challenging to synchronize with a conventional camera. Furthermore, it does not account for the deformation of circular features at different viewing angles, leading to reduced sub-pixel localization accuracy. The calibration framework proposed in this letter builds on this line of work and addresses the problems of these methods.

III Preliminaries

III-A Event-Based Vision

Unlike conventional cameras, each pixel of the event camera is independently triggered and responds to changes in the logarithmic illumination signal $L(\mathbf{u}_{k},t_{k})$ . An event $(\mathbf{u}_{k},t_{k},p_{k})$ is triggered when the change in logarithmic illumination received by a pixel $\mathbf{u}_{k}=(x_{k},y_{k})$ exceeds a threshold value $C$ , i.e.

\Delta L(\mathbf{u}_{k},t_{k})\doteq L(\mathbf{u}_{k},t_{k})-L(\mathbf{u}_{k},t_{k}-\Delta t)=p_{k}C

(1)

where $\Delta t$ is the time elapsed since the last event triggered at the same pixel, and $p_{k}\in\{-1,+1\}$ is the polarity of the event.
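As an illustration of the event generation model in (1), the following minimal sketch simulates a single pixel. The function name `generate_events`, the sampled intensity values, and the threshold value are assumptions made for this example only; a real sensor triggers asynchronously rather than at sampled times.

```python
import math


def generate_events(times, intensities, C=0.25):
    """Emit (t, polarity) events for one pixel from sampled intensities.

    An event fires each time the log-intensity has moved by at least the
    contrast threshold C away from the level stored at the previous event.
    """
    events = []
    ref = math.log(intensities[0])  # log-intensity at the last event
    for t, value in zip(times[1:], intensities[1:]):
        diff = math.log(value) - ref
        while abs(diff) >= C:
            p = 1 if diff > 0 else -1
            events.append((t, p))
            ref += p * C  # move the reference by one threshold step
            diff = math.log(value) - ref
    return events


# Hypothetical samples: the brightness doubles, then drops below the start.
demo_events = generate_events([0.0, 1.0, 2.0], [1.0, 2.0, 0.9], C=0.25)
```

A large brightness change between two samples produces several events of the same polarity, which mirrors the fact that each event encodes only one threshold crossing.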

III-B Continuous-Time Trajectory Representation

Continuous-time trajectories are often represented as a weighted combination of temporal basis functions [21], such as polynomial functions, FFTs, and Bézier curves. In this letter, the uniform B-spline is adopted as the representation of the continuous-time trajectory. B-splines offer smoothness, local support, and analytic derivatives, which makes them well-suited for representing the 6-DoF pose of the event camera [23, 22, 24]. Following the formulation of the cumulative $k$th-degree B-spline $\mathcal{L}$, the event camera pose $\mathbf{T}^{w}_{e}(\tau)\in\mathbb{SE}(3)$ at any time $\tau\in[t_{i},t_{i+1})$ can be represented by $N$ control points $\mathbf{T}_{i}\in\mathbb{SE}(3),\ i\in\{0,1,\ldots,N-1\}$:

\mathcal{L}:\mathbf{T}_{e}^{w}(\tau)=\mathbf{T}_{i}\cdot\prod_{j=1}^{k}\mathrm{Exp}\left(\tilde{\mathbf{B}}_{j}(\tau)\cdot\mathrm{Log}\left(\mathbf{T}_{i+j-1}^{-1}\mathbf{T}_{i+j}\right)\right)

(2)

where $\tilde{\mathbf{B}}_{j}(\tau)$ is the cumulative basis function, which is denoted by

\tilde{\mathbf{B}}_{j}(\tau)=\tilde{\mathbf{M}}^{(k)}\mathbf{u}

(3)

\mathbf{u}=\begin{bmatrix}1&u&\cdots&u^{k}\end{bmatrix}^{T},\quad u=\frac{\tau-t_{i}}{t_{i+1}-t_{i}}

(4)

where $\tilde{\mathbf{M}}^{(k)}\in\mathbb{R}^{(k+1)\times(k+1)}$ is the cumulative blending matrix of B-splines. Since the control points of the B-splines are uniformly distributed on the time scale, the cumulative blending matrix $\tilde{\mathbf{M}}^{(k)}$ is constant. In this letter, balancing continuity and complexity, we use cubic B-splines to represent the camera pose, i.e., $k=3$. The corresponding cumulative blending matrix $\tilde{\mathbf{M}}^{(3)}$ [25] is

\tilde{\mathbf{M}}^{(3)}=\dfrac{1}{6}\begin{bmatrix}6&0&0&0\\ 5&3&-3&1\\ 1&3&3&-2\\ 0&0&0&1\end{bmatrix}

(5)
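To make (3)-(5) concrete, the sketch below evaluates the cumulative cubic basis at a normalized time $u$. The names `M_TILDE`, `normalized_time`, and `cumulative_basis` are chosen for this example, and the indexing assumption is that entry $j$ of the returned list corresponds to $\tilde{\mathbf{B}}_{j}$, i.e., row $j$ of $\tilde{\mathbf{M}}^{(3)}$ multiplied by $\mathbf{u}$.

```python
# Cumulative blending matrix of Eq. (5), with the 1/6 factor applied.
M_TILDE = [
    [6 / 6, 0 / 6, 0 / 6, 0 / 6],
    [5 / 6, 3 / 6, -3 / 6, 1 / 6],
    [1 / 6, 3 / 6, 3 / 6, -2 / 6],
    [0 / 6, 0 / 6, 0 / 6, 1 / 6],
]


def normalized_time(tau, t_i, dt):
    """Map tau in [t_i, t_i + dt) to u in [0, 1), as in Eq. (4)."""
    return (tau - t_i) / dt


def cumulative_basis(u):
    """Return [B~_0(u), B~_1(u), B~_2(u), B~_3(u)] for the cubic case."""
    powers = [1.0, u, u * u, u ** 3]
    return [sum(m * p for m, p in zip(row, powers)) for row in M_TILDE]


start = cumulative_basis(0.0)
end = cumulative_basis(1.0)
```

At $u=1$ the entries $\tilde{\mathbf{B}}_{2}$ and $\tilde{\mathbf{B}}_{3}$ equal the entries $\tilde{\mathbf{B}}_{1}$ and $\tilde{\mathbf{B}}_{2}$ at $u=0$, which is what keeps consecutive spline segments continuous at the knots.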

Commonly, to simplify the computation, the rotation $\mathbf{R}_{e}^{w}(\tau)\in\mathbb{SO}(3)$ and translation $\mathbf{p}_{e}^{w}(\tau)\in\mathbb{R}^{3}$ are decoupled into two independent cubic B-splines, and the same approach is adopted in this letter. Hence, the continuous-time trajectory of the camera pose can be finally formulated as

\mathbf{R}_{e}^{w}(\tau)=\mathbf{R}_{i}\cdot\prod_{j=1}^{3}\mathrm{Exp}\left(\tilde{\mathbf{B}}_{j}(\tau)\cdot\mathrm{Log}\left(\mathbf{R}_{i+j-1}^{-1}\mathbf{R}_{i+j}\right)\right)

(6)

\mathbf{p}_{e}^{w}(\tau)=\mathbf{p}_{i}+\sum_{j=1}^{3}\tilde{\mathbf{B}}_{j}(\tau)\cdot(\mathbf{p}_{i+j}-\mathbf{p}_{i+j-1})

(7)
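The decoupled splines in (6) and (7) can be evaluated with a few lines of NumPy, as sketched below. This is an illustrative implementation rather than the released code: the helper names are hypothetical, $\mathrm{Log}$ is assumed to return a rotation vector, and the simple `so3_log` used here is only valid away from rotation angles close to $\pi$.

```python
import numpy as np

M_TILDE = np.array([[6, 0, 0, 0], [5, 3, -3, 1], [1, 3, 3, -2], [0, 0, 0, 1]]) / 6.0


def hat(w):
    """Skew-symmetric matrix of a 3-vector."""
    return np.array([[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]])


def so3_exp(w):
    """Rodrigues formula: rotation vector -> rotation matrix."""
    theta = np.linalg.norm(w)
    K = hat(w)
    if theta < 1e-10:
        return np.eye(3) + K
    return np.eye(3) + np.sin(theta) / theta * K + (1 - np.cos(theta)) / theta**2 * K @ K


def so3_log(R):
    """Rotation matrix -> rotation vector (not robust near angle pi)."""
    theta = np.arccos(np.clip((np.trace(R) - 1) / 2, -1.0, 1.0))
    v = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    if theta < 1e-10:
        return 0.5 * v
    return theta / (2 * np.sin(theta)) * v


def cumulative_basis(u):
    return M_TILDE @ np.array([1.0, u, u**2, u**3])


def eval_translation(p_ctrl, i, u):
    """Eq. (7): p_i plus weighted differences of consecutive control points."""
    b = cumulative_basis(u)
    out = p_ctrl[i].copy()
    for j in (1, 2, 3):
        out += b[j] * (p_ctrl[i + j] - p_ctrl[i + j - 1])
    return out


def eval_rotation(R_ctrl, i, u):
    """Eq. (6): R_i times the product of scaled relative rotations."""
    b = cumulative_basis(u)
    out = R_ctrl[i].copy()
    for j in (1, 2, 3):
        out = out @ so3_exp(b[j] * so3_log(R_ctrl[i + j - 1].T @ R_ctrl[i + j]))
    return out


# Toy control points: unit steps along x and 0.1 rad steps about z.
p_ctrl = np.array([[float(k), 0.0, 0.0] for k in range(4)])
R_ctrl = [so3_exp(np.array([0.0, 0.0, 0.1 * k])) for k in range(4)]
p_mid = eval_translation(p_ctrl, 0, 0.5)
R_mid = eval_rotation(R_ctrl, 0, 0.5)
```

For equally spaced control points the spline reproduces the underlying uniform motion, so the toy example yields a position of $1.5$ along $x$ and a rotation of $0.15$ rad about $z$ at $u=0.5$.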

After decoupling the pose into two cubic B-splines, the corresponding analytic derivatives [26] can also be derived

\begin{aligned}\dot{\mathbf{R}}_{e}^{w}(\tau)&=\mathbf{R}_{e}^{w}(\tau)\cdot\left(\boldsymbol{\omega}^{(3)}(\tau)\right)_{\wedge}\\ &=\mathbf{R}_{i}\left(\dot{\mathbf{A}}_{1}\mathbf{A}_{2}\mathbf{A}_{3}+\mathbf{A}_{1}\dot{\mathbf{A}}_{2}\mathbf{A}_{3}+\mathbf{A}_{1}\mathbf{A}_{2}\dot{\mathbf{A}}_{3}\right)\end{aligned}

(8)

\mathbf{v}_{e}^{w}(\tau)=\dot{\mathbf{p}}_{e}^{w}(\tau)=\sum_{j=1}^{3}\dot{\tilde{\mathbf{B}}}_{j}(\tau)\cdot(\mathbf{p}_{i+j}-\mathbf{p}_{i+j-1})

(9)

where

\mathbf{A}_{j}=\mathrm{Exp}\left(\tilde{\mathbf{B}}_{j}(\tau)\cdot\mathrm{Log}\left(\mathbf{R}_{i+j-1}^{-1}\mathbf{R}_{i+j}\right)\right)

(10)

\dot{\mathbf{A}}_{j}=\mathbf{A}_{j}\dot{\tilde{\mathbf{B}}}_{j}(\tau)\mathrm{Log}\left(\mathbf{R}_{i+j-1}^{-1}\mathbf{R}_{i+j}\right)

(11)

\dot{\tilde{\mathbf{B}}}_{j}(\tau)=\dfrac{1}{\Delta t}\tilde{\mathbf{M}}^{(3)}\begin{bmatrix}0\\ 1\\ 2u\\ 3u^{2}\end{bmatrix}

(12)

Since this letter utilizes the uniform B-spline, $\Delta t$ is equal to the time interval between any two consecutive knots, i.e. $\Delta t=t_{i+1}-t_{i}$ .
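As a sanity check of the translational derivative, the sketch below differentiates (7) with respect to $\tau$ using the basis derivative in (12) and compares the result against a central finite difference. The control points, the knot interval `DT`, and the function names are hypothetical values chosen for this example.

```python
import numpy as np

M_TILDE = np.array([[6, 0, 0, 0], [5, 3, -3, 1], [1, 3, 3, -2], [0, 0, 0, 1]]) / 6.0
DT = 0.1  # assumed knot interval in seconds


def basis(u):
    return M_TILDE @ np.array([1.0, u, u**2, u**3])


def basis_dot(u, dt):
    """Eq. (12): time derivative of the cumulative basis, using du/dtau = 1/dt."""
    return M_TILDE @ np.array([0.0, 1.0, 2 * u, 3 * u**2]) / dt


def position(p_ctrl, i, u):
    b = basis(u)
    return p_ctrl[i] + sum(b[j] * (p_ctrl[i + j] - p_ctrl[i + j - 1]) for j in (1, 2, 3))


def velocity(p_ctrl, i, u, dt):
    """Derivative of Eq. (7): the constant term p_i drops out."""
    bd = basis_dot(u, dt)
    return sum(bd[j] * (p_ctrl[i + j] - p_ctrl[i + j - 1]) for j in (1, 2, 3))


ctrl = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 0.0], [3.0, 1.0, 1.0], [4.0, 4.0, 2.0]])
h = 1e-6
numeric = (position(ctrl, 0, 0.3 + h) - position(ctrl, 0, 0.3 - h)) / (2 * h * DT)
analytic = velocity(ctrl, 0, 0.3, DT)
```

The analytic and numerical velocities agree to within finite-difference accuracy, which is the property the optimizer relies on when analytical Jacobians are used.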

IV Methodology

IV-A Calibration Framework

For camera calibration, it is of utmost importance to accurately and robustly identify features on the calibration pattern. However, the checkerboard pattern [27], which is widely used, is difficult to apply to event cameras because events disappear during parallel edge motion. Consequently, calibration patterns for event cameras often employ circular features, but these can produce blur in moving frames. To balance the characteristics of event cameras and frame-based cameras, we designed a new calibration pattern that combines isotropic circles and checkerboard crosspoints, as shown in Fig. 1(a). The center of each circle in this pattern coincides with the center of the inner crosspoint. This hybrid pattern significantly enhances the recognition efficiency and accuracy of the event camera while maintaining compatibility with frame-based cameras.

Fig. 2 presents the flowchart of the proposed calibration framework. In this letter, we divide the entire calibration process into two stages. The first stage focuses on feature extraction and refinement of the calibration pattern. The second stage is dedicated to optimizing the camera trajectory using piecewise B-splines to achieve accurate calibration results. These two stages will be elaborated in the following subsections.

IV-B Event-Based Feature Recognizer

Unlike a frame-based camera, an event camera outputs asynchronous events only during relative motion, posing a challenge for robust calibration pattern recognition. To address this, we propose an event-based calibration pattern feature recognizer, illustrated in Fig. 3. First, we accumulate events over a short period of time $\Delta t$ according to their polarity to obtain “accumulation frames” that resemble traditional images. These “accumulation frames” allow the calibration pattern to be recognized using classical image processing algorithms. The following subsections describe the recognition algorithm based on “accumulation frames” in detail.

IV-B1 Noise Suppression

Event cameras often generate a large number of events from the static background when they are in motion. These noisy events can adversely affect the recognition of calibration boards, significantly reducing the operation speed and accuracy of the recognizer. Therefore, after obtaining the “accumulation frames”, a noise suppression module is designed to filter out most of the events that are not related to the calibration board.

For circular features, the triggered events typically form two semicircular arcs, i.e., two connected regions of opposite polarity. In contrast, many structures in the background have straight edges, so their connected regions tend to resemble straight lines. To exploit this property, we adopt BBDT, a fast and accurate connected component labeling (CCL) algorithm proposed by Grana et al. [28], which merges neighboring events with the same polarity to obtain all the connected regions. Then, the magnitudes of the two principal components of each connected region are calculated using PCA. For background-triggered connected regions, the magnitude of the second principal component $\|\textbf{PC}_{2}\|$ should be much smaller than the magnitude of the first principal component $\|\textbf{PC}_{1}\|$. A large number of noisy regions can therefore be suppressed using the principal component magnitude ratio $\beta_{PC}=\|\textbf{PC}_{1}\|/\|\textbf{PC}_{2}\|$: a region is retained only if $\beta_{PC}<T_{PC}$, where $T_{PC}$ is a threshold, and any region with $\beta_{PC}$ higher than $T_{PC}$ is suppressed and not involved in subsequent operations.
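The ratio test can be sketched as follows for a single connected region given as a list of pixel coordinates. The connected component labeling step is omitted, the threshold value `t_pc=5.0` is a placeholder rather than the value used in the experiments, and the principal component magnitudes are taken to be the singular values of the centered coordinates, which is one possible reading of the text.

```python
import numpy as np


def principal_component_ratio(pixels):
    """Return beta_PC = |PC1| / |PC2| for an (N, 2) array of pixel coordinates."""
    pts = np.asarray(pixels, dtype=float)
    centered = pts - pts.mean(axis=0)
    # Singular values are proportional to the spread along each principal axis.
    s = np.linalg.svd(centered, compute_uv=False)
    return s[0] / max(s[1], 1e-12)  # guard against perfectly collinear regions


def keep_region(pixels, t_pc=5.0):
    """Keep a region only if it is not too elongated (beta_PC < T_PC)."""
    return principal_component_ratio(pixels) < t_pc


# Hypothetical regions: a semicircular arc and a straight background edge.
angles = np.linspace(0.0, np.pi, 30)
arc = np.column_stack([np.round(10 * np.cos(angles)), np.round(10 * np.sin(angles))])
edge = np.column_stack([np.arange(20), np.zeros(20)])
```

In this toy case the arc has a ratio of roughly two and is kept, while the straight edge has a near-zero second component and is suppressed.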

IV-B2 Feature Extraction

Following noise suppression, we proceed to extract potential circular features from the remaining regions. Specifically, we identify two candidate regions of opposite polarity based on their distance. Subsequently, we fit the elliptic equation using all pixels contained within these regions, exploiting the fact that circular features adhere to the elliptic model under a projective transform. Then the fitting error $e_{fit}$ is calculated, excluding candidate regions with a fitting error exceeding the fitting threshold $T_{fit}$ .

Furthermore, additional geometric constraints are needed to eliminate the remaining false positive regions. Specifically, the two candidate regions constituting the same ellipse should demonstrate similar PCA magnitudes; regions with a notable discrepancy in PCA magnitude fail to meet this criterion. Additionally, the contributions of the two candidate regions to the circumference of the ellipse should be close. In other words, the angular range $\theta_{r}$ of each of the two candidate regions with respect to the center of the ellipse should be close to $180^{\circ}$.

Regions that successfully meet these geometric constraints are recognized as accurately representing the elliptical features within the calibration pattern. Consequently, depending on the distribution of these elliptical features, their relative positions on the calibration pattern can be decoded to correspond with the results identified in the frame-based camera.
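The ellipse fitting step can be illustrated with a direct algebraic conic fit. The text does not specify which fitting method or error measure is used, so the least-squares fit via SVD and the mean algebraic residual used as $e_{fit}$ below are assumptions of this sketch, as are the function names.

```python
import numpy as np


def fit_conic(points):
    """Fit a x^2 + b xy + c y^2 + d x + e y + f = 0 to (N, 2) points.

    Returns the unit-norm coefficient vector and the mean algebraic residual.
    """
    pts = np.asarray(points, dtype=float)
    x, y = pts[:, 0], pts[:, 1]
    D = np.column_stack([x * x, x * y, y * y, x, y, np.ones_like(x)])
    _, _, vt = np.linalg.svd(D, full_matrices=False)
    theta = vt[-1]  # right singular vector of the smallest singular value
    return theta, float(np.mean(np.abs(D @ theta)))


def is_ellipse(theta):
    a, b, c = theta[:3]
    return b * b - 4 * a * c < 0


def conic_center(theta):
    """Center of the conic, where the gradient of the quadratic form vanishes."""
    a, b, c, d, e, _ = theta
    return np.linalg.solve(np.array([[2 * a, b], [b, 2 * c]]), np.array([-d, -e]))


# Hypothetical pixels from two arcs of one ellipse (center (50, 40), axes 20 and 10).
t = np.linspace(0.0, 2 * np.pi, 40, endpoint=False)
phi = 0.3
xs = 50 + 20 * np.cos(t) * np.cos(phi) - 10 * np.sin(t) * np.sin(phi)
ys = 40 + 20 * np.cos(t) * np.sin(phi) + 10 * np.sin(t) * np.cos(phi)
coeffs, e_fit = fit_conic(np.column_stack([xs, ys]))
center = conic_center(coeffs)
```

A candidate pair would then be rejected when `e_fit` exceeds the threshold $T_{fit}$ or when the fitted conic is not an ellipse.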

IV-B3 Feature Refinement

Neglecting the timestamps of events during elliptical fitting inevitably introduces errors, thereby affecting the camera calibration precision. However, by regarding the timestamp $t$ as a third dimension alongside the pixel coordinates $[x,y]^{T}$, events can be treated as points in a three-dimensional space. Because each “accumulation frame” covers only a brief time interval, each elliptical feature can be assumed to undergo purely translational motion with velocity $[v_{x},v_{y}]^{T}$, parallel to the pixel plane, within this interval, as demonstrated in Fig. 4. Consequently, the moving ellipse model $\mathcal{F}$ is described by the following representation:

\mathcal{F}:\alpha x(t)^{2}+\beta x(t)y(t)+\gamma y(t)^{2}+\delta x(t)+\epsilon y(t)+\zeta=0

(13)

In matrix form,

\begin{cases}\mathcal{F}:\mathbf{P}(t)^{T}\mathbf{Q}\mathbf{P}(t)=0\\ \mathbf{P}(t)=\mathbf{P}(t_{0})-\mathbf{V}\cdot(t-t_{0})\\ \mathbf{Q}=\begin{pmatrix}\alpha&\beta/2&\delta/2\\ \beta/2&\gamma&\epsilon/2\\ \delta/2&\epsilon/2&\zeta\end{pmatrix}\end{cases}

(14)

where $\mathbf{P}(t)=[x(t),y(t),1]^{T}$, $\mathbf{V}=[v_{x},v_{y}]^{T}$, $t\in[t_{0},t_{0}+2\delta_{t}]$, and $t_{0}$ represents the starting time of the current “accumulation frame”. Here the cost function for model optimization is defined as

\arg\min_{\mathcal{F},\mathbf{V}}\sum_{i\in\{e\}}\left(\|\mathbf{P}_{i}^{T}\mathbf{Q}\mathbf{P}_{i}\|^{1}\right)

(15)

By substituting the events into the aforementioned model, we utilize the Levenberg-Marquardt algorithm to solve the model, thereby refining the elliptical features.
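The residual that enters (15) can be sketched as below. The example follows one reading of (14), in which every event is shifted back to the reference time $t_{0}$ along the feature velocity before the conic is evaluated; the parameter layout and the function names are assumptions of this sketch, and the nonlinear solver itself is left out.

```python
import numpy as np


def conic_matrix(alpha, beta, gamma, delta, eps, zeta):
    """Symmetric matrix Q of Eq. (14)."""
    return np.array([[alpha, beta / 2, delta / 2],
                     [beta / 2, gamma, eps / 2],
                     [delta / 2, eps / 2, zeta]])


def moving_ellipse_residuals(params, events, t0):
    """Algebraic residual P^T Q P per event.

    params = (alpha, beta, gamma, delta, eps, zeta, vx, vy)
    events = (N, 3) array with rows (x, y, t)
    """
    Q = conic_matrix(*params[:6])
    vx, vy = params[6], params[7]
    dt = events[:, 2] - t0
    # Motion compensation: move each event back to where it would lie at t0.
    P = np.column_stack([events[:, 0] - vx * dt,
                         events[:, 1] - vy * dt,
                         np.ones(len(events))])
    return np.einsum("ni,ij,nj->n", P, Q, P)


# Synthetic events on a circle of radius 5 whose center moves at (100, -50) px/s.
rng = np.random.default_rng(0)
ts = rng.uniform(0.0, 0.01, 200)
ang = rng.uniform(0.0, 2 * np.pi, 200)
ev = np.column_stack([10 + 100 * ts + 5 * np.cos(ang), 20 - 50 * ts + 5 * np.sin(ang), ts])
true_params = np.array([1.0, 0.0, 1.0, -20.0, -40.0, 475.0, 100.0, -50.0])
static_params = true_params.copy()
static_params[6:] = 0.0
r_true = moving_ellipse_residuals(true_params, ev, t0=0.0)
r_static = moving_ellipse_residuals(static_params, ev, t0=0.0)
```

With the correct velocity the residuals vanish, whereas ignoring the motion leaves residuals that grow with the event timestamp; a Levenberg-Marquardt solver would minimize this residual over the conic coefficients and the velocity.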

IV-C Trajectory Optimization

The refinement process converts the previously discrete elliptical features into densely populated patterns within the corresponding time period of associated events. This densely populated feature facilitates the optimization of event camera poses within a continuous-time trajectory representation. Nevertheless, maintaining continuous visibility of the entire calibration plate during the calibration process poses a challenge. Incomplete calibration patterns in a continuous event stream often result in unsuccessful recognition. To mitigate this issue, the continuous-time trajectory is divided into multiple segments based on the output from the recognizer. Consequently, a piece-wise B-spline-based optimizer for event camera pose trajectories is proposed.

First, based on the predefined knot interval $\Delta t$ and the results of the recognizer, the features whose timestamps differ from the timestamps of other features by more than $\Delta t$ are eliminated. In addition, segments containing too few features are also excluded to ensure optimization accuracy, and only the more desirable segments are preserved. Fig. 5 illustrates the segmentation process of the trajectory. The entire calibration process $\Lambda_{\mathcal{L}}$ is divided into a combination of $M$ segments of trajectories $\mathcal{L}$ :

\Lambda_{\mathcal{L}}=\sum_{m}\{\mathcal{L}_{m}\}_{a_{m}}^{b_{m}},\quad m=1,\ldots,M

(16)

where $a_{m}$ and $b_{m}$ are the starting and ending times corresponding to the $m$ th segment of B-splines, respectively. Each segment of B-splines is optimized by only the features whose timestamps belong to its time period.
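The segmentation rule can be sketched in a few lines. Here a gap larger than the knot interval between consecutive feature timestamps starts a new segment, and segments with fewer than `min_features` features are discarded; the function name and the value of `min_features` are placeholders chosen for this example.

```python
def split_into_segments(timestamps, knot_interval, min_features):
    """Return (a_m, b_m) time spans of the retained trajectory segments."""
    segments = []
    current = []
    for t in sorted(timestamps):
        # A gap longer than the knot interval closes the current segment.
        if current and t - current[-1] > knot_interval:
            segments.append(current)
            current = []
        current.append(t)
    if current:
        segments.append(current)
    # Isolated features and short segments are dropped.
    return [(seg[0], seg[-1]) for seg in segments if len(seg) >= min_features]


# Hypothetical feature timestamps in seconds with one isolated detection at 0.50 s.
stamps = [0.00, 0.01, 0.02, 0.03, 0.50, 1.00, 1.01, 1.02]
spans = split_into_segments(stamps, knot_interval=0.05, min_features=3)
```

Each returned span would then receive its own B-spline, optimized only with the features that fall inside it.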

For $\mathcal{L}_{m}$ , the corresponding state vector $\mathcal{X}_{e}$ is:

\mathcal{X}_{e}=[\xi^{1}_{m},\ \xi^{2}_{m},\ \cdots,\ \xi^{N_{m}}_{m},\ K_{e},\ D_{e}]

(17)

where $\xi_{m}^{i}$ is the control point of $\mathcal{L}_{m}$ , $N_{m}$ is the number of control points, and $K_{e}$ and $D_{e}$ are the intrinsic and distortion parameters of the event camera. The corresponding visual residual for the $i$ th feature based on reprojection error is defined as:

\begin{aligned}\mathbf{r}_{e}(\mathcal{X}_{e})=\sum_{k\in\mathcal{K}}\Big[&\pi_{e}\left(\mathbf{R}_{w}^{e}(t_{i}+k\cdot\delta_{t_{i}})P_{i}^{w}+\mathbf{t}_{w}^{e}(t_{i}+k\cdot\delta_{t_{i}})\right)\\ &-\left(\begin{bmatrix}u_{i}^{e}\\ v_{i}^{e}\end{bmatrix}+k\cdot\delta_{t_{i}}\mathbf{V}_{i}\right)\Big]\end{aligned}

(18)

where $\mathbf{R}^{e}_{w}(\cdot)$ and $\mathbf{t}^{e}_{w}(\cdot)$ are derived from the B-spline trajectory using equations (6) and (7), respectively. Since the feature refinement yields a continuous moving ellipse model, the residuals can be constructed by sampling the model at any time. Here, $\mathcal{K}$ denotes the partition of $\delta_{t_{i}}$ , which defines the sampling interval of the feature, as shown in Fig. 6. The function $\pi_{e}(\cdot)$ projects the spatial point $P_{i}^{w}$ onto the “accumulation frame”.

The intrinsic and distortion parameters of the event camera, along with the control points of the splines, are jointly optimized by minimizing the following cost function:

\arg\min_{\mathcal{X}_{e}}\left\{\sum\rho(\|\mathbf{r}_{e}(\mathcal{X}_{e})\|^{2})\right\}

(19)

where $\rho(\cdot)$ is the Huber loss function.
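A single term of the cost in (19) can be sketched as follows. The projection assumes a pinhole model with two radial distortion coefficients, matching the parameters listed in Table I, and the Huber function is written in terms of the squared residual norm in the form commonly used by nonlinear least-squares solvers; both choices, as well as the function names and the parameter `delta`, are assumptions of this example.

```python
import math


def project(point_cam, fx, fy, cx, cy, k1, k2):
    """Pinhole projection with two radial distortion terms."""
    X, Y, Z = point_cam
    x, y = X / Z, Y / Z
    r2 = x * x + y * y
    d = 1.0 + k1 * r2 + k2 * r2 * r2
    return (fx * x * d + cx, fy * y * d + cy)


def reprojection_residual(point_cam, observed_uv, intrinsics):
    """Projected pixel minus observed pixel."""
    u, v = project(point_cam, *intrinsics)
    return (u - observed_uv[0], v - observed_uv[1])


def huber(s, delta=1.0):
    """Huber loss applied to s = ||r||^2: quadratic near zero, linear in ||r|| beyond delta."""
    if s <= delta * delta:
        return s
    return 2.0 * delta * math.sqrt(s) - delta * delta


# Frame-based (GT) row of Table I: fx, fy, cx, cy, k1, k2.
K_GT = (413.84, 413.80, 157.42, 132.25, -0.38, 0.31)
r = reprojection_residual((0.1, 0.0, 1.0), (198.0, 132.25), K_GT)
cost = huber(r[0] ** 2 + r[1] ** 2)
```

In the full problem the camera-frame point would come from the B-spline pose at the sampled time, and the robust costs of all samples would be summed.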

IV-D Spatiotemporal Calibration

The final step involves jointly optimizing the event camera and the frame-based camera, utilizing the previously optimized trajectories, to determine the extrinsic parameters and time offset. The corresponding state vector $\mathcal{X}_{f}$ is:

\mathcal{X}_{f}=[K_{f},\ D_{f},\ \mathbf{T}^{f}_{e},\ t_{d}]

(20)

where $\mathbf{T}^{f}_{e}$ is the transformation matrix between the two cameras and $t_{d}$ is the difference between the real timestamps of the two cameras, i.e., the time offset.

Similarly, define the visual residuals based on reprojection errors in spatiotemporal calibration as:

\mathbf{r}_{f}(\mathcal{X}_{f})=\pi_{f}\left(\mathbf{R}^{f}_{e}\left(\mathbf{R}^{e}_{w}(t_{i}+t_{d})P^{w}_{i}+\mathbf{t}^{e}_{w}(t_{i}+t_{d})\right)+\mathbf{t}^{f}_{e}\right)-\begin{bmatrix}u^{f}_{i}\\ v^{f}_{i}\end{bmatrix}

(21)

where $\pi_{f}(\cdot)$ projects the spatial points onto the image plane of the frame-based camera.

From (21), the Jacobian $J_{t_{d}}$ of $\mathbf{r}_{f}$ w.r.t $t_{d}$ can be obtained by the chain rule:

J_{t_{d}}=\frac{\partial\mathbf{r}_{f}}{\partial P_{i}^{f}}\frac{\partial P_{i}^{f}}{\partial t_{d}}=\frac{\partial\mathbf{r}_{f}}{\partial P_{i}^{f}}\left(\frac{\partial P_{i}^{f}}{\partial\mathbf{R}_{w}^{e}}\frac{\partial\mathbf{R}_{w}^{e}}{\partial t_{d}}+\frac{\partial P_{i}^{f}}{\partial\mathbf{t}_{w}^{e}}\frac{\partial\mathbf{t}_{w}^{e}}{\partial t_{d}}\right)

(22)

where $P_{i}^{f}=\mathbf{R}^{f}_{e}\left(\mathbf{R}^{e}_{w}(t_{i}+t_{d})P^{w}_{i}+% \mathbf{t}^{e}_{w}(t_{i}+t_{d})\right)+\mathbf{t}^{f}_{e}$ .

Based on (8), (9), and (22), the structure of $\partial P_{i}^{f}/\partial t_{d}$ can be derived straightforwardly:

\frac{\partial P_{i}^{f}}{\partial t_{d}}=\mathbf{R}^{f}_{e}\dot{\mathbf{R}}^{e}_{w}(t_{i}+t_{d})P^{w}_{i}+\mathbf{R}^{f}_{e}\mathbf{v}^{e}_{w}(t_{i}+t_{d})

(23)
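The derivative of $P_{i}^{f}$ with respect to the time offset can be checked numerically on a toy trajectory. In the sketch below the B-spline is replaced by a closed-form motion (constant rotation rate about the $z$ axis and constant linear velocity) so that $\dot{\mathbf{R}}^{e}_{w}$ and $\mathbf{v}^{e}_{w}$ are known exactly; all numerical values and names are hypothetical.

```python
import numpy as np


def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


OMEGA = 0.8                          # assumed rotation rate in rad/s
VEL = np.array([0.2, -0.1, 0.05])    # assumed linear velocity
OMEGA_HAT = np.array([[0.0, -OMEGA, 0.0], [OMEGA, 0.0, 0.0], [0.0, 0.0, 0.0]])

R_FE = rot_z(0.05)                   # assumed extrinsic rotation
T_FE = np.array([0.06, 0.0, 0.0])    # assumed extrinsic translation


def R_we(t):
    return rot_z(OMEGA * t)


def t_we(t):
    return np.array([0.1, 0.0, 1.0]) + VEL * t


def R_we_dot(t):
    """Time derivative of the rotation, in the form R * (omega)^ as in Eq. (8)."""
    return R_we(t) @ OMEGA_HAT


def point_in_frame_camera(P_w, t_i, t_d):
    t = t_i + t_d
    return R_FE @ (R_we(t) @ P_w + t_we(t)) + T_FE


def dP_dtd(P_w, t_i, t_d):
    """Analytic derivative: rotational term plus translational term."""
    t = t_i + t_d
    return R_FE @ (R_we_dot(t) @ P_w) + R_FE @ VEL


P_w = np.array([0.3, -0.2, 0.0])
h = 1e-6
numeric = (point_in_frame_camera(P_w, 1.0, 0.004 + h)
           - point_in_frame_camera(P_w, 1.0, 0.004 - h)) / (2 * h)
analytic = dP_dtd(P_w, 1.0, 0.004)
```

The two results agree to within finite-difference accuracy; multiplying the analytic term by the Jacobian of the projection $\pi_{f}$ gives $J_{t_{d}}$ in (22).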

To jointly optimize the intrinsic and extrinsic parameters of the frame-based camera, as well as the time offset, the following cost function is minimized:

\arg\min_{\mathcal{X}_{f}}\left\{\sum\rho(\|\mathbf{r}_{f}(\mathcal{X}_{f})\|^{2})\right\}

(24)

Notably, the optimization problems in (15), (19), and (24) are solved using Google Ceres (https://ceres-solver.org/).

V Experiments

In this section, several experiments are conducted to evaluate the performance of EF-Calib, encompassing an intrinsic calibration test, an extrinsic calibration test, and a time offset calibration test. Additionally, an ablation study is conducted to evaluate the contribution of several key modules within EF-Calib.

V-A System Setup

A real-world stereo vision system was built, as shown in Fig. 1(b). It contains an event camera and a frame-based camera. The two cameras are mounted on a slide, on which the baseline and viewing angle can be changed arbitrarily to test the calibration performance of EF-Calib comprehensively in different situations. The event camera utilized in this letter is the iniVation DAVIS 346, featuring a resolution of 346 $\times$ 260 and a maximum temporal resolution of 1 µs. Additionally, this type of camera can also generate regular frames at a frequency of 30 Hz under standard illumination conditions. This allows EF-Calib to be readily compared with a high-quality, frame-based calibration pipeline, such as the OpenCV calibration toolbox [29]. The frame-based camera employed is the Hikvision MV-CE013-80UM industrial camera with a global shutter and a resolution of 1280 $\times$ 1024 pixels. Note that no hardware synchronization was utilized in the stereo vision system. This deliberate choice was made to provide a more rigorous evaluation of the calibration capability of EF-Calib under real-world conditions.

V-B Calibration Experiments

The calibration performance of a stereo vision system is usually affected by the camera baseline and viewing angle. To fully evaluate the calibration performance of EF-Calib, we conducted calibration experiments in three settings and analyzed the corresponding calibration results separately. In the first setting (Trial 1), the cameras are configured with a regular baseline. In the second setting (Trial 2), the cameras are configured with a wide baseline. In the third setting (Trial 3), the cameras are configured with a narrow baseline. For each trial, the cameras are adjusted to obtain a reasonable viewing angle, ensuring sufficient overlap of the camera fields of view, and frames and events are recorded simultaneously for a sufficient duration (about 40 s) to achieve converged calibration results.

V-B1 Intrinsic Calibration Test

We utilized only the event stream data from the above three trials to perform the intrinsic calibration of the event camera for each trial. Beforehand, we performed the intrinsic calibration using the OpenCV toolbox [29] with the frames provided by the DAVIS 346 and treated this calibration result as the ground truth. In addition, EF-Calib is compared with two state-of-the-art event camera intrinsic calibration methods [17, 20]. Note that the calibration patterns used by the compared methods are the ones originally employed by them: [17] utilizes a checkerboard pattern, while [20] employs an asymmetric circular pattern.

Table I shows the intrinsic calibration results of each method. The intrinsic parameters obtained by our method are closest to the ground truth (GT), and the results of the three trials are consistent with one another. In addition, Fig. 7 plots the intrinsic parameters over time. EF-Calib converges in less than 20 s, demonstrating the ease of use of our method.

TABLE I: Comparative Results of Intrinsic Calibration Test

Methods	$f_{x}$	$f_{y}$	$c_{x}$	$c_{y}$	$k_{\text{1}}$	$k_{\text{2}}$	RPE
Frame-based (GT)	413.84	413.80	157.42	132.25	-0.38	0.31	0.13
E2Calib [17]	417.56	417.24	159.86	132.35	-0.36	0.09	0.41
E-Calib [20]	404.66	403.99	159.81	132.69	-0.37	0.32	0.33
EF-Calib (Trial 1)	414.06	413.28	158.03	132.43	-0.38	0.31	0.10
EF-Calib (Trial 2)	414.79	413.85	158.74	131.90	-0.38	0.34	0.12
EF-Calib (Trial 3)	414.87	413.94	159.16	133.65	-0.37	0.31	0.13

V-B2 Extrinsic Calibration Test

To evaluate the extrinsic calibration performance of the EF-Calib, we calculated the errors corresponding to rotation and translation separately for each trial, i.e.

\begin{aligned}e_{t}&=\frac{1}{N}\sum_{i}^{N}\left\|\mathbf{t}^{e}_{w}(t_{i}+t_{d})-\mathbf{T}_{f}^{e}{\mathbf{t}^{f}_{w}}_{i}\right\|_{2}\\ e_{r}&=\frac{1}{N}\sum_{i}^{N}\left\|\boldsymbol{\theta}(\mathbf{R}_{w}^{e}(t_{i}+t_{d}))-\boldsymbol{\theta}(\mathbf{R}_{f}^{e}{\mathbf{R}_{w}^{f}}_{i})\right\|_{2}\end{aligned}

(25)

where $N$ is the number of frames and $\boldsymbol{\theta}(\cdot)$ represents the Euler angles corresponding to the rotation matrix. As in the intrinsic calibration test, we also acquired 30 pairs of images of a checkerboard calibration board, captured simultaneously by the two cameras in different poses. These frames were calibrated using the OpenCV toolbox to obtain the corresponding extrinsic parameters for both cameras. The computed EF-Calib calibration errors were compared with the corresponding errors of the frame-based extrinsic parameter calibration. From Table II, it can be seen that EF-Calib achieves errors of the same order as the frame-based extrinsic calibration (sub-millimeter in translation and below $0.35^{\circ}$ in rotation in all trials), verifying its effectiveness in extrinsic calibration.
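The two error metrics can be sketched as follows, given per-frame translations and rotations that have already been expressed in a common frame. The Euler angle convention is not stated in the text, so the ZYX (yaw, pitch, roll) extraction used here is an assumption, as are the function names.

```python
import numpy as np


def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def euler_zyx(R):
    """Roll, pitch, yaw of R = Rz(yaw) Ry(pitch) Rx(roll), away from gimbal lock."""
    yaw = np.arctan2(R[1, 0], R[0, 0])
    pitch = np.arctan2(-R[2, 0], np.hypot(R[2, 1], R[2, 2]))
    roll = np.arctan2(R[2, 1], R[2, 2])
    return np.array([roll, pitch, yaw])


def translation_error(t_spline, t_ref):
    """Mean Euclidean distance between paired translations (same unit as input)."""
    diff = np.asarray(t_spline, dtype=float) - np.asarray(t_ref, dtype=float)
    return float(np.mean(np.linalg.norm(diff, axis=1)))


def rotation_error_deg(R_spline, R_ref):
    """Mean norm of the Euler angle differences, reported in degrees."""
    diffs = [np.linalg.norm(euler_zyx(Ra) - euler_zyx(Rb))
             for Ra, Rb in zip(R_spline, R_ref)]
    return float(np.degrees(np.mean(diffs)))


# Toy data: translations offset by (3, 4, 0) and rotations offset by 1 degree of yaw.
t_a = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
t_b = t_a + np.array([3.0, 4.0, 0.0])
R_a = [rot_z(0.1), rot_z(0.2)]
R_b = [rot_z(0.1 + np.radians(1.0)), rot_z(0.2 + np.radians(1.0))]
e_t = translation_error(t_a, t_b)
e_r = rotation_error_deg(R_a, R_b)
```

In the toy data the translation error is 5 units and the rotation error is 1 degree.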

TABLE II: Comparative Results of Extrinsic Calibration Test

Trial	Method	$e_{t}$ (mm)	$e_{r}$ ($^{\circ}$)	Frames
Trial 1	Frame-based	0.3499	0.0918	30
Trial 1	EF-Calib	0.5336	0.1984	250
Trial 2	Frame-based	0.4199	0.1854	30
Trial 2	EF-Calib	0.6572	0.3127	207
Trial 3	Frame-based	0.2541	0.0954	30
Trial 3	EF-Calib	0.3638	0.2911	184

V-B3 Time Offset Calibration Test

Time offset estimation is crucial for multi-camera systems that lack hardware triggering. In this test, we evaluate EF-Calib’s capability to calibrate the time offset by manually modifying the timestamp of each image frame. The timestamps are uniformly delayed or advanced by a fixed interval $\Delta t_{d}$, and the time offset estimated from the modified timestamps is compared with the original estimate. Specifically, the timestamps were modified by $\pm 2.5$ ms and $\pm 5$ ms for each trial, and the difference between the modified and original time offset estimates was compared with $\Delta t_{d}$. Fig. 8 illustrates the results of the time offset calibration test, which show that EF-Calib accurately recovers the time offset in the different scenarios.
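
The logic of this test can be illustrated with a toy one-dimensional example. The sketch below is not the EF-Calib optimizer: a hypothetical smooth signal stands in for the continuous-time trajectory, the offset is found by a simple grid search instead of nonlinear least squares, and the sign convention (a frame timestamp $t_{i}$ is matched to the trajectory at $t_{i}+t_{d}$, so shifting the timestamps by $\Delta t_{d}$ changes the estimate by $-\Delta t_{d}$) is an assumption based on the error definition above.

```python
import math

def trajectory(t):
    """Hypothetical smooth 1D signal standing in for the continuous-time pose."""
    return math.sin(2.0 * math.pi * 0.5 * t) + 0.3 * math.sin(2.0 * math.pi * 1.3 * t)

def estimate_offset(stamps, measurements, search=0.02, step=1e-4):
    """Grid search for the t_d that minimizes
    sum_i (trajectory(t_i + t_d) - z_i)^2 over [-search, search] seconds."""
    best_td, best_cost = 0.0, float("inf")
    n = int(round(2 * search / step))
    for k in range(n + 1):
        td = -search + k * step
        cost = sum((trajectory(t + td) - z) ** 2
                   for t, z in zip(stamps, measurements))
        if cost < best_cost:
            best_td, best_cost = td, cost
    return best_td

true_td = 0.004                                   # hypothetical true offset (s)
stamps = [0.05 * i for i in range(60)]            # frame timestamps (s)
measurements = [trajectory(t + true_td) for t in stamps]

td_original = estimate_offset(stamps, measurements)

# Shift every frame timestamp by a fixed delta and re-estimate the offset.
recovered = {}
for delta in (-0.005, -0.0025, 0.0025, 0.005):
    shifted = [t + delta for t in stamps]
    td_modified = estimate_offset(shifted, measurements)
    recovered[delta] = td_original - td_modified  # should match delta
    print(f"injected {delta * 1e3:+.1f} ms, recovered {recovered[delta] * 1e3:+.3f} ms")
```

In this noise-free setting the recovered shift matches the injected one up to the grid resolution; with real data the residual between the two is the quantity of interest.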

TABLE III: Ablation Study of EF-Calib

Piece-wise trajectory	Temporal calibration	Feature refinement	Trial 1 $e_{t}$ (mm)	Trial 1 $e_{r}$ ($^{\circ}$)	Trial 2 $e_{t}$ (mm)	Trial 2 $e_{r}$ ($^{\circ}$)	Trial 3 $e_{t}$ (mm)	Trial 3 $e_{r}$ ($^{\circ}$)
			8.1633	2.8632	4.6642	2.6372	1.2161	1.4511
✔			8.0019	2.8187	3.8794	2.7859	0.4974	1.4322
✔	✔		0.9141	5.6910	0.6322	0.3239	0.3707	0.3065
✔	✔	✔	0.5336	0.1984	0.6572	0.3127	0.3638	0.2911

V-C Ablation Study

To analyze and validate the contribution of each module within EF-Calib, we conducted an ablation study on three modules: piece-wise trajectories, temporal calibration, and feature refinement. Table III reports the impact of introducing these modules on the extrinsic calibration error of EF-Calib. With all three modules enabled, the extrinsic calibration error is greatly reduced relative to the baseline without them; in Trial 1, for example, $e_{t}$ drops from 8.1633 to 0.5336 and $e_{r}$ from 2.8632 to 0.1984.

VI Conclusion

In this letter, we propose a spatiotemporal calibration framework called EF-Calib, which jointly calibrates the intrinsic parameters, extrinsic parameters, and time offset of event- and frame-based cameras. First, we design a novel calibration pattern that accommodates the heterogeneous nature of event and frame-based representations, incorporating both circles and crosspoints to facilitate simultaneous recognition by both camera types. A corresponding event-based recognition algorithm is developed to ensure robust and accurate feature recovery using this pattern. Additionally, to handle the asynchronous characteristics of the event stream, we introduce a piece-wise B-spline to continuously represent the pose trajectory of the event camera. Finally, we provide the analytic Jacobian of the error term and implement the joint calibration of intrinsic parameters, extrinsic parameters, and time offset for both camera types. Experimental results demonstrate that EF-Calib outperforms current state-of-the-art methods in intrinsic parameter estimation while also achieving high accuracy in extrinsic parameter and time offset estimation. These results demonstrate the spatiotemporal calibration capabilities of EF-Calib and lay a robust foundation for the fusion of events and frames.

In the future, we aim to explore markerless online calibration based on EF-Calib. Additionally, we plan to utilize EF-Calib to create novel visual perception frameworks that fuse events and frames.

References

[1] G. Gallego, T. Delbruck, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. Davison, J. Conradt, and K. Daniilidis, “Event-based vision: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 1, pp. 154–180, Jan. 2022.
[2] S. Zhu, Z. Tang, M. Yang, E. Learned-Miller, and D. Kim, “Event camera-based visual odometry for dynamic motion tracking of a legged robot using adaptive time surface,” in Proc. IEEE/RSJ Int. Conf. Intell. Rob. Syst., Detroit, MI, USA, 2023, pp. 3475–3482.
[3] Z. Zhou, Z. Wu, R. Boutteau, F. Yang, C. Demonceaux, and D. Ginhac, “RGB-Event fusion for moving object detection in autonomous driving,” in Proc. IEEE Int. Conf. Robot. Autom., London, United Kingdom, Jul. 2023, pp. 7808–7815.
[4] J. Jiang, J. Li, B. Zhang, X. Deng, and B. Shi, “EvHandPose: Event-based 3D hand pose estimation with sparse supervision,” IEEE Trans. Pattern Anal. Mach. Intell., early access, doi: 10.1109/TPAMI.2024.3380648, 2024.
[5] J. Han, Y. Yang, P. Duan, C. Zhou, L. Ma, C. Xu, T. Huang, I. Sato, and B. Shi, “Hybrid high dynamic range imaging fusing neuromorphic and conventional images,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 7, pp. 8553–8565, Jul. 2023.
[6] A. R. Vidal, H. Rebecq, T. Horstschaefer, and D. Scaramuzza, “Ultimate SLAM? Combining events, images, and IMU for robust visual SLAM in HDR and high-speed scenarios,” IEEE Robot. Automat. Lett., vol. 3, no. 2, pp. 994–1001, Apr. 2018.
[7] M. S. Lee, J. H. Jung, Y. J. Kim, and C. G. Park, “Event- and frame-based visual-inertial odometry with adaptive filtering based on 8-DOF warping uncertainty,” IEEE Robot. Automat. Lett., vol. 9, no. 2, pp. 1003–1010, Feb. 2024.
[8] W. Guan, P. Chen, Y. Xie, and P. Lu, “PL-EVIO: Robust monocular event-based visual inertial odometry with point and line features,” IEEE Trans. Automat. Sci. Eng., early access, doi: 10.1109/TASE.2023.3324365, 2023.
[9] C. Luo, J. Wu, S. Sun, and P. Ren, “TransCODNet: Underwater transparently camouflaged object detection via RGB and event frames collaboration,” IEEE Robot. Automat. Lett., vol. 9, no. 2, pp. 1444–1451, Feb. 2024.
[10] P. Chen, W. Guan, F. Huang, Y. Zhong, W. Wen, L. Hsu, and P. Lu, “ECMD: An event-centric multisensory driving dataset for SLAM,” IEEE Trans. Intell. Veh., vol. 9, no. 1, pp. 407–416, Jan. 2024.
[11] C. Creß, W. Zimmer, N. Purschke, B. N. Doan, S. Kirchner, V. Lakshminarasimhan, L. Strand, and A. Knoll, “TUMTraf event: Calibration and fusion resulting in a dataset for roadside event-based and RGB cameras,” IEEE Trans. Intell. Veh., early access, doi: 10.1109/TIV.2024.3393749, 2024.
[12] L. Gao, Y. Liang, J. Yang, S. Wu, C. Wang, J. Chen, and L. Kneip, “VECtor: A versatile event-centric benchmark for multi-sensor SLAM,” IEEE Robot. Automat. Lett., vol. 7, no. 3, pp. 8217–8224, Jul. 2022.
[13] Z. Zhang, “A flexible new technique for camera calibration,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 11, pp. 1330–1334, Nov. 2000.
[14] “Calibration toolbox by RPG, University of Zurich,” https://github.com/uzh-rpg/rpg_dvs_ros/tree/master/dvs_calibration.
[15] G. Orchard, “Calibration toolbox by G. Orchard,” https://github.com/gorchard/DVScalibration.
[16] “Calibration toolbox by VLOGroup at TU Graz,” https://github.com/VLOGroup/dvs-calibration.
[17] M. Muglikar, M. Gehrig, D. Gehrig, and D. Scaramuzza, “How to calibrate your event camera,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops, Nashville, TN, USA, Jun. 2021, pp. 1403–1409.
[18] H. Rebecq, R. Ranftl, V. Koltun, and D. Scaramuzza, “High speed and high dynamic range video with an event camera,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 6, pp. 1964–1980, Jun. 2021.
[19] K. Huang, Y. Wang, and L. Kneip, “Dynamic event camera calibration,” in Proc. IEEE/RSJ Int. Conf. Intell. Rob. Syst., Prague, Czech Republic, Sep. 2021, pp. 7021–7028.
[20] M. Salah, A. Abdulla, H. Muhammad, G. Daniel, A. Abdelqader, S. Lakmal, S. Davide, and Z. Yahya, “E-Calib: A fast, robust and accurate calibration toolbox for event cameras,” Jun. 2023, arXiv:2306.09078.
[21] J. Rehder, R. Siegwart, and P. Furgale, “A general approach to spatiotemporal calibration in multisensor systems,” IEEE Trans. Robot., vol. 32, no. 2, pp. 383–398, Apr. 2016.
[22] J. Huai, Y. Zhuang, Y. Lin, G. Jozkow, Q. Yuan, and D. Chen, “Continuous-time spatiotemporal calibration of a rolling shutter camera-IMU system,” IEEE Sensors J., vol. 22, no. 8, pp. 7920–7930, Apr. 2022.
[23] E. Mueggler, G. Gallego, H. Rebecq, and D. Scaramuzza, “Continuous-time visual-inertial odometry for event cameras,” IEEE Trans. Robot., vol. 34, no. 6, pp. 1425–1440, Dec. 2018.
[24] A. Patron-Perez, S. Lovegrove, and G. Sibley, “A spline-based trajectory representation for sensor fusion and rolling shutter cameras,” Int. J. Comput. Vis., vol. 113, no. 3, pp. 208–219, Feb. 2015.
[25] K. Qin, “General matrix representations for B-splines,” Visual Comput., vol. 16, no. 3, pp. 177–186, 2000.
[26] C. Sommer, V. Usenko, D. Schubert, N. Demmel, and D. Cremers, “Efficient derivative computation for cumulative B-splines on Lie groups,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Seattle, WA, USA, Jun. 2020, pp. 11145–11153.
[27] S. Wang, M. Zhu, Y. Hu, D. Li, F. Yuan, and J. Yu, “Accurate detection and localization of curved checkerboard-like marker based on quadratic form,” IEEE Trans. Instrum. Meas., vol. 71, pp. 1–11, Jul. 2022.
[28] C. Grana, D. Borghesani, and R. Cucchiara, “Optimized block-based connected components labeling with decision trees,” IEEE Trans. Imag. Process., vol. 19, no. 6, pp. 1596–1609, Jun. 2010.
[29] G. Bradski and A. Kaehler, Learning OpenCV: Computer vision with the OpenCV library. Sebastopol, CA, USA: O’Reilly, 2008, pp. 370–396.
