Efficient Event Stream Super-Resolution with Recursive Multi-Branch Fusion

Quanmin Liang1,2∗    Zhilin Huang2,3∗    Xiawu Zheng2    Feidiao Yang2    Jun Peng2
Kai Huang1†    Yonghong Tian2,4†
1School of Computer Science and Engineering, Sun Yat-Sen University
2Peng Cheng Laboratory
3Shenzhen International Graduate School, Tsinghua University
4Peking University
[email protected], {zerinhwang03, yhtian}@pku.edu.cn, [email protected],
{yangfd, pengj01}@pcl.ac.cn, [email protected]
Abstract

Current Event Stream Super-Resolution (ESR) methods overlook the redundant and complementary information present in positive and negative events within the event stream, employing a direct mixing approach for super-resolution, which may lead to detail loss and inefficiency. To address these issues, we propose an efficient Recursive Multi-Branch Information Fusion Network (RMFNet) that separates positive and negative events for complementary information extraction, followed by mutual supplementation and refinement. Particularly, we introduce Feature Fusion Modules (FFM) and Feature Exchange Modules (FEM). FFM is designed for the fusion of contextual information within neighboring event streams, leveraging the coupling relationship between positive and negative events to reduce the misleading effect of noise in the respective branches. FEM efficiently promotes the fusion and exchange of information between positive and negative branches, enabling superior local information enhancement and global information complementation. Experimental results demonstrate that our approach achieves over 17% and 31% improvement on synthetic and real datasets, respectively, accompanied by a 2.3× acceleration. Furthermore, we evaluate our method on two downstream event-driven applications, i.e., object recognition and video reconstruction, achieving remarkable results that outperform existing methods. Our code and Supplementary Material are available at https://github.com/Lqm26/RMFNet.

1 Introduction

Event cameras are biologically inspired asynchronous sensors Brandli et al. (2014). Unlike traditional cameras, event cameras register only the changes in brightness for each pixel over time. These are known as “events”, which are categorized as positive or negative, depending on whether the brightness increases or decreases, respectively. This characteristic significantly reduces the amount of recorded information, resulting in advantages such as high temporal resolution, low power consumption, and a high dynamic range (HDR) Gallego et al. (2020). However, as the application scenarios become more complex, the spatial resolution of existing event cameras is insufficient Li et al. (2021). Increasing spatial resolution at the hardware level presents challenges in implementing asynchronous circuits Gallego et al. (2020), making it difficult to maintain the low power consumption and high temporal resolution advantages of event cameras Weng et al. (2022); Gehrig and Scaramuzza (2022). Therefore, some researchers propose to address this issue at the software level, e.g. by leveraging advanced algorithms, which is referred to as Event Stream Super-Resolution (ESR).

Figure 1: Compared to previous ESR methods that directly mix positive and negative events, our multi-branch approach effectively extracts and integrates features from positive and negative events, recovering more complete and clearer details (see the green box).

Current research on ESR can be mainly divided into two directions. One approach aims to directly generate high-resolution event data from low-resolution event streams by spiking neural networks Li et al. (2019a, 2021) or frame-assisted methods Wang et al. (2020b). However, these methods often require significant memory  Li et al. (2019a, 2021) and high-quality images as assistance Wang et al. (2020b), which complicates the training process and hinders achieving large-factor super-resolution. Hence, researchers have proposed stacking event streams into either event frames Rebecq et al. (2017) or event count images Maqueda et al. (2018); Zhu et al. (2018) and subsequently applying learning-based methods for ESR Duan et al. (2021); Weng et al. (2022). Within event streams, there exist spatiotemporally inconsistent positive and negative events Gehrig et al. (2019). These events do not perfectly align on a 2D plane but contain complementary information. Merging them into an event frame results in partial cancellation between positive and negative events, forming a new representation. As positive and negative events typically do not occur independently, the event frame helps filter out some naturally occurring noise in the event stream. Consequently, positive events, negative events, and event frames each contain different information about the event stream. However, previous methods did not effectively distinguish and fully utilize this information. They simply mix them and input them into the ESR model, leading to the loss of fine details in the super-resolved (SR) event stream (Figure 1(a)).

To address these issues, we propose an efficient Recursive Multi-Branch Information Fusion Network (RMFNet). As illustrated in Figure 1(b), this network processes positive events, negative events, and event frames in a multi-branch fashion. Positive and negative events contain the majority of information in the event stream, while the event frame provides guidance for filtering noise. Therefore, we design a Feature Fusion Module (FFM) to highlight the valuable information in the positive and negative streams according to the event frame at the initial stage. Specifically, this module calculates attention weight maps from features of different branches, facilitating the fusion of contextual information and aiding the positive and negative branches in noise removal. Subsequently, the positive and negative branches conduct feature extraction for positive and negative events, respectively, employing a Feature Exchange Module (FEM) for adaptive fusion and exchange. By capturing complementary information and long-range dependencies between positive and negative events, FEM improves the integration and exchange of information across different branches.

The main contributions of our work are as follows:

  • We introduce an efficient Recursive Multi-Branch Information Fusion Network capable of effectively merging positive events, negative events, and event frames, thereby obtaining high-quality SR event images.

  • We design Feature Fusion Modules and Feature Exchange Modules, which enhance the positive and negative event streams while effectively fusing and complementing information across different branches.

  • We explore the impact of existing data augmentation methods on ESR tasks and propose an effective data augmentation strategy to enhance the model’s robustness and performance.

  • Our method achieves over 17% and 31% improvement on synthetic and real datasets, respectively, accompanied by a 2.3× acceleration. In downstream event-based recognition and reconstruction tasks, our method effectively enhances performance, further validating the effectiveness of our approach.

2 Related Work

Event Stream Super-resolution. Due to the unique spatio-temporal characteristics of event streams, event stream super-resolution (ESR) tasks are often more challenging. Initially, Li et al. (2019a) introduced the Event Count Map (ECM) as a method to describe event spatial distribution. They established a spatiotemporal filter to generate a time-rate function and employed a non-homogeneous Poisson distribution to model events on each pixel. However, this approach encounters inaccuracies in estimating spatiotemporal distributions when performing high-factor super-resolution. To address this issue, Wang et al. (2020b) proposed a novel optimization framework called GEF, which utilizes motion correlation probabilities to filter event noise. The optimization maximizes the structural correspondence between low-resolution events and high-resolution image signals, facilitating event stream super-resolution in conjunction with image frames. Despite performing well in certain scenarios, the GEF method exhibits performance degradation when image frame quality deteriorates. Building upon this, Li et al. (2021) proposed a spatio-temporal constraint learning method based on the characteristics of spiking neural networks (SNNs) to simultaneously learn temporal and spatial features in event streams. On the other hand, Duan et al. (2021) transformed event streams into a 2D event frame format and designed a 3D U-Net-based network for ESR. While both methods demonstrated excellent performance in small-scale super-resolution tasks, they faced challenges of excessive memory requirements and training difficulties in large-factor super-resolution. To effectively address the challenges of large-factor super-resolution, Weng et al. (2022) introduced an event-based super-resolution method based on Recurrent Neural Networks. They initially transformed event streams into coarse-grained high-resolution event streams using coordinate relocation, followed by super-resolution through recurrent networks. This approach not only handles high-factor super-resolution effectively but also mitigates training challenges posed by excessive memory requirements.

However, these methods did not account for the spatiotemporal inconsistencies and complementarities between positive and negative events in the event stream. Directly mixing them may lead to the loss of details. Therefore, we propose RMFNet, employing a multi-branch approach to mutually fuse and complement positive and negative events, which effectively enhances the performance of ESR.

3 Method

Figure 2: Architecture of our proposed Recursive Multi-Branch Information Fusion Network (RMFNet). Initially, the event frame is fused into positive and negative branches along with the previous output $O_{t-1}$ and state $h_{t-1}$ using the Feature Fusion Module (bottom left). Subsequently, each branch independently extracts features through Residual Blocks, and a Feature Exchange Module (bottom) facilitates the exchange of information between the branches. Finally, the features from the positive and negative branches are concatenated, and high-resolution event count images $O_t$ are obtained through a Pixel Shuffle operation.

In this section, we first introduce the data representation methods for event cameras in Section 3.1. Subsequently, we present the proposed Recursive Multi-Branch Information Fusion Network in Section 3.2. Finally, we describe the data augmentation methods for ESR in Section 3.3.

3.1 Event Data Representation

A set of event streams can be represented as $\mathcal{E}=\{e_k\}_{k=1}^{N}$, where $N$ is the number of events. Each event $e_k\in\mathcal{E}$ is denoted by a tuple $(x_k,y_k,t_k,p_k)$, representing its spatial coordinates, timestamp, and polarity, respectively. We partition $\{e_k\}_{k=1}^{N}$ into positive events $\{e_k\}_{k=1}^{N_p}$ and negative events $\{e_k\}_{k=1}^{N_n}$ based on their polarity $p_k=\pm 1$. We then stack $\{e_k\}_{k=1}^{N_p}$ and $\{e_k\}_{k=1}^{N_n}$ into event count images Maqueda et al. (2018); Zhu et al. (2018) according to the following equation:

$$h(x,y)=\sum_{e_k\in\mathcal{E}}\delta\left(x-x_k,\;y-y_k\right) \tag{1}$$

where $\delta$ represents the Kronecker delta. Thus, we can build two event count images from $\{e_k\}_{k=1}^{N_p}$ and $\{e_k\}_{k=1}^{N_n}$: the positive image $\mathbf{p}_t\in\mathbb{R}^{H\times W}$ and the negative image $\mathbf{n}_t\in\mathbb{R}^{H\times W}$. The event frame Rebecq et al. (2017) is obtained by stacking all events (including positive and negative events) using equation (1), resulting in $\mathbf{f}_t\in\mathbb{R}^{H\times W}$.
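The stacking described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names, the `(N, 4)` array layout with columns `(x, y, t, p)`, and the toy events are assumptions. The event frame is built by applying Eq. (1) to all events, as the text states; whether polarity sign enters the sum is not spelled out in this section, so the sketch uses the unsigned count.

```python
import numpy as np

def event_count_image(x, y, height, width):
    """Eq. (1): count how many events fall on each pixel (x, y)."""
    img = np.zeros((height, width), dtype=np.float32)
    # np.add.at accumulates correctly when the same pixel occurs several times.
    np.add.at(img, (y, x), 1.0)
    return img

def stack_events(events, height, width):
    """events: (N, 4) array with columns (x, y, t, p), p in {+1, -1} (assumed layout)."""
    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    pos = events[:, 3] > 0
    p_t = event_count_image(x[pos], y[pos], height, width)    # positive count image
    n_t = event_count_image(x[~pos], y[~pos], height, width)  # negative count image
    f_t = event_count_image(x, y, height, width)              # all events via Eq. (1)
    return p_t, n_t, f_t

# Toy stream: five events on a 2 x 3 sensor (hypothetical values).
events = np.array([
    [0, 0, 0.10, +1],
    [0, 0, 0.20, +1],
    [1, 0, 0.25, -1],
    [2, 1, 0.30, +1],
    [2, 1, 0.35, -1],
])
p_t, n_t, f_t = stack_events(events, height=2, width=3)
```

Under this unsigned reading, the event frame equals the sum of the two count images; a signed variant in which opposite polarities cancel would be a different design choice.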

3.2 Multi-Branch Fusion Networks

The framework of our proposed RMFNet is depicted in Figure 2. The main inputs of this network include positive events, negative events, and event frames. Additionally, following a recursive approach Schuster and Paliwal (1997), we introduce the previous output $O_{t-1}$ and state $h_{t-1}$ into the input, aiding in better capturing features from adjacent event streams and achieving contextual fusion of event stream information. Given that positive and negative events contain the majority of information in the event stream, we process them separately through dedicated positive and negative branches. The event frame, serving as a coupled representation of positive and negative events, assists in filtering out noise (as positive and negative events do not occur independently). Thus, in the initial stage of RMFNet, we fuse $\mathbf{f}_t$ with $O_{t-1}$ and $h_{t-1}$ as event-enhanced information $F_{Enh}\in\mathbb{R}^{C\times H\times W}$, which is passed to the positive and negative branches through the Feature Fusion Module (FFM). The positive and negative branches utilize Residual Blocks He et al. (2016) as the backbone for feature extraction from the fused positive and negative events, respectively. Subsequently, a Feature Exchange Module (FEM) is employed to facilitate the fusion and exchange of information between the positive and negative branches. Finally, the features from positive and negative events are concatenated, outputting the hidden state $h_t$, and the SR event count images $O_t$ are obtained through pixel shuffle Shi et al. (2016).
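To make the final upsampling step concrete, the following is a minimal NumPy sketch of a pixel shuffle operation. It assumes the channel ordering commonly used by deep learning libraries, where input channel $c\,r^2 + i\,r + j$ at location $(h, w)$ is moved to output channel $c$ at location $(hr + i,\; wr + j)$. The tensor sizes and the choice of two output channels are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C * r^2, H, W) array into (C, H * r, W * r)."""
    c_r2, h, w = x.shape
    assert c_r2 % (r * r) == 0, "channel count must be divisible by r^2"
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)    # axes: (c, i, j, h, w)
    x = x.transpose(0, 3, 1, 4, 2)  # axes: (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)

# Hypothetical feature map: 8 channels at 2 x 3, upscaled by r = 2 into 2 channels at 4 x 6.
feat = np.arange(8 * 2 * 3, dtype=np.float32).reshape(8, 2, 3)
out = pixel_shuffle(feat, r=2)
```

Because the operation only rearranges values, all learning happens in the preceding convolutions, which run at low resolution.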

3.2.1 Feature Fusion Module

As depicted in the bottom left corner of Figure 2, the FFM is tasked with transmitting the event-enhanced information $F_{Enh}$ to the positive and negative branches without compromising their distinctive features. We denote the features extracted by convolutional layers for positive and negative events as $F^p_t\in\mathbb{R}^{C\times H\times W}$ and $F^n_t\in\mathbb{R}^{C\times H\times W}$, respectively. Initially, we concatenate $F^p_t$ with $F_{Enh}$, followed by a preliminary fusion through a Basic Block, resulting in $F^{fuse}_t$. The event-enhanced information encompasses details from adjacent event streams and coupling information between positive and negative events, effectively guiding the positive branch in detail recovery. To integrate these features seamlessly, we utilize the fused feature $F^{fuse}_t$ to compute two attention weights. The first is the local attention weight:

$$\mathbf{A}_t^{loc}=BN\left(\boldsymbol{C}_{1\times 1}\left(\boldsymbol{R}\left(BN\left(\boldsymbol{C}_{1\times 1}\left(F_t^{fuse}\right)\right)\right)\right)\right) \tag{2}$$

where $BN$ denotes batch normalization, $\boldsymbol{C}_{1\times 1}$ represents a $1\times 1$ convolution operation, and $\boldsymbol{R}$ represents the ReLU activation function.

The second is the global attention weight $\mathbf{A}_t^{glo}\in\mathbb{R}^{C\times 1\times 1}$, which is computed channel-wise. Specifically, we apply global average pooling to $F^{fuse}_t$ along the spatial dimensions:

$$\mathbf{A}_t^{glo}=\boldsymbol{f}^{att}\left(GAP\left(F_t^{fuse}\right)\right) \tag{3}$$

where $\boldsymbol{f}^{att}$ represents the function given in equation (2), and $GAP$ stands for global average pooling.

Finally, we combine the global and local attention, apply it to $F^{fuse}_t$, and add the result to the previous features of positive events $F^p_t$, thereby integrating the event-enhanced information into the positive branch:

$$F_t^{out}=F_t^{p}+F_t^{fuse}\otimes\left(\sigma\left(\mathbf{A}_t^{glo}\oplus\mathbf{A}_t^{loc}\right)\right) \tag{4}$$

where $\otimes$ represents the element-wise product, $\sigma$ denotes the sigmoid activation function, and $\oplus$ signifies broadcasting addition. Our network is entirely symmetric with respect to the positive and negative branches, so the negative branch follows the same process.
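The attention computation in Eqs. (2) to (4) can be sketched in NumPy as below. This is an illustrative sketch under several assumptions, not the released code: batch normalization is modeled in inference mode with identity statistics, the local and global paths use separate weights (the text only says they share the same functional form), the bottleneck width `C_MID` is hypothetical, and random tensors stand in for $F^p_t$ and for the Basic Block output $F^{fuse}_t$.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    # x: (C_in, H, W), w: (C_out, C_in); a 1x1 convolution is a per-pixel linear map.
    return np.einsum("oc,chw->ohw", w, x)

def bn_eval(x, eps=1e-5):
    # Assumption: inference-mode BN with running mean 0, variance 1, gamma 1, beta 0.
    return x / np.sqrt(1.0 + eps)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def f_att(x, w1, w2):
    # Functional form of Eq. (2): BN(C1x1(ReLU(BN(C1x1(x))))).
    return bn_eval(conv1x1(relu(bn_eval(conv1x1(x, w1))), w2))

def ffm_fuse(f_p, f_fuse, params):
    a_loc = f_att(f_fuse, *params["loc"])          # Eq. (2): (C, H, W)
    gap = f_fuse.mean(axis=(1, 2), keepdims=True)  # global average pooling: (C, 1, 1)
    a_glo = f_att(gap, *params["glo"])             # Eq. (3): (C, 1, 1)
    gate = sigmoid(a_glo + a_loc)                  # broadcasting addition, then sigmoid
    return f_p + f_fuse * gate                     # Eq. (4)

C, H, W, C_MID = 8, 4, 5, 4  # hypothetical sizes
params = {
    "loc": (rng.standard_normal((C_MID, C)), rng.standard_normal((C, C_MID))),
    "glo": (rng.standard_normal((C_MID, C)), rng.standard_normal((C, C_MID))),
}
f_p = rng.standard_normal((C, H, W))     # stand-in for F^p_t
f_fuse = rng.standard_normal((C, H, W))  # stand-in for F^fuse_t
f_out = ffm_fuse(f_p, f_fuse, params)
```

Since the gate lies in (0, 1), the fused feature can only be attenuated before it is added, so the residual path keeps $F^p_t$ intact, which matches the stated goal of not compromising the branch's own features.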

3.2.2 Feature Exchange Module

Considering the complementary and redundant information present in positive and negative events, directly integrating feature information from these branches may have adverse effects. To address this, we introduce a Feature Exchange Module (depicted at the bottom of Figure 2), which utilizes attention mechanisms to automatically select and enhance crucial features, facilitating efficient information exchange between the two branches.

Firstly, to reduce the redundancy in individual branch features and emphasize important features, we apply spatial attention separately to both branches:

$$\tilde{\mathbf{F}}_t=Conv_{basic}\left(F_t^{in}\right) \tag{5}$$

$$\tilde{\mathbf{F}}_t^{P}=Conv\left(\tilde{\mathbf{F}}_t\right)\otimes\tilde{\mathbf{F}}_t+Conv\left(\tilde{\mathbf{F}}_t\right) \tag{6}$$

where $F_t^{in}$ represents the input features from the positive and negative branches, $Conv_{basic}$ denotes the Basic Block, and $Conv$ represents the convolutional operation. The two $Conv(\tilde{\mathbf{F}}_t)$ terms in Eq. (6) serve as the weight and bias, respectively, adjusting the weights of branch features. $\tilde{\mathbf{F}}_t^{P}$ is the output of the positive branch, and $\tilde{\mathbf{F}}_t^{N}$ is obtained similarly from the negative branch.
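The weight-and-bias modulation of Eq. (6) can be illustrated with a short NumPy sketch. It assumes that the two $Conv$ terms are separate convolutions with their own parameters (the text says they serve as weight and bias, respectively) and uses $1\times 1$ kernels for brevity; the kernel size is not stated in this section. The Basic Block of Eq. (5) is not reproduced, so a random tensor stands in for $\tilde{\mathbf{F}}_t$, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w, b):
    # x: (C_in, H, W), w: (C_out, C_in), b: (C_out,)
    return np.einsum("oc,chw->ohw", w, x) + b[:, None, None]

def modulate(f_tilde, w_scale, b_scale, w_shift, b_shift):
    """Eq. (6): per-pixel scale and shift predicted from the feature itself."""
    scale = conv1x1(f_tilde, w_scale, b_scale)  # acts as the weight
    shift = conv1x1(f_tilde, w_shift, b_shift)  # acts as the bias
    return scale * f_tilde + shift

C, H, W = 4, 3, 3  # hypothetical sizes
f_tilde = rng.standard_normal((C, H, W))  # stand-in for the Basic Block output
out = modulate(
    f_tilde,
    rng.standard_normal((C, C)), rng.standard_normal(C),
    rng.standard_normal((C, C)), rng.standard_normal(C),
)

# Sanity case: unit scale and zero shift leave the feature unchanged.
identity = modulate(f_tilde, np.zeros((C, C)), np.ones(C), np.zeros((C, C)), np.zeros(C))
```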

Subsequently, inspired by self-attention mechanisms Vaswani et al. (2017); Wang et al. (2018), we design two symmetrical Attention Blocks to capture complementary information from the positive and negative branches. Taking the positive branch as an example, we use $\boldsymbol{C}_{1\times 1}$ to obtain $\mathbf{V}\in\mathbb{R}^{C\times(HW)}$ for the positive branch, and $\mathbf{Q}\in\mathbb{R}^{C_1\times(HW)}$ and $\mathbf{K}\in\mathbb{R}^{C_1\times(HW)}$ for the negative branch. Here, $C$ represents the number of channels, and $C_1$ is set to 1/8 of $C$ for enhanced computational efficiency. Therefore, the output of the positive branch, fused with features from the negative branch, can be represented as:

$$\mathbf{F}_{fuse}^{P}=\mathbf{V}\otimes\left(\sigma\left(\mathbf{Q}^{\mathsf{T}}\otimes\mathbf{K}\right)\right) \tag{7}$$

where $\mathbf{Q}^{\mathsf{T}}$ represents the transpose of $\mathbf{Q}$. Through the two symmetrical attention modules, we can achieve a complementary fusion of features from the positive and negative branches, effectively enhancing the performance of ESR.
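A shape-level sketch of the cross-branch attention in Eq. (7) is given below. It rests on two stated assumptions: the products in Eq. (7) are read as matrix products, since that is the only reading under which the dimensions $C\times(HW)$ and $C_1\times(HW)$ compose, and $\sigma$ is taken as the sigmoid defined for Eq. (4), although a softmax would be the more common choice in self-attention. Weights and inputs are random stand-ins and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1_flat(x, weight):
    # 1x1 convolution followed by flattening: (C_in, H, W) -> (C_out, H*W).
    c_in, h, w = x.shape
    return weight @ x.reshape(c_in, h * w)

def cross_branch_attention(f_pos, f_neg, w_v, w_q, w_k):
    """Positive-branch output fused with negative-branch context, following Eq. (7)."""
    c, h, w = f_pos.shape
    v = conv1x1_flat(f_pos, w_v)  # (C,  HW), from the positive branch
    q = conv1x1_flat(f_neg, w_q)  # (C1, HW), from the negative branch
    k = conv1x1_flat(f_neg, w_k)  # (C1, HW), from the negative branch
    attn = sigmoid(q.T @ k)       # (HW, HW) pairwise affinities between locations
    return (v @ attn).reshape(c, h, w)

C, H, W = 16, 4, 4
C1 = C // 8  # reduced channel count for Q and K, as in the text
w_v = rng.standard_normal((C, C)) * 0.1
w_q = rng.standard_normal((C1, C)) * 0.1
w_k = rng.standard_normal((C1, C)) * 0.1
f_pos = rng.standard_normal((C, H, W))
f_neg = rng.standard_normal((C, H, W))
f_fuse_p = cross_branch_attention(f_pos, f_neg, w_v, w_q, w_k)
```

The $(HW)\times(HW)$ affinity matrix is what gives the module its long-range reach, and reducing $\mathbf{Q}$ and $\mathbf{K}$ to $C_1 = C/8$ channels lowers the cost of forming it. The negative-branch output would follow by swapping the roles of the two inputs.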

3.2.3 Training Objectives

We partition the event stream into multiple sequences of length $T$ for training our method, following the approach of Weng et al. (2022). We set $T=9$ and use Mean Squared Error (MSE) to calculate the loss:

$$\mathcal{L}=\sum_{t=1}^{T}MSE\left(O_t^{SR},\;ECI_t^{HR}\right) \tag{8}$$

where $O_t^{SR}$ represents the event count images of the final SR event stream, $ECI_t^{HR}$ represents the ground truth event count images, and $MSE$ is the mean square error function.
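As a small worked example of Eq. (8), the sketch below sums per-step MSE terms over a sequence of length $T=9$. The array shapes and the constant offset between prediction and target are hypothetical and chosen only so that the result is easy to verify by hand.

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def sequence_loss(sr_seq, hr_seq):
    """Eq. (8): sum of per-step MSE between SR outputs and HR event count images."""
    return sum(mse(o, g) for o, g in zip(sr_seq, hr_seq))

T = 9
hr_seq = [np.full((2, 8, 8), float(t)) for t in range(T)]  # stand-in ground truth
sr_seq = [g + 0.5 for g in hr_seq]                         # every pixel off by 0.5
loss = sequence_loss(sr_seq, hr_seq)                       # 9 * 0.5**2 = 2.25
```

Each step contributes $0.5^2 = 0.25$, so the nine-step total is 2.25. Note that the loss is a sum over time steps rather than a mean, so its scale grows with $T$.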

3.3 Data Augmentation for ESR

Previous research in the field of image or video super-resolution has shown that methods involving operations or augmentations in the pixel space Zhang et al. (2017); Yun et al. (2019) can effectively enhance task performance Yoo et al. (2020), as they preserve the spatial relationships within the images. In the realm of ESR, there is currently a lack of systematic investigation into Event Stream Super-Resolution Data Augmentation (ESRDA). To address this gap, we adapt and refine data augmentation methods from some event stream studies Gu et al. (2021); Barchid et al. (2023) and RGB image domains, exploring the impact of data augmentation on the ESR task. We experiment with the following methods:

  • Polarity flip.

  • RandomFlip Simonyan and Zisserman (2015).

  • Drop by time Gu et al. (2021).

  • Random drop Gu et al. (2021).

  • Drop by area Gu et al. (2021).

  • Random drop or add noise.

  • Static Translation.

  • RandomResizedCrop He et al. (2016).

Regarding the details and parameters for data augmentation operations, please refer to the Supplementary Material.
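To give a sense of how a few of the listed augmentations act on raw events, the following NumPy sketch implements polarity flip, horizontal flip, random drop, and drop by time. The exact procedures and parameters used in the paper are given in its Supplementary Material and are not reproduced here; the function signatures, the `(N, 4)` layout with columns `(x, y, t, p)`, and the ratios are assumptions for illustration only.

```python
import numpy as np

def polarity_flip(events):
    out = events.copy()
    out[:, 3] *= -1  # swap positive and negative events
    return out

def horizontal_flip(events, width):
    out = events.copy()
    out[:, 0] = (width - 1) - out[:, 0]  # mirror the x coordinate
    return out

def random_drop(events, ratio, rng):
    # Each event is dropped independently with probability `ratio`.
    keep = rng.random(len(events)) >= ratio
    return events[keep]

def drop_by_time(events, start_frac, length_frac):
    # Remove all events inside a window covering `length_frac` of the duration.
    t = events[:, 2]
    t0, t1 = t.min(), t.max()
    lo = t0 + start_frac * (t1 - t0)
    hi = lo + length_frac * (t1 - t0)
    return events[(t < lo) | (t >= hi)]

rng = np.random.default_rng(0)
n = 10
idx = np.arange(n)
events = np.stack([
    idx % 4,                             # x in [0, 4)
    idx % 3,                             # y in [0, 3)
    idx * 0.1,                           # timestamps 0.0 ... 0.9
    np.where(idx % 2 == 0, 1.0, -1.0),   # alternating polarity
], axis=1)

flipped = polarity_flip(events)
mirrored = horizontal_flip(events, width=4)
thinned = random_drop(events, ratio=0.3, rng=rng)
windowed = drop_by_time(events, start_frac=0.5, length_frac=0.2)
```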

Figure 3: Qualitative analysis comparison on synthetic and real datasets. The upper and lower figures represent 4× SR results on the NFS-syn and EventNFS datasets, respectively. It is evident that our RMFNet excels in recovering finer details of the event streams on both datasets (see the green box), resulting in sharper edges. Positive events are in blue, negative events in red. Zoom in for the best view.

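| Methods | NFS-syn $2\times$ | NFS-syn $4\times$ | NFS-syn $8\times$ | RGB-syn $2\times$ | RGB-syn $4\times$ | EventNFS-real $2\times$ | EventNFS-real $4\times$ | Param (M) $2\times$ | Param (M) $4\times$ | Param (M) $8\times$ | Inference time (ms) $2\times$ | Inference time (ms) $4\times$ | Inference time (ms) $8\times$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bicubic | 0.616 | 0.531 | 0.545 | 0.1197 | 0.1429 | 0.760 | 0.899 | - | - | - | - | - | - |
| SRFBN | 0.411 | 0.394 | 0.394 | 0.1051 | 0.1010 | 0.415 | 0.545 | 2.1 | 3.6 | 7.9 | 37.3 | 54.8 | 65.4 |
| RSTT | 0.389 | 0.366 | 0.365 | 0.0954 | 0.0909 | 0.310 | 0.399 | 3.8 | 4.1 | 4.3 | 61.4 | 61.1 | 73.0 |
| EventZoom | 0.806 | 1.049 | 1.239 | 0.4462 | 1.2232 | 0.778 | 1.248 | 11.5 | 11.5 | 11.5 | 17.4 | 70.1 | 396.6 |
| RecEvSR | 0.430 | 0.368 | 0.332 | 0.5013 | 0.3360 | 0.376 | 0.449 | 1.8 | 1.8 | 1.8 | 13.2 | 18.9 | 19.2 |
| Ours | 0.316 | 0.300 | 0.305 | 0.0899 | 0.0865 | 0.250 | 0.316 | 3.0 | 3.1 | 3.6 | 7.0 | 7.5 | 7.8 |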

Table 1: Quantitative analysis comparison on real and synthetic datasets. Mean Squared Error ($MSE$) is used as the evaluation metric. Model Parameters (Param) and Inference time are calculated on the NFS-syn dataset. Bold and underline indicate the best and second-best results.

4 Experiments

4.1 Datasets and Training Settings

Obtaining event data is challenging, and the availability of event datasets containing LR-HR pairs at multiple scales is limited. To address this scarcity, similar to many event-based tasks Weng et al. (2022); Rebecq et al. (2019); Wang et al. (2020a), we employed synthetic simulation datasets to enrich our training data. EventNFS Duan et al. (2021) is the first dataset to include LR-HR pairs captured through a designed display-camera system, capturing rapidly displayed images on a monitor. However, due to device resolution limitations, the minimum resolution is $55\times 31$, and there are only $4\times$ data pairs at the maximum. Moreover, the data at the smallest resolution suffers severe degradation due to its low resolution. To overcome these issues, we utilized an event simulator Lin et al. (2022) to transform the NFS dataset Kiani Galoogahi et al. (2017) and RGB-DAVIS dataset Wang et al. (2020b) into event data, resulting in NFS-syn and RGB-syn datasets. We selected these datasets because of their high temporal resolution, which can better simulate real-world event streams. For further details, please refer to the Supplementary Material.

For a fair comparison, we maintained training settings consistent with Weng et al. (2022). We used $MSE$ as the evaluation metric for our models. All experiments were conducted on a Tesla V100 GPU.

4.2 Comparison with State-of-the-Art Models

In this work, we primarily compared our proposed RMFNet with two previous learning-based approaches, EventZoom Duan et al. (2021) and RecEvSR Weng et al. (2022). Other ESR methods Li et al. (2021); Wang et al. (2020b); Li et al. (2019a) either rely on real frames as assistance or are prone to failure in complex scenes, which makes fair comparison difficult. EventZoom, the first learning-based event stream super-resolution method, is difficult and computationally expensive to train for large-scale SR due to its 3D-Unet architecture. To address this, following previous practice Weng et al. (2022), we ran EventZoom-$2\times$ multiple times to obtain results for larger SR factors. Additionally, we include classic image super-resolution methods such as bicubic and SRFBN Li et al. (2019b), as well as a transformer-based video super-resolution method, RSTT Geng et al. (2022), for comparison. We randomly split the real dataset EventNFS for training and testing. To evaluate the model’s generalization, we select a subset of NFS-syn data for $2(4,8)\times$ SR training and then validate on both the NFS-syn and RGB-syn datasets.
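Running a $2\times$ model repeatedly to reach a larger factor can be sketched as below. This is only an illustration of the cascading idea: `upsample_2x_nearest` is a toy nearest-neighbour stand-in for a trained $2\times$ network, and both function names are hypothetical.

```python
def upsample_2x_nearest(image):
    """Toy stand-in for a 2x SR model: nearest-neighbour upsampling of a 2D list."""
    out = []
    for row in image:
        wide = [v for v in row for _ in range(2)]  # duplicate every column
        out.append(wide)
        out.append(list(wide))                     # duplicate every row
    return out


def cascade(model_2x, image, scale):
    """Apply a 2x model repeatedly until the requested power-of-two scale is reached."""
    if scale < 1 or scale & (scale - 1):
        raise ValueError("scale must be a power of two")
    while scale > 1:
        image = model_2x(image)
        scale //= 2
    return image
```

Because each pass consumes the output of the previous one, any error made at an early stage is carried into the later stages, which is consistent with the error accumulation noted for EventZoom in the qualitative analysis.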

Qualitative Analysis Results. As depicted in Figure 3, we present the $4\times$ SR results of various methods on both synthetic and real data (for more results, please refer to the Supplementary Material). It can be observed that traditional image super-resolution methods such as bicubic and SRFBN Li et al. (2019b) struggle to achieve satisfactory visual results in ESR tasks, exhibiting blurry edges and significant detail loss. This may be attributed to the gap between event streams and RGB images. EventZoom Duan et al. (2021), on the other hand, exhibits numerous detail losses, likely due to error accumulation from multiple runs of EventZoom-$2\times$. In comparison, RSTT Geng et al. (2022) and RecEvSR Weng et al. (2022) produce event images of higher quality, yet they still fall short in detail restoration and supplementation. In contrast, our proposed RMFNet can better extract detailed information from the positive and negative event streams and let the two complement each other, resulting in more comprehensive details and clearer edge information.

Quantitative Analysis Results. As shown in Table 1, compared to the previous SOTA ESR method RecEvSR, RMFNet achieves an average MSE improvement of 17.7% and 31.6% on NFS-syn and EventNFS, respectively. On RGB-syn, RecEvSR exhibits fragile generalization, while our RMFNet maintains good generalization with an average MSE improvement of 78%. Additionally, the average inference speed is improved by $2.3\times$. Compared to the video super-resolution method RSTT, RMFNet achieves an average MSE improvement of 17.8% and 20% on NFS-syn and EventNFS, respectively. On RGB-syn, while RSTT maintains good generalization, our RMFNet still outperforms it with a 5.3% improvement. Furthermore, our inference speed is improved by $8.7\times$. These results demonstrate the efficiency and robustness of our proposed RMFNet.
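The averages quoted against RecEvSR can be checked from the Table 1 entries with a short script. The sketch assumes that "average improvement" means the mean over SR scales of the relative MSE reduction, and that the average speed-up is the mean over scales of the inference-time ratio; the text does not spell out the averaging, so this is one plausible reading. The helper names are hypothetical.

```python
# MSE and inference-time values copied from Table 1 (lower is better).
# Lists are ordered by SR scale: 2x, 4x, and 8x where reported.
MSE = {
    "RecEvSR": {"NFS-syn": [0.430, 0.368, 0.332],
                "RGB-syn": [0.5013, 0.3360],
                "EventNFS": [0.376, 0.449]},
    "Ours": {"NFS-syn": [0.316, 0.300, 0.305],
             "RGB-syn": [0.0899, 0.0865],
             "EventNFS": [0.250, 0.316]},
}
TIME_MS = {"RecEvSR": [13.2, 18.9, 19.2], "Ours": [7.0, 7.5, 7.8]}


def mean_relative_reduction(baseline, ours):
    """Mean over scales of (baseline - ours) / baseline, in percent."""
    ratios = [(b - o) / b for b, o in zip(baseline, ours)]
    return 100.0 * sum(ratios) / len(ratios)


def mean_speedup(baseline, ours):
    """Mean over scales of baseline_time / our_time."""
    ratios = [b / o for b, o in zip(baseline, ours)]
    return sum(ratios) / len(ratios)


for dataset in MSE["Ours"]:
    gain = mean_relative_reduction(MSE["RecEvSR"][dataset], MSE["Ours"][dataset])
    print(f"{dataset}: {gain:.1f}% lower MSE than RecEvSR")
print(f"speed-up over RecEvSR: {mean_speedup(TIME_MS['RecEvSR'], TIME_MS['Ours']):.1f}x")
```

Under this reading the script gives about 17.7% (NFS-syn), 78.2% (RGB-syn), and 31.6% (EventNFS), with a speed-up of about 2.3x, in line with the figures above.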

| Method | NFS-syn | RGB-syn | EventNFS |
| --- | --- | --- | --- |
| RMFNet (w/o DA) | 0.304 | 0.0874 | 0.795 |
| Polarity flipping | 0.304 | 0.0865 ↓ | 0.793 ↓ |
| RandomFlip | 0.302 ↓ | 0.0868 ↓ | 0.793 ↓ |
| Drop by time | 0.304 | 0.0873 ↓ | 0.790 ↓ |
| Random drop | 0.306 ↑ | 0.0880 ↑ | 0.785 ↓ |
| Drop by area | 0.306 ↑ | 0.0881 ↑ | 0.783 ↓ |
| Random drop or add noise | 0.303 ↓ | 0.0871 ↓ | 0.786 ↓ |
| Static Translation | - | - | - |
| RandomResizedCrop | 0.330 ↑ | 0.0921 ↑ | 0.799 ↑ |
| Selected DA’s (random) | 0.300 ↓ | 0.0865 ↓ | 0.771 ↓ |

Table 2: Comparison of different data augmentation methods in ESR task. Training is conducted on the NFS-syn dataset, and $4\times$ SR testing is performed on NFS-syn, RGB-syn, and EventNFS datasets.

4.3 Analysis of ESRDA Methods

As shown in Table 2, we compared the impact of different DA methods on our RMFNet for the $4\times$ ESR task. To better highlight the influence of data augmentation methods on the generalization of our model, we only conducted training on NFS-syn and performed testing on NFS-syn, RGB-syn, and EventNFS. It can be observed that Polarity flipping, RandomFlip, and Drop by time all contribute to performance gains in the ESR task. However, Static Translation leads to training instability, while RandomResizedCrop and Drop by area result in a decline in the performance and generalization of our RMFNet. This suggests that altering the relative spatial relationships between events may adversely affect the ESR task. This could be attributed to the sparse and unidimensional nature of event streams, lacking important features such as color and intensity. Therefore, disrupting the relative spatial relationships among event streams significantly impacts the overall structure, introducing additional noise and consequently leading to a decline in model performance.

Random drop only discards a portion of events in the LR event stream, introducing potential biases in model fitting. To address this, we propose Random drop or add noise, where events are not only dropped with a certain probability but noise is also added simultaneously, mitigating this issue and enhancing model robustness. Lastly, inspired by RandAugment Cubuk et al. (2020), we combine Polarity flipping, RandomFlip, Drop by time, and Random drop or add noise into a data augmentation ensemble, from which one augmentation is randomly selected (Selected DA). Experimental results demonstrate that our DA strategy effectively enhances the performance and generalization of our model in the ESR task. For more related experiments, please refer to the Supplementary Material.
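The random-selection strategy can be sketched as follows. This is a minimal illustration, not the released implementation: events are assumed to be tuples `(x, y, t, p)` with polarity `p` in {+1, -1}, the flip is assumed to be horizontal, and the ratios `ratio`, `p_drop`, and `p_noise` are placeholder values (the actual parameters are given in the Supplementary Material). All function names are hypothetical.

```python
import random


def polarity_flip(events, rng, width, height):
    """Swap positive and negative events."""
    return [(x, y, t, -p) for x, y, t, p in events]


def horizontal_flip(events, rng, width, height):
    """Mirror events along the vertical axis (assumed flip direction)."""
    return [(width - 1 - x, y, t, p) for x, y, t, p in events]


def drop_by_time(events, rng, width, height, ratio=0.1):
    """Remove all events inside a randomly placed time window."""
    if not events:
        return []
    t0 = min(e[2] for e in events)
    t1 = max(e[2] for e in events)
    span = (t1 - t0) * ratio
    start = rng.uniform(t0, t1 - span)
    return [e for e in events if not (start <= e[2] <= start + span)]


def random_drop_or_add_noise(events, rng, width, height, p_drop=0.1, p_noise=0.1):
    """Drop each event with probability p_drop and inject uniform noise events."""
    if not events:
        return []
    t0 = min(e[2] for e in events)
    t1 = max(e[2] for e in events)
    out = []
    for e in events:
        if rng.random() >= p_drop:
            out.append(e)
        if rng.random() < p_noise:
            out.append((rng.randrange(width), rng.randrange(height),
                        rng.uniform(t0, t1), rng.choice((1, -1))))
    out.sort(key=lambda e: e[2])  # keep the stream ordered by timestamp
    return out


AUGMENTATIONS = (polarity_flip, horizontal_flip, drop_by_time, random_drop_or_add_noise)


def selected_da(events, rng, width, height):
    """Pick one augmentation uniformly at random and apply it."""
    augmentation = rng.choice(AUGMENTATIONS)
    return augmentation(events, rng, width, height)
```

In a paired LR-HR setting, the flips would have to be applied identically to both streams (with the HR width for the HR stream) so that the pair stays aligned; how each operation treats the HR stream is not specified here and is left out of the sketch.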

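| Model | Multi-Branch | FFM | FEM | NFS-syn | EventNFS |
| --- | --- | --- | --- | --- | --- |
| model#A | | | | 0.329 | 0.347 |
| model#B | ✓ | | | 0.317 | 0.331 |
| model#C | ✓ | | ✓ | 0.309 | 0.322 |
| model#D | ✓ | ✓ | | 0.313 | 0.326 |
| model#E | ✓ | ✓ | ✓ | 0.300 | 0.316 |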

Table 3: Ablation results for different components of our RMFNet.

4.4 Ablation Study

To validate the effectiveness of different components in our proposed RMFNet, we conducted experiments with four different variants and compared the $4\times$ SR results on the NFS-syn and EventNFS datasets.

As shown in Table 3, we compared RMFNet with several variants with different settings: 1) model#A: using a single-branch model, concatenating event images and state $h_t$ at the initial stage, and then inputting them into the model. 2) model#B: discarding FFM and FEM modules, using lateral connections Feichtenhofer et al. (2019); Christoph and Pinz (2016) between branches as an alternative. 3) model#C: discarding the FFM module, using lateral connections between branches as an alternative. 4) model#D: discarding the FEM module, using lateral connections between positive and negative branches as an alternative.

According to the results in Table 3, the multi-branch model significantly outperforms the single-branch model, as it effectively decouples different parts of the event stream, allowing for fine-grained learning of each part’s features. Additionally, the FFM and FEM designed in our model efficiently fuse and exchange features from different branches, promoting information complementarity between positive and negative event streams, outperforming methods that directly mix features from different branches. For more details about model hyperparameter ablation experiments, please refer to the Supplementary Material.

4.5 Event-based Applications

Video Reconstruction

| Methods | $2\times$ SSIM ↑ | $2\times$ LPIPS ↓ | $4\times$ SSIM ↑ | $4\times$ LPIPS ↓ | $8\times$ SSIM ↑ | $8\times$ LPIPS ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| bicubic | 0.568 | 0.395 | 0.609 | 0.522 | 0.598 | 0.545 |
| SRFBN | 0.608 | 0.389 | 0.618 | 0.455 | 0.612 | 0.489 |
| RSTT | 0.627 | 0.359 | 0.639 | 0.424 | 0.622 | 0.472 |
| EventZoom | 0.542 | 0.429 | 0.575 | 0.488 | 0.574 | 0.542 |
| RecEvSR | 0.611 | 0.371 | 0.637 | 0.426 | 0.630 | 0.466 |
| RMFNet | 0.648 | 0.339 | 0.667 | 0.409 | 0.653 | 0.450 |

Object Recognition

| Methods | $2\times$ ACC ↑ | $2\times$ AUC ↑ | $4\times$ ACC ↑ | $4\times$ AUC ↑ | $8\times$ ACC ↑ | $8\times$ AUC ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| bicubic | 56.67 | 57.43 | 56.01 | 56.89 | 49.95 | 50.77 |
| SRFBN | 61.12 | 61.94 | 60.89 | 61.03 | 50.02 | 50.86 |
| RSTT | 63.51 | 63.96 | 63.02 | 64.29 | 52.97 | 54.07 |
| EventZoom | 54.68 | 56.03 | 49.56 | 50.45 | 47.96 | 48.74 |
| RecEvSR | 62.91 | 63.47 | 62.37 | 63.07 | 53.57 | 54.48 |
| RMFNet | 68.75 | 69.56 | 69.52 | 69.80 | 58.16 | 59.05 |
| GT | 85.16 | 84.99 | 93.44 | 93.52 | 94.96 | 94.81 |

Table 4: Quantitative comparison for event-based video reconstruction and object recognition. Video reconstruction is conducted on the NFS-syn dataset, while object recognition is performed on the NCars dataset Sironi et al. (2018). ACC and AUC represent accuracy and area under the curve, respectively. GT denotes the result obtained by directly using downsampled event streams for recognition. Bold and underline indicate the best and second-best results.

Video Reconstruction. Video reconstruction is a crucial task within event-based applications Rebecq et al. (2019); Stoffregen et al. (2020); Weng et al. (2021); Liang et al. (2023); Yang et al. (2023). Firstly, we conducted $2(4,8)\times$ SR on NFS-syn using bicubic, SRFBN Li et al. (2019b), RSTT Geng et al. (2022), EventZoom Duan et al. (2021), RecEvSR Weng et al. (2022), and our RMFNet. Subsequently, we adopt E2VID Rebecq et al. (2019) as the benchmark algorithm for event-based video reconstruction and utilize the structural similarity (SSIM) Wang et al. (2004) and the perceptual similarity (LPIPS) Zhang et al. (2018) as evaluation metrics for reconstruction quality. Table 4 presents the quantitative results for event-based video reconstruction, indicating that our method surpasses others in both SSIM and LPIPS metrics, and exhibits more visually satisfying details (see Supplementary Material). This further underscores our method’s capability to better restore details in the LR event stream.

Object Recognition. We also compare all models and methods on the event-based object recognition task. Following the methodology of Weng et al. (2021), we employ the NCars dataset Sironi et al. (2018) for experimentation and leverage the classifier proposed by Gehrig et al. (2019) for object recognition. Specifically, we first performed $8\times$ downsampling on the NCars dataset through coordinate relocation. Subsequently, we employ different models to conduct $2(4,8)\times$ super-resolution on the event stream and employ the object recognition method for identification. Table 4 shows the results of the object recognition comparison. We evaluate using accuracy (ACC) and area under the curve (AUC). GT signifies the utilization of results directly obtained from downsampled raw event streams. It can be observed that our method consistently outperforms other approaches across the $2(4,8)\times$ super-resolution scales. In comparison with previous methods, our approach achieves an average improvement of over 9% in terms of ACC and AUC. These results demonstrate the superior detail restoration capability of our method.
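The downsampling step can be sketched as follows, assuming that "coordinate relocation" means integer-dividing each event's pixel coordinates by the scale factor while keeping timestamps and polarities unchanged. That reading, the tuple layout `(x, y, t, p)`, and the function name `downsample_events` are assumptions made for illustration.

```python
def downsample_events(events, factor):
    """Relocate each event to the LR grid by floor-dividing its coordinates.

    Timestamps and polarities are kept as they are; several HR events may
    land on the same LR pixel, and no merging is performed here.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return [(x // factor, y // factor, t, p) for x, y, t, p in events]
```

For example, with `factor=8` an event at pixel (17, 9) is relocated to pixel (2, 1) on the low-resolution grid.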

5 Conclusion

In this paper, we introduced an efficient Recursive Multi-Branch Information Fusion Network (RMFNet) for ESR tasks. RMFNet leverages a carefully designed multi-branch network architecture, taking decoupled positive and negative events as well as coupled event frames as input to achieve super-resolution of event streams. Additionally, we introduced an attention-based Feature Fusion Module and a Feature Exchange Module, which effectively integrate contextual information from neighboring event streams and facilitate the exchange of complementary information between positive and negative events. Furthermore, we explored the impact of data augmentation methods on ESR tasks and proposed an effective data augmentation strategy to enhance model robustness and performance. Results on both real and synthetic datasets demonstrated that our approach outperforms previous ESR methods across various metrics.

Acknowledgments

This work was supported in part by the Guangxi Key R & D Program (No. GuikeAB24010324), in part by the National Natural Science Foundation of China (No. 62088102, No. 62425101), in part by the Key-Area Research and Development Program of Guangdong Province (No. 2021B0101400002), and the Major Key Project of PCL (No. PCL2021A13).

Contribution Statement

Quanmin Liang and Zhilin Huang contributed equally. Kai Huang and Yonghong Tian are the corresponding authors. All the authors participated in designing the research, analyzing the data, and writing the paper.

References

  • Barchid et al. [2023] Sami Barchid, José Mennesson, and Chaabane Djéraba. Exploring joint embedding architectures and data augmentations for self-supervised representation learning in event-based vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3902–3911, 2023.
  • Brandli et al. [2014] Christian Brandli, Raphael Berner, Minhao Yang, Shih-Chii Liu, and Tobi Delbruck. A 240 × 180 130 dB 3 µs latency global shutter spatiotemporal vision sensor. IEEE Journal of Solid-State Circuits, 49(10):2333–2341, 2014.
  • Christoph and Pinz [2016] R Christoph and Feichtenhofer Axel Pinz. Spatiotemporal residual networks for video action recognition. Advances in neural information processing systems, 2, 2016.
  • Cubuk et al. [2020] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 702–703, 2020.
  • Duan et al. [2021] Peiqi Duan, Zihao W Wang, Xinyu Zhou, Yi Ma, and Boxin Shi. Eventzoom: Learning to denoise and super resolve neuromorphic events. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12824–12833, 2021.
  • Feichtenhofer et al. [2019] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6202–6211, 2019.
  • Gallego et al. [2020] Guillermo Gallego, Tobi Delbrück, Garrick Orchard, Chiara Bartolozzi, Brian Taba, Andrea Censi, Stefan Leutenegger, Andrew J Davison, Jörg Conradt, Kostas Daniilidis, et al. Event-based vision: A survey. IEEE transactions on pattern analysis and machine intelligence, 44(1):154–180, 2020.
  • Gehrig and Scaramuzza [2022] Daniel Gehrig and Davide Scaramuzza. Are high-resolution event cameras really needed? arXiv preprint arXiv:2203.14672, 2022.
  • Gehrig et al. [2019] Daniel Gehrig, Antonio Loquercio, Konstantinos G Derpanis, and Davide Scaramuzza. End-to-end learning of representations for asynchronous event-based data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5633–5643, 2019.
  • Geng et al. [2022] Zhicheng Geng, Luming Liang, Tianyu Ding, and Ilya Zharkov. Rstt: Real-time spatial temporal transformer for space-time video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17441–17451, 2022.
  • Gu et al. [2021] Fuqiang Gu, Weicong Sng, Xuke Hu, and Fangwen Yu. Eventdrop: Data augmentation for event-based learning. arXiv preprint arXiv:2106.05836, 2021.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Kiani Galoogahi et al. [2017] Hamed Kiani Galoogahi, Ashton Fagg, Chen Huang, Deva Ramanan, and Simon Lucey. Need for speed: A benchmark for higher frame rate object tracking. In Proceedings of the IEEE International Conference on Computer Vision, pages 1125–1134, 2017.
  • Li et al. [2019a] Hongmin Li, Guoqi Li, and Luping Shi. Super-resolution of spatiotemporal event-stream image. Neurocomputing, 335:206–214, 2019.
  • Li et al. [2019b] Zhen Li, Jinglei Yang, Zheng Liu, Xiaomin Yang, Gwanggil Jeon, and Wei Wu. Feedback network for image super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3867–3876, 2019.
  • Li et al. [2021] Siqi Li, Yutong Feng, Yipeng Li, Yu Jiang, Changqing Zou, and Yue Gao. Event stream super-resolution via spatiotemporal constraint learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4480–4489, 2021.
  • Liang et al. [2023] Quanmin Liang, Xiawu Zheng, Kai Huang, Yan Zhang, Jie Chen, and Yonghong Tian. Event-diffusion: Event-based image reconstruction and restoration with diffusion models. In Proceedings of the 31st ACM International Conference on Multimedia, pages 3837–3846, 2023.
  • Lin et al. [2022] Songnan Lin, Ye Ma, Zhenhua Guo, and Bihan Wen. Dvs-voltmeter: Stochastic process-based event simulator for dynamic vision sensors. In European Conference on Computer Vision, pages 578–593. Springer, 2022.
  • Maqueda et al. [2018] Ana I Maqueda, Antonio Loquercio, Guillermo Gallego, Narciso García, and Davide Scaramuzza. Event-based vision meets deep learning on steering prediction for self-driving cars. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5419–5427, 2018.
  • Rebecq et al. [2017] Henri Rebecq, Timo Horstschaefer, and Davide Scaramuzza. Real-time visual-inertial odometry for event cameras using keyframe-based nonlinear optimization. 2017.
  • Rebecq et al. [2019] Henri Rebecq, René Ranftl, Vladlen Koltun, and Davide Scaramuzza. High speed and high dynamic range video with an event camera. IEEE transactions on pattern analysis and machine intelligence, 43(6):1964–1980, 2019.
  • Schuster and Paliwal [1997] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE transactions on Signal Processing, 45(11):2673–2681, 1997.
  • Shi et al. [2016] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1874–1883, 2016.
  • Simonyan and Zisserman [2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
  • Sironi et al. [2018] Amos Sironi, Manuele Brambilla, Nicolas Bourdis, Xavier Lagorce, and Ryad Benosman. Hats: Histograms of averaged time surfaces for robust event-based object classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1731–1740, 2018.
  • Stoffregen et al. [2020] Timo Stoffregen, Cedric Scheerlinck, Davide Scaramuzza, Tom Drummond, Nick Barnes, Lindsay Kleeman, and Robert Mahony. Reducing the sim-to-real gap for event cameras. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVII 16, pages 534–549. Springer, 2020.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Wang et al. [2004] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process., 13(4):600–612, 2004.
  • Wang et al. [2018] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018.
  • Wang et al. [2020a] Lin Wang, Tae-Kyun Kim, and Kuk-Jin Yoon. Eventsr: From asynchronous events to image reconstruction, restoration, and super-resolution via end-to-end adversarial learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8315–8325, 2020.
  • Wang et al. [2020b] Zihao W Wang, Peiqi Duan, Oliver Cossairt, Aggelos Katsaggelos, Tiejun Huang, and Boxin Shi. Joint filtering of intensity images and neuromorphic events for high-resolution noise-robust imaging. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1609–1619, 2020.
  • Weng et al. [2021] Wenming Weng, Yueyi Zhang, and Zhiwei Xiong. Event-based video reconstruction using transformer. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 2543–2552. IEEE, 2021.
  • Weng et al. [2022] Wenming Weng, Yueyi Zhang, and Zhiwei Xiong. Boosting event stream super-resolution with a recurrent neural network. In European Conference on Computer Vision, pages 470–488. Springer, 2022.
  • Yang et al. [2023] Yixin Yang, Jin Han, Jinxiu Liang, Imari Sato, and Boxin Shi. Learning event guided high dynamic range video reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13924–13934, 2023.
  • Yoo et al. [2020] Jaejun Yoo, Namhyuk Ahn, and Kyung-Ah Sohn. Rethinking data augmentation for image super-resolution: A comprehensive analysis and a new strategy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8375–8384, 2020.
  • Yun et al. [2019] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6023–6032, 2019.
  • Zhang et al. [2017] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
  • Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 586–595. Computer Vision Foundation / IEEE Computer Society, 2018.
  • Zhu et al. [2018] Alex Zihao Zhu, Liangzhe Yuan, Kenneth Chaney, and Kostas Daniilidis. Ev-flownet: Self-supervised optical flow estimation for event-based cameras. arXiv preprint arXiv:1802.06898, 2018.