License: arXiv.org perpetual non-exclusive license
arXiv:2403.05050v3 [cs.CV] 18 Mar 2024

DyRoNet: Dynamic Routing and Low-Rank Adapters for Autonomous Driving Streaming Perception

Xiang Huang$^{1,2}$ (internship at CMU)    Zhi-Qi Cheng$^{3}$ (corresponding author)    Jun-Yan He$^{4}$    Chenyang Li$^{4}$    Wangmeng Xiang$^{4,5}$    Baigui Sun$^{4}$    Xiao Wu$^{1,2}$
$^{1}$Southwest Jiaotong University  $^{2}$Engineering Research Center of Sustainable Urban Intelligent Transportation, China  $^{3}$Language Technologies Institute, Carnegie Mellon University
$^{4}$Alibaba Group  $^{5}$The Hong Kong Polytechnic University
[email protected], [email protected], {junyanhe1989, marquezxm, wuxiaohk}@gmail.com,
[email protected], [email protected]
Abstract

The advancement of autonomous driving systems hinges on the ability to achieve low-latency and high-accuracy perception. To address this critical need, this paper introduces the Dynamic Routing Network (DyRoNet), a low-rank enhanced dynamic routing framework designed for streaming perception in autonomous driving systems. DyRoNet integrates a suite of pre-trained branch networks, each fine-tuned to function under distinct environmental conditions. At its core, the framework offers a speed router module, developed to assess and route input data to the most suitable branch for processing. This approach not only addresses the inherent limitations of conventional models in adapting to diverse driving conditions but also ensures a balance between performance and efficiency. Extensive experimental evaluations demonstrate the adaptability of DyRoNet to diverse branch selection strategies, resulting in significant performance enhancements across different scenarios. This work not only establishes a new benchmark for streaming perception but also provides valuable engineering insights for future work. (Project: https://tastevision.github.io/DyRoNet/)

1 Introduction

In autonomous driving systems, it is crucial to achieve low-latency and high-precision perception. Traditional object detection algorithms Zou et al. (2023), while effective in various contexts, often confront the challenge of latency due to inherent computational delays. This lag between algorithmic processing and real-world states can lead to notable discrepancies between predicted and actual object locations. Such latency issues have been extensively reported and are known to significantly impact the decision-making process in autonomous driving systems Chen et al. (2023).

To address these challenges, the concept of streaming perception has been introduced Li et al. (2020). This perception task aims to predict “future” results by accounting for the delays incurred during the frame processing stage. Unlike traditional methods that primarily focus on detection at a given moment, streaming perception anticipates future environmental states and aligns perceptual outputs more closely with real-time dynamics. This paradigm is key to closing the gap between real-time processing and real-world changes, thereby enhancing the safety and reliability of autonomous driving systems Muhammad et al. (2020).

Figure 1: Illustration of DyRoNet’s dynamic selection mechanism in streaming perception. This diagram showcases DyRoNet’s capability to adaptively choose the most suitable perception strategy, contrasting with the static approach of traditional methods in complex environments [Viewing in color and at an expanded scale].

Although existing streaming approaches seem promising, they still face difficulties in real-world scenarios. These difficulties primarily stem from the diverse and unpredictable nature of driving environments. Factors such as camera motion, weather conditions, lighting variations, and the presence of small objects seriously impact perception performance, leading to fluctuations that challenge robustness and reliability (see Sec. 3.1). This complexity in real-world scenarios underscores the limitations of a single, uniform model, which often struggles to adapt to the varied demands of different driving conditions Guo et al. (2019). In general, the challenges of streaming perception mainly include:

(1) Diverse Scenario Distribution: Autonomous driving environments are inherently complex and dynamic, showing a myriad of scenarios that a single perception model may not adequately address (see Fig. 1). The need to customize perception algorithms to specific environmental conditions, while ensuring that these models operate cohesively, poses a significant challenge. As discussed in Sec. 3.1, adapting models to various scenarios without compromising their core functionality is a crucial aspect of streaming perception.

(2) Performance-Efficiency Balance: Integrating both large- and small-scale models is essential for handling the varying complexities encountered in different driving scenes. Large models, while potentially more accurate, may suffer from increased latency, whereas smaller models may offer faster inference at the cost of reduced accuracy. Balancing performance and efficiency therefore becomes a challenging task. In Sec. 3.1, we explore strategies for optimizing this balance and examine how different model architectures can be effectively utilized to enhance streaming perception.

Generally speaking, these challenges highlight the demand for streaming perception. As we study in Sec. 3.1, addressing the diverse scenario distribution and achieving an optimal balance between performance and efficiency are key to advancing the state-of-the-art in autonomous driving. To address the intricate challenges presented by real-world streaming perception, we introduce DyRoNet, a framework designed to enhance dynamic routing capabilities in autonomous driving systems. DyRoNet stands as a low-rank enhanced dynamic routing framework, specifically crafted to cater to the requirements of streaming perception. It encapsulates a suite of pre-trained branch networks, each meticulously fine-tuned to optimally function under distinct environmental conditions. A key component of DyRoNet is the speed router module, ingeniously developed to assess and efficiently route input data to the most appropriate branch, as detailed in Sec. 3.2. To sum up, the contributions are listed as:

  • We emphasize the impact of environmental speed as a key determinant of streaming perception. Through analysis of various environmental factors, our research highlights the imperative need for adaptive perception responsive to dynamic conditions.

  • Building on a variety of existing streaming perception techniques, DyRoNet introduces the speed router as its key component. The router dynamically determines the best route for handling each input, balancing efficiency and accuracy in perception. This dynamic route-selection mechanism is what gives the framework its adaptability and versatility.

  • Extensive experimental evaluations have demonstrated that DyRoNet is capable of adapting to diverse branch selection strategies, resulting in a substantial enhancement of performance across various branch structures. This not only validates the framework’s wide-ranging applicability but also confirms its effectiveness in handling different real-world scenarios.

In summary, DyRoNet advances low-latency, high-accuracy perception for autonomous driving. By addressing the challenges of environmental adaptability and dynamic branch selection, it sets new benchmarks for streaming perception.

Figure 2: The DyRoNet Framework: This figure presents the architecture of DyRoNet, featuring a multi-branch network design. For simplicity, only two branches are shown, each representing a streaming perception sub-network. The core network architecture is detailed in the upper right. Each branch processes both the current frame $I_t$ and a series of historical frames $I_{t-1}, I_{t-2}, \cdots, I_{t-n}$. The backbone and neck of the network extract features, which are then split into two streams for the current and historical frames. These streams are fused together before entering the prediction head. Branch selection is governed by the Speed Router, which analyzes the frame difference $\Delta I_t$ derived from $I_t$ and $I_{t-1}$ to determine the most suitable branch for the given input.

2 Related Work

This section revisits developments in streaming perception and dynamic neural networks, highlighting differences from our proposed DyRoNet framework. While existing methods have made progress, limitations persist in addressing real-world autonomous driving complexity.

2.1 Streaming Perception

Existing streaming perception methods fall into three main categories. (1) The initial methods focused on single-frame detection, with models like YOLOv5 Jocher et al. (2021) and YOLOX Ge et al. (2021) achieving real-time performance. However, because they do not capture motion trends, they struggle in dynamic scenarios. (2) More recent approaches incorporate current and historical frames, like StreamYOLO Yang et al. (2022), which builds on YOLOX with dual-flow fusion. LongShortNet Li et al. (2023) used longer histories and diverse fusion. DAMO-StreamNet He et al. (2023) added asymmetric distillation and deformable convolutions to improve large object perception. (3) Recognizing the limitations of single models, current methods explore dynamic multi-model systems. One approach Ghosh et al. (2021) adapts models to environments via reinforcement learning. DaDe Jo et al. (2022) extends StreamYOLO by calculating delays to determine frame steps. A later version Huang and Chen (2023) added multi-branch prediction heads. Beyond 2D detection, streaming perception expands into optical flow, tracking, and 3D detection, with innovations in metrics and benchmarks Wang et al. (2023c); Sela et al. (2022); Wang et al. (2023b). Distinct from these existing approaches, our proposed method, DyRoNet, introduces a low-rank enhanced dynamic routing mechanism specifically designed for streaming perception. DyRoNet stands out by integrating a suite of advanced branch networks, each fine-tuned for specific environmental conditions. Its key innovation lies in the speed router module, which not only routes input data efficiently but also dynamically adapts to the diverse and unpredictable nature of real-world driving scenarios.

2.2 Dynamic Neural Networks

Dynamic Neural Networks (DNNs) feature adaptive network selection, outperforming static models in efficiency and performance (Han et al. (2021); Lan et al. (2023); Zhang et al. (2023)). The existing research primarily focuses on structural design for core deep learning tasks like image classification (Huang et al. (2018); Wang et al. (2020, 2018)). DNNs follow two approaches: (1) Multi-branch models (Bejnordi et al. (2019); Cai et al. (2021); Shazeer et al. (2017); Wang et al. (2023a); Qiao et al. (2022)) rely on a lightweight router assessing inputs to direct them to appropriate branches, enabling tailored computation. (2) By generating new weights based on inputs (Yang et al. (2019); Chen et al. (2020); Su et al. (2019); Zhu et al. (2019)), these models dynamically alter computations to match diverse needs. DNN applications expand beyond conventional tasks. In object detection, DynamicDet (Lin et al. (2023)) categorizes inputs and processes them through distinct branches. This illustrates DNNs’ broader applicability and efficiency, promising contributions particularly for complex, dynamic environments.

3 Proposed Method

This section outlines the framework of our proposed DyRoNet. Beginning with its underlying motivation and the critical factors driving its design, we subsequently provide an overview of its architecture and training process.

3.1 Motivation for DyRoNet

Autonomous driving faces variability from weather, scene complexity, and vehicle velocity. By strategically analyzing key factors and routing logic, this section details the rationale behind the proposed DyRoNet.

Analysis of Influential Factors. Our statistical analysis of the Argoverse-HD dataset Li et al. (2020) underscores the profound influence of environmental dynamics on the effectiveness of streaming perception. While weather inconsistently impacts accuracy, suggesting the presence of other influential factors (see Appendix A.1), fluctuations in the object count show limited correlation with performance degradation (see Appendix A.2). Conversely, the presence of small objects across various scenes poses a significant challenge for detection, especially under varying motion states (see Appendix A.3). Notably, disparities in performance are most pronounced across different environmental motion states (see Appendix A.4), thereby motivating the need for a dynamic, velocity-aware routing mechanism in DyRoNet.

Rationale for Dynamic Routing. Analysis reveals that StreamYOLO’s reliance on a single historical frame falters at high velocities, in contrast to multi-frame models, highlighting a clear connection between velocity and detection performance (see Tab. 1). Dynamic adaptation of frame history, based on vehicular speed changes, enables DyRoNet to strike a balance between accuracy and latency (see Sec. 4.3). Through first-order differences, the system efficiently switches models to align with environmental motions. Specifically, the dynamic routing is designed to select the optimal architecture based on the vehicle’s speed profile, ensuring precision at lower velocities for detailed perception and efficiency at higher speeds for swift response. Such adaptable routing, informed by comprehensive speed analysis, positions DyRoNet as a robust solution for reliable perception across diverse autonomous driving scenarios. Next, we introduce DyRoNet in detail.

3.2 Architecture of DyRoNet

Overview of DyRoNet. As depicted in Fig. 2, DyRoNet adopts a multi-branch structure. Each branch within the DyRoNet framework functions as an independent streaming perception model, capable of processing both the current and historical frames. This dual-frame processing is central to DyRoNet’s capability, facilitating a nuanced understanding of temporal dynamics. Such a design is key to balancing latency and accuracy, both of which are crucial for real-time autonomous driving.

Mathematically, at the core of DyRoNet lies the processing of a frame sequence, $\mathcal{S}=\{I_t,\cdots,I_{t-N\delta t}\}$, where $N$ indicates the number of frames and $\delta t$ the interval between successive frames. The process of the framework is formalized as:

$$\mathcal{T}=\mathcal{F}(\mathcal{S},\mathcal{P},\mathcal{W}),$$

where $\mathcal{P}=\{P_0,\cdots,P_{K-1}\}$ denotes a collection of streaming perception models, with each $P_i$ denoting an individual model within this suite. The architecture is further enhanced by incorporating a feature extractor $\mathcal{G}_i$ and a perception head $\mathcal{H}_i$ for each model. The Router Network, $\mathcal{R}$, is instrumental in selecting the most suitable streaming perception model for each specific scenario.

Correspondingly, the weights of DyRoNet are denoted by $\mathcal{W}=\{W^d,W^l,W^r\}$, where $W^d$ indicates the weights of the streaming perception model, $W^l$ relates to the Low-Rank Adaptation (LoRA) weights within each model, and $W^r$ pertains to the Router Network. The culmination of this process is the final output, $\mathcal{T}$, a compilation of feature maps. These maps can be further decoded through $\mathrm{Decode}(\mathcal{T})$, revealing essential details like objects, categories, and locations. Below we introduce each module in detail.
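To make the frame sequence $\mathcal{S}$ concrete, the following minimal sketch lists which frame indices would be gathered for a given time step. It is an illustration only: the function name `frame_indices` is hypothetical, the sequence is read as the current frame plus $N$ history frames spaced $\delta t$ apart, and clamping at index 0 (repeating the first frame at the start of a video) is an assumption not stated in the text.

```python
from typing import List


def frame_indices(t: int, n: int, dt: int = 1) -> List[int]:
    """Indices of S = {I_t, I_{t-dt}, ..., I_{t-n*dt}}.

    Assumption: indices that would fall before the first frame are
    clamped to 0, i.e. the first frame is repeated as padding.
    """
    if n < 0 or dt <= 0:
        raise ValueError("n must be >= 0 and dt must be > 0")
    return [max(t - k * dt, 0) for k in range(n + 1)]


if __name__ == "__main__":
    # Current frame 10, three history frames, two-frame spacing.
    print(frame_indices(10, 3, 2))
    # Near the start of the video the history is padded with frame 0.
    print(frame_indices(1, 3, 1))
```

A branch that relies on a single history frame corresponds to `n = 1`, while branches with longer histories use a larger `n`.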

Figure 3: The mean curves of frame differences are depicted here. The four curves correspond to frame sizes of the original frame, 200$\times$200, 100$\times$100, and 50$\times$50. Notably, these curves show distinct fluctuations across different vehicle motion scenarios.

Router Network. The Router Network in DyRoNet plays a crucial role in understanding and classifying the dynamics of the environment. This module is designed for both environmental classification and branch decision-making. To effectively and rapidly capture environmental speed, frame differences are employed as the input to the Router Network. As shown in Fig. 3, frame differences exhibit a high discriminative advantage for different environmental speeds.

Specifically, for frames at times $t$ and $t-1$, represented as $I_t$ and $I_{t-1}$ respectively, the frame difference is computed as $\Delta I_t = I_t - I_{t-1}$. The architecture of the Router Network, $\mathcal{R}$, is simple yet efficient. It consists of a single convolutional layer followed by a linear layer. The network’s output, denoted as $f^r\in\mathbb{R}^K$, captures the essence of the environmental dynamics. Based on this output, the index $\sigma$ of the optimal branch for processing the current input frame $I_t$ is determined through the following equation:

$$\sigma=\mathop{\arg\max}_{K}\big(\mathcal{R}(\Delta I_t),W^r\big),\quad\sigma\in\{0,\cdots,K-1\}, \tag{1}$$

where $\sigma$ is the index of the branch deemed most suitable for the current environmental context. Once $\sigma$ is determined, the input frame $I_t$ is automatically routed to the corresponding branch by a dispatcher.

In particular, this strategy of using frame differences to gauge environmental speed is efficient. It offers a faster alternative to traditional methods such as optical flow fields. Moreover, it focuses on frame-level variations rather than the speed of individual objects, providing a more generalized representation of environmental dynamics. The sparsity of $\Delta I_t$ also contributes to the robustness of this method, reducing computational complexity and making the Router Network’s operations nearly negligible in the context of the overall model’s performance.
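The routing step of Eq. (1) can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: the frames are treated as single-channel arrays, the global average pooling between the convolution and the linear layer is an assumption made to obtain a fixed-size vector, and all function and parameter names (`frame_difference`, `router_logits`, `select_branch`, `kernels`, `lin_w`, `lin_b`) are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def frame_difference(frame_t, frame_prev):
    """Delta I_t = I_t - I_{t-1}, computed in float32 to avoid uint8 wrap-around."""
    return frame_t.astype(np.float32) - frame_prev.astype(np.float32)


def router_logits(delta, kernels, lin_w, lin_b):
    """One convolution followed by one linear layer, giving f^r in R^K.

    delta:   (H, W) frame difference
    kernels: (C, kh, kw) convolution filters (cross-correlation, no padding)
    lin_w:   (K, C) linear weights, lin_b: (K,) linear bias
    """
    windows = sliding_window_view(delta, kernels.shape[1:])    # (H', W', kh, kw)
    feat_maps = np.einsum("hwij,cij->chw", windows, kernels)   # (C, H', W')
    pooled = feat_maps.mean(axis=(1, 2))                       # assumed global average pool
    return lin_w @ pooled + lin_b                              # (K,)


def select_branch(frame_t, frame_prev, kernels, lin_w, lin_b):
    """Return sigma, the index of the branch with the largest router score."""
    delta = frame_difference(frame_t, frame_prev)
    return int(np.argmax(router_logits(delta, kernels, lin_w, lin_b)))


if __name__ == "__main__":
    kernels = np.ones((1, 3, 3), dtype=np.float32)
    lin_w = np.array([[0.0], [1.0]], dtype=np.float32)
    lin_b = np.array([0.1, 0.0], dtype=np.float32)
    still = np.zeros((8, 8), dtype=np.uint8)
    moved = np.full((8, 8), 10, dtype=np.uint8)
    print(select_branch(still, still, kernels, lin_w, lin_b))  # static scene
    print(select_branch(moved, still, kernels, lin_w, lin_b))  # large frame change
```

With these hand-picked toy weights, an unchanged frame pair falls back to the bias and selects branch 0, while a large frame difference pushes the score of branch 1 higher. In the actual framework the router weights $W^r$ are learned rather than set by hand.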

Model Bank & Dispatcher. The core of the DyRoNet framework is its model bank, which consists of an array of streaming perceptual models, denoted as $\mathcal{P}=\{P_0,\cdots,P_{K-1}\}$. Typically, the selection of the most suitable model for processing a given input is intelligently managed by the Router Network. This process is formalized as $P_\sigma=\text{Disp}(\mathcal{R},\mathcal{P})$, where Disp acts as a dispatcher, facilitating the dynamic selection of models from $\mathcal{P}$ based on the input. The operational flow of DyRoNet can be mathematically defined as:

$$\begin{aligned}
\mathcal{T} &= \mathcal{F}(\mathcal{S},\mathcal{P},\mathcal{W}) \\
&= \text{Disp}(\mathcal{R}(\Delta I_t),\mathcal{P})(I_t;W^d_\sigma,W^l_\sigma),
\end{aligned}$$

where $\mathcal{R}$ symbolizes the Router Network, and $\Delta I_t$ refers to the frame difference, a key input for model selection. The weights $W^d_\sigma$ and $W^l_\sigma$ correspond to the selected streaming perception model and its Low-Rank Adaptation (LoRA) parameters, respectively.

Note that the versatility of DyRoNet is further highlighted by its compatibility with a wide range of Streaming Perception models, even ones that rely solely on detectors Ge et al. (2021). To demonstrate the efficacy of DyRoNet, it has been evaluated using three contemporary streaming perception models: StreamYOLO Yang et al. (2022), LongShortNet Li et al. (2023), and DAMO-StreamNet He et al. (2023) (see Sec. 4.3). This Model Bank & Dispatcher strategy illustrates the adaptability and robustness of DyRoNet across different streaming perception scenarios.
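The dispatcher logic can be illustrated with a small, framework-free sketch. The names `dispatch` and `forward` are hypothetical, and the branches here are placeholder callables standing in for real streaming perception models; the sketch only shows how a router score vector selects $P_\sigma$ from the model bank.

```python
from typing import Callable, Sequence

# A branch is any callable that maps an input frame to an output.
Branch = Callable[[object], object]


def dispatch(router_scores: Sequence[float], model_bank: Sequence[Branch]) -> Branch:
    """Return P_sigma, the branch whose router score is largest."""
    if len(router_scores) != len(model_bank):
        raise ValueError("router must produce one score per branch")
    # Index of the largest score; ties resolve to the lowest index.
    sigma = max(range(len(router_scores)), key=router_scores.__getitem__)
    return model_bank[sigma]


def forward(frame: object, router_scores: Sequence[float],
            model_bank: Sequence[Branch]) -> object:
    """Run only the selected branch on the current frame."""
    return dispatch(router_scores, model_bank)(frame)


if __name__ == "__main__":
    # Placeholder branches; real ones would be detection networks.
    bank = [lambda f: ("light branch", f), lambda f: ("heavy branch", f)]
    print(forward("frame_0", [0.2, 0.8], bank))
```

Because only the selected branch is executed, the per-frame cost is that of one branch plus the router, rather than the cost of the whole model bank.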

Low-Rank Adaptation. A key challenge arises when fully fine-tuning individual branches, especially under the direction of Router Network. This strategy can lead to biases in the distribution of training data and inefficiencies in the learning process. Specifically, lighter branches may become predisposed to simpler cases, while more complex ones might be tailored to handle intricate scenarios, thereby heightening the risk of overfitting. Our experimental results, detailed in Sec. 4.3, support this observation.

To address these challenges, we have incorporated the Low-Rank Adapter Hu et al. (2021) into our streaming perception models. Within each model $P_i$, initially pre-trained on a dataset, the key components are the convolution kernel and bias matrices, symbolized as $W^d_i$. The rank of the Low-Rank Adaptation (LoRA) module is defined as $r$, a value significantly smaller than the dimensionality of $W^d_i$, to ensure efficient adaptation. The update to the weight matrix adheres to a low-rank decomposition form, represented as $W^d_i+\Delta W=W^d_i+BA$. (Here, $B$ is a matrix in $\mathbb{R}^{d\times r}$, and $A$ is in $\mathbb{R}^{r\times k}$, ensuring that the rank $r$ remains much smaller than $d$.) This adaptation strategy allows the original weights $W^d_i$ to remain fixed, while the low-rank components $BA$ are trained and adjusted. The adaptation process is executed through the following projection:

$$W^d_i x+\Delta W x=W^d_i x+W^l_i x, \tag{2}$$

where $x$ represents the input image or feature map, and $\Delta W=W^l_i=BA$. The matrices $A$ and $B$ start from an initialized state and are fine-tuned during the adaptation process. This approach maintains the general applicability of the model by fixing $W^d_i$, while also enabling specialization within specific sub-domains, as determined by the Router Network.

Particularly, in DyRoNet, we employ a rank $r$ of $32$ for the LoRA module, though this can be adjusted based on specific requirements of the scenarios in question. This low-rank adaptation mechanism not only enhances the flexibility of the DyRoNet framework but also significantly mitigates the risk of overfitting, ensuring that each branch remains efficient and effective in its designated role.
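The low-rank update of Eq. (2) can be illustrated on a single weight matrix. The sketch below is a simplified stand-in, assuming a dense $d\times k$ matrix in place of a convolution kernel; the class name `LoRALinear` is hypothetical, and the initialization (zeros for $B$, small Gaussian values for $A$) follows the convention of Hu et al. (2021) rather than anything specified for DyRoNet.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W (d x k) plus a trainable low-rank update B @ A."""

    def __init__(self, frozen_weight, rank, seed=0):
        d, k = frozen_weight.shape
        rng = np.random.default_rng(seed)
        self.W = frozen_weight                           # kept fixed during adaptation
        self.A = rng.normal(0.0, 0.01, size=(rank, k))   # trainable, r x k
        self.B = np.zeros((d, rank))                     # trainable, d x r; zero at start

    def delta(self):
        """Delta W = B A, whose rank is at most r."""
        return self.B @ self.A

    def __call__(self, x):
        """W x + B (A x); Delta W is never materialized in the forward pass."""
        return self.W @ x + self.B @ (self.A @ x)

    def trainable_params(self):
        return self.A.size + self.B.size


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    W = rng.normal(size=(64, 128))
    layer = LoRALinear(W, rank=4)
    x = rng.normal(size=128)
    # With B = 0 the adapted layer initially reproduces the frozen one.
    print(np.allclose(layer(x), W @ x))
    # Trainable parameters versus the size of the frozen matrix.
    print(layer.trainable_params(), W.size)
```

For a $d\times k$ matrix the adapter adds $r(d+k)$ trainable parameters instead of $dk$, which is why a small $r$ keeps per-branch fine-tuning cheap.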

3.3 Training Details of DyRoNet

The training process of DyRoNet focuses on two primary goals: (1) improving the performance of individual branches within the streaming perception model and (2) achieving an optimal balance between accuracy and computational efficiency. This dual-objective framework is represented by the overall loss function:

$$L=\mathcal{L}^{sp}+\mathcal{L}^{E^2}, \tag{3}$$

where $\mathcal{L}^{sp}$ represents the streaming perception loss, and $\mathcal{L}^{E^2}$ denotes the effective and efficient (E$^2$) loss, which supervises branch selection.

Streaming Perception (SP) Loss. Each branch in DyRoNet is fine-tuned using its original loss function to maintain effectiveness. The router network is trained to select the optimal branch based on efficiency supervision. Let $\mathcal{T}_i=\{F_i^{cls},F_i^{reg},F_i^{obj}\}$ denote the logits produced by the $i$-th branch and $\mathcal{T}_{gt}=\{F_{gt}^{cls},F_{gt}^{reg},F_{gt}^{obj}\}$ represent the corresponding ground-truth, where $F_{\cdot}^{cls}$, $F_{\cdot}^{reg}$, and $F_{\cdot}^{obj}$ are the classification, regression, and objectness logits, respectively. The streaming perception loss for each branch, $\mathcal{L}_i^{sp}$, is defined as follows:

isp(𝒯i,𝒯gt)=superscriptsubscript𝑖𝑠𝑝subscript𝒯𝑖subscript𝒯𝑔𝑡absent\displaystyle\mathcal{L}_{i}^{sp}(\mathcal{T}_{i},\mathcal{T}_{gt})=caligraphic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_s italic_p end_POSTSUPERSCRIPT ( caligraphic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , caligraphic_T start_POSTSUBSCRIPT italic_g italic_t end_POSTSUBSCRIPT ) = cls(Ficls,Fgtcls)+obj(Fiobj,Fgtobj)subscript𝑐𝑙𝑠superscriptsubscript𝐹𝑖𝑐𝑙𝑠superscriptsubscript𝐹𝑔𝑡𝑐𝑙𝑠subscript𝑜𝑏𝑗superscriptsubscript𝐹𝑖𝑜𝑏𝑗superscriptsubscript𝐹𝑔𝑡𝑜𝑏𝑗\displaystyle\,\mathcal{L}_{cls}(F_{i}^{cls},F_{gt}^{cls})+\mathcal{L}_{obj}(F% _{i}^{obj},F_{gt}^{obj})caligraphic_L start_POSTSUBSCRIPT italic_c italic_l italic_s end_POSTSUBSCRIPT ( italic_F start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c italic_l italic_s end_POSTSUPERSCRIPT , italic_F start_POSTSUBSCRIPT italic_g italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c italic_l italic_s end_POSTSUPERSCRIPT ) + caligraphic_L start_POSTSUBSCRIPT italic_o italic_b italic_j end_POSTSUBSCRIPT ( italic_F start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_o italic_b italic_j end_POSTSUPERSCRIPT , italic_F start_POSTSUBSCRIPT italic_g italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_o italic_b italic_j end_POSTSUPERSCRIPT ) (4)
+reg(Fireg,Fgtreg),subscript𝑟𝑒𝑔superscriptsubscript𝐹𝑖𝑟𝑒𝑔superscriptsubscript𝐹𝑔𝑡𝑟𝑒𝑔\displaystyle+\mathcal{L}_{reg}({F}_{i}^{reg},{F}_{gt}^{reg}),+ caligraphic_L start_POSTSUBSCRIPT italic_r italic_e italic_g end_POSTSUBSCRIPT ( italic_F start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_r italic_e italic_g end_POSTSUPERSCRIPT , italic_F start_POSTSUBSCRIPT italic_g italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_r italic_e italic_g end_POSTSUPERSCRIPT ) ,

where cls()subscript𝑐𝑙𝑠\mathcal{L}_{cls}(\cdot)caligraphic_L start_POSTSUBSCRIPT italic_c italic_l italic_s end_POSTSUBSCRIPT ( ⋅ ) and obj()subscript𝑜𝑏𝑗\mathcal{L}_{obj}(\cdot)caligraphic_L start_POSTSUBSCRIPT italic_o italic_b italic_j end_POSTSUBSCRIPT ( ⋅ ) are defined as Mean Square Error (MSE) loss functions, while reg()subscript𝑟𝑒𝑔\mathcal{L}_{reg}(\cdot)caligraphic_L start_POSTSUBSCRIPT italic_r italic_e italic_g end_POSTSUBSCRIPT ( ⋅ ) is represented by the Generalized Intersection over Union (GIoU) loss.
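As a rough illustration of Eq. (4), the following Python sketch sums an MSE term for the classification and objectness outputs and a GIoU term for the box regression. The function names, the `(x1, y1, x2, y2)` box format, the mean reduction, and the use of `1 - GIoU` as the regression loss are assumptions made for this example; they are not taken from the DyRoNet implementation.

```python
def mse(pred, target):
    """Mean squared error between two equal-length lists of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def giou(box_a, box_b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2).

    Assumes both boxes have positive area.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Area of the smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (enclose - union) / enclose


def giou_loss(pred_boxes, gt_boxes):
    """Mean of (1 - GIoU) over matched box pairs (assumed reduction)."""
    pairs = list(zip(pred_boxes, gt_boxes))
    return sum(1.0 - giou(p, g) for p, g in pairs) / len(pairs)


def sp_loss(branch, gt):
    """Eq. (4): L_cls + L_obj + L_reg for one branch."""
    return (
        mse(branch["cls"], gt["cls"])
        + mse(branch["obj"], gt["obj"])
        + giou_loss(branch["reg"], gt["reg"])
    )
```

In this sketch, a branch whose outputs equal the ground truth yields a loss of zero, while a pair of disjoint boxes is penalized by more than 1 because GIoU becomes negative.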

Effective and Efficient (E$^2$) Loss. During training, the streaming perception loss values from all branches are compiled into a vector $v^{sp}\in\mathbb{R}^{K}$, and the inference time costs are collected into $v^{time}\in\mathbb{R}^{K}$, where $K$ is the total number of branches in DyRoNet. To reduce the influence of hardware discrepancies, the inference times are normalized with the Softmax function, giving $\hat{v}^{time}=\mathrm{softmax}(v^{time})$. The representation for effective and efficient (E$^2$) decision-making is defined as:

$$f^{E^2}=\mathcal{O}_N(\mathop{\arg\min}_{k}(\mathrm{softmax}(v^{time})\cdot v^{sp})), \tag{5}$$

where $\mathcal{O}_N$ denotes one-hot encoding, producing a boolean vector of length $K$ with the value $1$ at the index of the estimated optimal branch at that moment. The E$^2$ Loss is then formulated as:

$$\mathcal{L}^{E^2}=\mathrm{KL}(f^{E^2},f^{r}), \tag{6}$$

where $f^{r}=\mathcal{R}(\Delta I_t)$ and $\mathrm{KL}$ is the Kullback-Leibler divergence, used to constrain the distribution.
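Eqs. (5) and (6) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the product in Eq. (5) is read as element-wise, the router output `router_probs` is assumed to already be a probability distribution over the $K$ branches, and the KL term is taken with the one-hot target as its first argument. All function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def e2_target(v_time, v_sp):
    """Eq. (5): one-hot vector marking the branch with the smallest
    time-weighted SP loss (element-wise product is an assumption)."""
    weights = softmax(v_time)
    scores = [w * loss for w, loss in zip(weights, v_sp)]
    best = min(range(len(scores)), key=scores.__getitem__)
    return [1.0 if k == best else 0.0 for k in range(len(scores))]


def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred); zero-probability target entries contribute 0."""
    return sum(
        t * math.log(t / max(p, eps)) for t, p in zip(target, pred) if t > 0.0
    )


def e2_loss(v_time, v_sp, router_probs):
    """Eq. (6): KL divergence between the E^2 target and the router output."""
    return kl_divergence(e2_target(v_time, v_sp), router_probs)
```

Because the target is one-hot, the loss in this sketch reduces to the negative log-probability that the router assigns to the selected branch.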

Overall, training DyRoNet balances the SP loss, which ensures the efficacy of each branch, against the E$^2$ loss, which optimizes efficiency. The objective is a model that delivers high accuracy in perception tasks while operating within acceptable latency constraints, a critical requirement for real-time applications. This balance enables DyRoNet to adapt dynamically to varying computational resources and environmental conditions in diverse streaming perception scenarios.

4 Experiments

| Methods | Latency (ms) | sAP ↑ | sAP$_{50}$ ↑ | sAP$_{75}$ ↑ | sAP$_s$ ↑ | sAP$_m$ ↑ | sAP$_l$ ↑ |
|---|---|---|---|---|---|---|---|
| Non-real-time detector-based methods | | | | | | | |
| Adaptive Streamer Ghosh et al. (2021) | - | 21.3 | 37.3 | 21.1 | 4.4 | 18.7 | 47.1 |
| Streamer (S=600) Li et al. (2020) | - | 20.4 | 35.6 | 20.8 | 3.6 | 18.0 | 47.2 |
| Streamer (S=900) Li et al. (2020) | - | 18.2 | 35.3 | 16.8 | 4.7 | 14.4 | 34.6 |
| Streamer+AdaScale Ghosh et al. (2021) | - | 13.8 | 23.4 | 14.2 | 0.2 | 9.0 | 39.9 |
| Real-time detector-based methods | | | | | | | |
| DAMO-StreamNet-L He et al. (2023) | 26.6 | 37.8 | 59.1 | 38.6 | 16.1 | 39.0 | 64.6 |
| LongShortNet-L Li et al. (2023) | 20.1 | 37.1 | 57.8 | 37.7 | 15.2 | 37.3 | 63.8 |
| StreamYOLO-L Yang et al. (2022) | 18.2 | 36.1 | 57.6 | 35.6 | 13.8 | 37.1 | 63.3 |
| DAMO-StreamNet-M He et al. (2023) | 24.3 | 35.7 | 56.7 | 35.9 | 14.5 | 36.3 | 63.3 |
| LongShortNet-M Li et al. (2023) | 17.5 | 34.1 | 54.8 | 34.6 | 13.3 | 35.3 | 58.1 |
| StreamYOLO-M Yang et al. (2022) | 18.2 | 32.9 | 54.0 | 32.5 | 12.4 | 34.8 | 58.1 |
| DAMO-StreamNet-S He et al. (2023) | 21.3 | 31.8 | 52.3 | 31.0 | 11.4 | 32.9 | 58.7 |
| LongShortNet-S Li et al. (2023) | 14.6 | 29.8 | 50.4 | 29.5 | 11.0 | 30.6 | 52.8 |
| StreamYOLO-S Yang et al. (2022) | 14.2 | 28.8 | 50.3 | 27.6 | 9.7 | 30.7 | 53.1 |
| DyRoNet | | | | | | | |
| DyRoNet (DAMO$_{\text{M + L}}$) | 37.61 | 37.8 (+2.1) | 58.8 (+2.1) | 38.8 (+2.9) | 16.1 (+1.6) | 39.0 (+2.7) | 64.0 (+0.7) |
| DyRoNet (LSN$_{\text{M + L}}$) | 29.05 | 36.9 (+2.8) | 58.2 (+3.4) | 37.4 (+2.8) | 14.9 (+1.6) | 37.5 (+2.2) | 63.3 (+5.2) |
| DyRoNet (sYOLO$_{\text{M + L}}$) | 23.51 | 35.0 (+2.1) | 55.7 (+1.7) | 35.5 (+3.0) | 13.7 (+1.3) | 36.2 (+1.4) | 61.1 (+3.0) |

Table 1: Comparison of DyRoNet with SOTA methods. Optimal values are highlighted in green font, and online evaluation latencies that reach real time are shown in red font.

4.1 Dataset and Metric

Dataset. We evaluate our method on the Argoverse-HD dataset Li et al. (2020), which is designed for streaming perception in autonomous driving scenarios. It comprises high-resolution RGB images captured while driving on urban streets, offering a realistic representation of diverse driving conditions. The dataset is split into a training set of 65 video clips and a validation set of 24 video clips. Each clip spans over 600 frames on average, giving approximately 39k training frames and around 15k validation frames. Notably, Argoverse-HD provides 2D object detection annotations at a high frame rate (30 fps), without relying on interpolated data.

Evaluation Metric. We adopt the streaming Average Precision (sAP) as the primary metric for performance evaluation. The sAP metric, widely recognized for its effectiveness in streaming perception tasks Li et al. (2020), offers a comprehensive assessment by calculating the mean Average Precision (mAP) across various Intersection over Union (IoU) thresholds, ranging from 0.5 to 0.95. This metric allows us to evaluate detection performance across different object sizes, including large, medium, and small objects, providing a robust measure of our model’s capability in real-world streaming perception scenarios.
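The threshold averaging described above can be sketched as follows. The text only gives the IoU range (0.5 to 0.95), so the step of 0.05 is an assumption borrowed from the common COCO-style convention, and the function names are hypothetical. The sketch covers only the averaging step; it does not implement the streaming matching between predictions and ground truth.

```python
def iou_thresholds(start=0.5, stop=0.95, step=0.05):
    """IoU thresholds from start to stop inclusive (step is an assumption)."""
    count = round((stop - start) / step) + 1
    return [round(start + i * step, 2) for i in range(count)]


def average_over_thresholds(ap_by_threshold):
    """Mean of per-threshold AP values, keyed by IoU threshold."""
    return sum(ap_by_threshold.values()) / len(ap_by_threshold)
```

With the default arguments, `iou_thresholds()` yields ten thresholds, and `average_over_thresholds` would be applied to the AP measured at each of them.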

4.2 Implementation Details

We tested three state-of-the-art streaming perception models: StreamYOLO Yang et al. (2022), LongShortNet Li et al. (2023), and DAMO-StreamNet He et al. (2023). These models serve as the branches of DyRoNet, and each comes with pre-trained parameters at three scales: small (S), medium (M), and large (L). In constructing the model bank $\mathcal{P}$ for DyRoNet, we selected different model configurations to evaluate performance across diverse scenarios. For instance, the notation DyRoNet (DAMO$_{\text{S + M}}$) represents a configuration where DyRoNet employs the small (S) and medium (M) scales of DAMO-StreamNet as its two branches. Similar notations are used for other model combinations, allowing a systematic exploration of the framework's adaptability and performance under varying computational constraints. All experiments were conducted on a computing platform equipped with four Nvidia 3090Ti GPUs, which provided a consistent and controlled environment for evaluating DyRoNet across different model configurations. For more implementation details, please refer to Appendix C.

4.3 Comparison with SOTA Methods

We compare DyRoNet with state-of-the-art methods on the Argoverse-HD dataset Li et al. (2020). For the competing methods, we take the reported performance directly from their original papers. DyRoNet with a model bank from the DAMO-StreamNet series achieves 37.8% sAP at 39.60 ms latency, on par with the highest sAP among the compared methods (DAMO-StreamNet-L, 37.8%). For the StreamYOLO and LongShortNet model banks, DyRoNet attains 36.9% and 37.1% sAP at 29.35 ms and 30.48 ms latency, respectively, compared with 36.1% and 37.1% for StreamYOLO-L and LongShortNet-L. This demonstrates the effectiveness of the systematic improvements in DyRoNet.

| Model Bank | Random | LoRA + Router |
|---|---|---|
| StreamYOLO$_{\text{S + M}}$ | 39.16 | 26.25 |
| StreamYOLO$_{\text{S + L}}$ | 24.04 | 29.35 |
| StreamYOLO$_{\text{M + L}}$ | 24.69 | 23.51 |
| LongShortNet$_{\text{S + M}}$ | 24.79 | 21.47 |
| LongShortNet$_{\text{S + L}}$ | 21.49 | 30.48 |
| LongShortNet$_{\text{M + L}}$ | 24.75 | 29.05 |
| DAMO-StreamNet$_{\text{S + M}}$ | 36.61 | 33.22 |
| DAMO-StreamNet$_{\text{S + L}}$ | 35.12 | 39.60 |
| DAMO-StreamNet$_{\text{M + L}}$ | 37.30 | 37.61 |

Table 2: Comparison of inference time (ms) on a single RTX 3090. The better inference time between random selection and the trained router is highlighted in green font.

| Model Bank | Full | LoRA |
|---|---|---|
| StreamYOLO$_{\text{S + M}}$ | 32.9 | 33.7 |
| StreamYOLO$_{\text{S + L}}$ | 36.1 | 36.9 |
| StreamYOLO$_{\text{M + L}}$ | 36.2 | 35.0 |
| LongShortNet$_{\text{S + M}}$ | 29.0 | 30.5 |
| LongShortNet$_{\text{S + L}}$ | 36.2 | 37.1 |
| LongShortNet$_{\text{M + L}}$ | 36.3 | 36.9 |
| DAMO-StreamNet$_{\text{S + M}}$ | 34.8 | 35.5 |
| DAMO-StreamNet$_{\text{S + L}}$ | 31.1 | 37.8 |
| DAMO-StreamNet$_{\text{M + L}}$ | 37.4 | 37.8 |

Table 3: Comparison of LoRA fine-tuning and full fine-tuning. Full denotes full fine-tuning and LoRA denotes LoRA fine-tuning. The best values between Full and LoRA are shown in red font.

| Setting | Model Bank | $b_0$ | $b_1$ | $b_2$ | Random | sAP |
|---|---|---|---|---|---|---|
| $K=2$, same model | DAMO$_{\text{S + M}}$ | 31.8 | 35.5 | - | 33.5 | 35.5 |
| | DAMO$_{\text{S + L}}$ | 31.8 | 37.8 | - | 34.5 | 37.8 |
| | DAMO$_{\text{M + L}}$ | 35.5 | 37.8 | - | 36.5 | 37.8 |
| | LSN$_{\text{S + M}}$ | 29.8 | 34.1 | - | 31.8 | 30.5 |
| | LSN$_{\text{S + L}}$ | 29.8 | 37.1 | - | 33.4 | 37.1 |
| | LSN$_{\text{M + L}}$ | 34.1 | 37.1 | - | 35.6 | 36.9 |
| | sYOLO$_{\text{S + M}}$ | 29.5 | 33.7 | - | 31.5 | 33.7 |
| | sYOLO$_{\text{S + L}}$ | 29.5 | 36.9 | - | 33.2 | 36.9 |
| | sYOLO$_{\text{M + L}}$ | 33.7 | 36.9 | - | 35.4 | 35.0 |
| $K=2$, different model | DAMO$_{\text{S}}$ + LSN$_{\text{S}}$ | 31.8 | 29.8 | - | 30.7 | 30.5 |
| | DAMO$_{\text{S}}$ + LSN$_{\text{M}}$ | 31.8 | 34.1 | - | 32.6 | 34.1 |
| | DAMO$_{\text{S}}$ + LSN$_{\text{L}}$ | 31.8 | 37.1 | - | 34.3 | 31.8 |
| | DAMO$_{\text{M}}$ + LSN$_{\text{S}}$ | 35.5 | 29.8 | - | 32.6 | 29.8 |
| | DAMO$_{\text{L}}$ + LSN$_{\text{S}}$ | 37.8 | 29.8 | - | 33.8 | 29.8 |
| $K=3$, same model | DAMO$_{\text{S + M + L}}$ | 31.8 | 35.5 | 37.8 | 34.8 | 37.7 |
| | LSN$_{\text{S + M + L}}$ | 29.8 | 34.1 | 37.1 | 33.5 | 36.1 |
| | sYOLO$_{\text{S + M + L}}$ | 29.5 | 33.7 | 36.9 | 33.4 | 36.6 |

Table 4: Ablation of model bank setting. $K$ is the number of models in the bank $\mathcal{P}$.

4.4 Inference Time

We analyzed the trade-off between DyRoNet's inference time and performance under different model bank selection strategies. Table 2 presents the results, with the better time for each model bank in green. Compared with random branch selection, the trained router with LoRA fine-tuning is faster for four of the nine model banks (for example, 26.25 ms versus 39.16 ms for StreamYOLO$_{\text{S + M}}$) and slower for the remaining five, most visibly for the S + L banks. In exchange, the router reaches a higher sAP than random selection in most configurations (Table 4). This balance between real-time operation and perception quality is critical for autonomous driving, where both factors are paramount, and the results support the practicality of DyRoNet's dynamic model selection.
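The comparison in Table 2 can be checked with a short script. The numbers below are transcribed from Table 2; the dictionary layout and function names are only illustrative.

```python
# Inference times in ms from Table 2: (random selection, LoRA + router).
TABLE2_MS = {
    "StreamYOLO S+M": (39.16, 26.25),
    "StreamYOLO S+L": (24.04, 29.35),
    "StreamYOLO M+L": (24.69, 23.51),
    "LongShortNet S+M": (24.79, 21.47),
    "LongShortNet S+L": (21.49, 30.48),
    "LongShortNet M+L": (24.75, 29.05),
    "DAMO-StreamNet S+M": (36.61, 33.22),
    "DAMO-StreamNet S+L": (35.12, 39.60),
    "DAMO-StreamNet M+L": (37.30, 37.61),
}


def time_saved_ms(table):
    """Random time minus routed time; positive means the router is faster."""
    return {name: round(rand - routed, 2) for name, (rand, routed) in table.items()}


def faster_with_router(table):
    """Model banks where the trained router beats random selection."""
    return [name for name, (rand, routed) in table.items() if routed < rand]
```

Calling `faster_with_router(TABLE2_MS)` lists the model banks for which the routed configuration is the faster one, and `time_saved_ms(TABLE2_MS)` gives the per-bank difference in milliseconds.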

4.5 Ablation Study

Router Network. To validate the effectiveness of the Router Network based on frame difference, we conducted comparative experiments using the frame difference $\Delta I_t$, the current frame $I_t$, and the concatenation of the current frame with the previous frame $[I_t + I_{t-1}]$ as the input modality of the Router Network. The experimental results are presented in Tab. 5. To control variables, we froze the model bank during training and trained only the Router Network, and the model bank involved only the three StreamYOLO configurations. It can be observed that the frame difference performs better than the other two input modalities for StreamYOLO$_{\text{S + L}}$ (35.0) and StreamYOLO$_{\text{M + L}}$ (34.6), although not for StreamYOLO$_{\text{S + M}}$ (32.6 versus 33.7). This indicates that frame differences offer advantages in comprehending and characterizing environmental speed. Conversely, it suggests that single-frame or multi-frame inputs make the lightweight model bank selection model less effective.

| Model Bank | Input Modality | LoRA |
|---|---|---|
| StreamYOLO$_{\text{S + M}}$ | $I_t$ | 33.7 |
| StreamYOLO$_{\text{S + M}}$ | $[I_t + I_{t-1}]$ | 33.7 |
| StreamYOLO$_{\text{S + M}}$ | $\Delta I_t$ | 32.6 |
| StreamYOLO$_{\text{S + L}}$ | $I_t$ | 34.1 |
| StreamYOLO$_{\text{S + L}}$ | $[I_t + I_{t-1}]$ | 30.2 |
| StreamYOLO$_{\text{S + L}}$ | $\Delta I_t$ | 35.0 |
| StreamYOLO$_{\text{M + L}}$ | $I_t$ | 33.7 |
| StreamYOLO$_{\text{M + L}}$ | $[I_t + I_{t-1}]$ | 33.7 |
| StreamYOLO$_{\text{M + L}}$ | $\Delta I_t$ | 34.6 |

Table 5: Ablation of router network input. The optimal results are marked in red font under the same model bank setting.
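The three router inputs compared in this ablation can be sketched as below. Frames are represented as single-channel 2D lists for brevity, the frame difference is assumed to be the pixel-wise difference $I_t - I_{t-1}$, and the function names and modality labels are hypothetical.

```python
def frame_difference(curr, prev):
    """Pixel-wise difference, assumed here to be I_t - I_{t-1}."""
    return [
        [c - p for c, p in zip(row_c, row_p)]
        for row_c, row_p in zip(curr, prev)
    ]


def router_input(curr, prev, modality):
    """Return the router input as a list of channels.

    "current"    -> [I_t]
    "concat"     -> [I_t, I_{t-1}] stacked along the channel axis
    "difference" -> [I_t - I_{t-1}]
    """
    if modality == "current":
        return [curr]
    if modality == "concat":
        return [curr, prev]
    if modality == "difference":
        return [frame_difference(curr, prev)]
    raise ValueError(f"unknown modality: {modality}")
```

Note that the difference input has the same channel count as a single frame, whereas concatenation doubles it, so in this sketch the difference modality carries temporal change without enlarging the router's input.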

Branch Selection. Our experiments on streaming perception models show that the choice of model scales in the bank affects performance. We found that combining large and small models strikes an optimal balance between speed and accuracy. This conclusion is supported by the results in Tab. 4, where the small-large pairing matches or outperforms both the small-medium and medium-large combinations for all three model families. Our findings highlight the importance of strategic model scaling in streaming perception and provide a framework for future model optimization in similar domains.

Fine-tuning Scheme. We compared direct (full) fine-tuning with the Low-Rank Adapter (LoRA) fine-tuning strategy Zhu et al. (2023) for streaming perception models. Results are listed in Tab. 3. LoRA fine-tuning surpasses direct fine-tuning in eight of the nine model bank configurations, with the DAMO-StreamNet-based model bank configuration realizing an absolute gain of over 1.6%. This supports LoRA's ability to avoid the forgetting and data distribution bias inherent to direct fine-tuning and to mitigate the overfitting that may arise during model bank fine-tuning, leading to a stable overall performance improvement.

LoRA Rank. To assess the impact of different LoRA ranks in DyRoNet, we conducted experiments with $\mathrm{rank}=32, 16, 8$. All experiments were trained for 5 epochs, alternating between Router Network training and model bank fine-tuning. The results are presented in Tab. 6. Performance is better with $\mathrm{rank}=32$ than with $\mathrm{rank}=16$ or $\mathrm{rank}=8$, while the LoRA parameters occupy only about 10% to 14% of the total model parameters. Therefore, $\mathrm{rank}=32$ was selected as the default setting for our experiments. Although a smaller LoRA rank occupies fewer parameters, it leads to a rapid performance decay. These results show that LoRA fine-tuning can match or approach the best single branch while training only a small fraction of the parameters.

| Model Bank | Rank | branch 0 | branch 1 | after train | Param. (%) |
|---|---|---|---|---|---|
| DAMO$_{\text{S + L}}$ | 32 | 31.8 | 37.8 | 37.8 | 14.35 |
| DAMO$_{\text{S + L}}$ | 16 | 31.8 | 37.8 | 35.9 | 7.73 |
| DAMO$_{\text{S + L}}$ | 8 | 31.8 | 37.8 | 35.9 | 4.02 |
| LSN$_{\text{S + L}}$ | 32 | 29.8 | 37.1 | 36.9 | 10.39 |
| LSN$_{\text{S + L}}$ | 16 | 29.8 | 37.1 | 30.6 | 5.48 |
| LSN$_{\text{S + L}}$ | 8 | 29.8 | 37.1 | 30.6 | 5.48 |
| sYOLO$_{\text{S + L}}$ | 32 | 29.5 | 36.9 | 36.6 | 10.21 |
| sYOLO$_{\text{S + L}}$ | 16 | 29.5 | 36.9 | 35.0 | 5.38 |
| sYOLO$_{\text{S + L}}$ | 8 | 29.5 | 36.9 | 35.0 | 2.7 |

Table 6: Ablation of LoRA rank. The Param. column reports only the proportion of parameters occupied by LoRA relative to the entire model. The best performance under the same model bank setting is highlighted in red font.
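To see why the parameter share in the Param. column shrinks with the rank, consider the standard LoRA construction, in which a weight of shape d_out x d_in receives a rank-r update built from two factors of shapes d_out x r and r x d_in. The sketch below counts the added parameters under the assumption that adapters are attached to dense weight matrices; the layer shapes are hypothetical and do not correspond to the actual DyRoNet branches.

```python
def lora_extra_params(d_in, d_out, rank):
    """Parameters added by one rank-`rank` adapter: (d_out x r) + (r x d_in)."""
    return rank * (d_in + d_out)


def lora_fraction(layer_shapes, rank):
    """Share of LoRA parameters in the adapted model (base + adapters).

    layer_shapes: iterable of (d_in, d_out) pairs for the adapted layers.
    """
    shapes = list(layer_shapes)
    base = sum(d_in * d_out for d_in, d_out in shapes)
    extra = sum(lora_extra_params(d_in, d_out, rank) for d_in, d_out in shapes)
    return extra / (base + extra)
```

Since the added parameter count is linear in the rank, halving the rank roughly halves the LoRA share, which is the same trend as the DAMO rows of Table 6 (14.35, 7.73, 4.02).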

5 Conclusion

In conclusion, we present the Dynamic Routering Network (DyRoNet), a system that dynamically selects specialized detectors for varied environmental conditions with minimal computational overhead. Our fine-tuning scheme, featuring a Low-Rank Adapter, mitigates distribution bias and overfitting, enhancing scene-specific performance. Experimental results validate DyRoNet's state-of-the-art performance, offering a benchmark for streaming perception and insights for future research. We expect DyRoNet's principles to inform the development of more advanced and reliable systems.

References

  • Bejnordi et al. [2019] Babak Ehteshami Bejnordi, Tijmen Blankevoort, and Max Welling. Batch-shaping for learning conditional channel gated networks. In International Conference on Learning Representations, 2019.
  • Cai et al. [2021] Shaofeng Cai, Yao Shu, and Wei Wang. Dynamic routing networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3588–3597, January 2021.
  • Chen et al. [2020] Yinpeng Chen, Xiyang Dai, Mengchen Liu, Dongdong Chen, Lu Yuan, and Zicheng Liu. Dynamic convolution: Attention over convolution kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2020.
  • Chen et al. [2023] Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers, 2023.
  • Ge et al. [2021] Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430, 2021.
  • Ghosh et al. [2021] Anurag Ghosh, Akshay Nambi, Aditya Singh, Harish Yvs, and Tanuja Ganu. Adaptive streaming perception using deep reinforcement learning. arXiv preprint arXiv:2106.05665, 2021.
  • Guo et al. [2019] Junyao Guo, Unmesh Kurup, and Mohak Shah. Is it safe to drive? an overview of factors, metrics, and datasets for driveability assessment in autonomous driving. IEEE Transactions on Intelligent Transportation Systems, 21(8):3135–3151, 2019.
  • Han et al. [2021] Yizeng Han, Gao Huang, Shiji Song, Le Yang, Honghui Wang, and Yulin Wang. Dynamic neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7436–7456, 2021.
  • He et al. [2023] Jun-Yan He, Zhi-Qi Cheng, Chenyang Li, Wangmeng Xiang, Binghui Chen, Bin Luo, Yifeng Geng, and Xuansong Xie. Damo-streamnet: Optimizing streaming perception in autonomous driving. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23, pages 810–818. International Joint Conferences on Artificial Intelligence Organization, 8 2023.
  • Hu et al. [2021] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  • Huang and Chen [2023] Yihui Huang and Ningjiang Chen. Mtd: Multi-timestep detector for delayed streaming perception. In Chinese Conference on Pattern Recognition and Computer Vision, pages 337–349. Springer, 2023.
  • Huang et al. [2018] Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens van der Maaten, and Kilian Weinberger. Multi-scale dense networks for resource efficient image classification. In International Conference on Learning Representations, 2018.
  • Jo et al. [2022] Wonwoo Jo, Kyungshin Lee, Jaewon Baik, Sangsun Lee, Dongho Choi, and Hyunkyoo Park. Dade: Delay-adaptive detector for streaming perception. arXiv preprint arXiv:2212.11558, 2022.
  • Jocher et al. [2021] Glenn Jocher, Alex Stoken, Jirka Borovec, Ayush Chaurasia, Liu Changyu, Adam Hogan, Jan Hajek, Laurentiu Diaconu, Yonghye Kwon, Yann Defretin, et al. ultralytics/yolov5: v5.0 - yolov5-p6 1280 models, aws, supervise.ly and youtube integrations. Zenodo, 2021.
  • Lan et al. [2023] Jin-Peng Lan, Zhi-Qi Cheng, Jun-Yan He, Chenyang Li, Bin Luo, Xu Bao, Wangmeng Xiang, Yifeng Geng, and Xuansong Xie. Procontext: Exploring progressive context transformer for tracking. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1–5. IEEE, 2023.
  • Li et al. [2020] Mengtian Li, Yu-Xiong Wang, and Deva Ramanan. Towards streaming perception. In Proceedings of the European Conference on Computer Vision, pages 473–488. Springer, 2020.
  • Li et al. [2023] Chenyang Li, Zhi-Qi Cheng, Jun-Yan He, Pengyu Li, Bin Luo, Hanyuan Chen, Yifeng Geng, Jin-Peng Lan, and Xuansong Xie. Longshortnet: Exploring temporal and semantic features fusion in streaming perception. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1–5. IEEE, 2023.
  • Lin et al. [2023] Zhihao Lin, Yongtao Wang, Jinhe Zhang, and Xiaojie Chu. Dynamicdet: A unified dynamic architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6282–6291, June 2023.
  • Mahaur and Mishra [2023] Bharat Mahaur and KK Mishra. Small-object detection based on yolov5 in autonomous driving systems. Pattern Recognition Letters, 168:115–122, 2023.
  • Muhammad et al. [2020] Khan Muhammad, Amin Ullah, Jaime Lloret, Javier Del Ser, and Victor Hugo C de Albuquerque. Deep learning for safe autonomous driving: Current challenges and future directions. IEEE Transactions on Intelligent Transportation Systems, 22(7):4316–4336, 2020.
  • Qiao et al. [2022] Jian-Jun Qiao, Zhi-Qi Cheng, Xiao Wu, Wei Li, and Ji Zhang. Real-time semantic segmentation with parallel multiple views feature augmentation. In ACM International Conference on Multimedia, pages 6300–6308, 2022.
  • Sela et al. [2022] Gur-Eyal Sela, Ionel Gog, Justin Wong, Kumar Krishna Agrawal, Xiangxi Mo, Sukrit Kalra, Peter Schafhalter, Eric Leong, Xin Wang, Bharathan Balaji, et al. Context-aware streaming perception in dynamic environments. In Proceedings of the European Conference on Computer Vision, pages 621–638. Springer, 2022.
  • Shazeer et al. [2017] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
  • Su et al. [2019] Hang Su, Varun Jampani, Deqing Sun, Orazio Gallo, Erik Learned-Miller, and Jan Kautz. Pixel-adaptive convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2019.
  • Wang et al. [2018] Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, and Joseph E. Gonzalez. Skipnet: Learning dynamic routing in convolutional networks. In Proceedings of the European Conference on Computer Vision, September 2018.
  • Wang et al. [2020] Yulin Wang, Kangchen Lv, Rui Huang, Shiji Song, Le Yang, and Gao Huang. Glance and focus: a dynamic approach to reducing spatial redundancy in image classification. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 2432–2444. Curran Associates, Inc., 2020.
  • Wang et al. [2023a] Hao Wang, Zhi-Qi Cheng, Jingdong Sun, Xin Yang, Xiao Wu, Hongyang Chen, and Yan Yang. Debunking free fusion myth: Online multi-view anomaly detection with disentangled product-of-experts modeling. In ACM International Conference on Multimedia, pages 3277–3286, 2023.
  • Wang et al. [2023b] Xiaofeng Wang, Zheng Zhu, Yunpeng Zhang, Guan Huang, Yun Ye, Wenbo Xu, Ziwei Chen, and Xingang Wang. Are we ready for vision-centric driving streaming perception? the asap benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9600–9610, 2023.
  • Wang et al. [2023c] Zixiao Wang, Weiwei Zhang, and Bo Zhao. Estimating optical flow with streaming perception and changing trend aiming to complex scenarios. Applied Sciences, 13(6):3907, 2023.
  • Yang et al. [2019] Brandon Yang, Gabriel Bender, Quoc V Le, and Jiquan Ngiam. Condconv: Conditionally parameterized convolutions for efficient inference. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • Yang et al. [2022] Jinrong Yang, Songtao Liu, Zeming Li, Xiaoping Li, and Jian Sun. Real-time object detection for streaming perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5385–5395, June 2022.
  • Zhang et al. [2023] Ji Zhang, Xiao Wu, Zhi-Qi Cheng, Qi He, and Wei Li. Improving anomaly segmentation with multi-granularity cross-domain alignment. In ACM International Conference on Multimedia, pages 8515–8524, 2023.
  • Zhu et al. [2019] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2019.
  • Zhu et al. [2023] Jiawen Zhu, Zhi-Qi Cheng, Jun-Yan He, Chenyang Li, Bin Luo, Huchuan Lu, Yifeng Geng, and Xuansong Xie. Tracking with human-intent reasoning. arXiv preprint arXiv:2312.17448, 2023.
  • Zou et al. [2023] Zhengxia Zou, Keyan Chen, Zhenwei Shi, Yuhong Guo, and Jieping Ye. Object detection in 20 years: A survey. Proceedings of the IEEE, 2023.

DyRoNet: A Low-Rank Adapter Enhanced Dynamic Routing Network for Streaming Perception
(Supplementary Material)

The appendix complements the main paper by providing in-depth research details and extended experimental results. It is organized as follows:

  1. Analysis of Environmental Factors Affecting Streaming Perception: Sec. A
    • Impact of Weather Conditions: Sec. A.1
    • Quantitative Analysis of Objects: Sec. A.2
    • Proportion of Small Objects: Sec. A.3
    • Environmental Speed Dynamics: Sec. A.4

  2. Expanded Experimental Results: Sec. B
    • Inference Time Analysis: Sec. B.1

  3. Detailed Description of DyRoNet: Sec. C
    • Selection of Pre-trained Model: Sec. C.1
    • Hyperparameter Settings: Sec. C.2

Appendix A Factor Analysis in Streaming Perception

In developing DyRoNet, we undertook an extensive survey and analysis to identify the key factors in autonomous driving scenarios that could affect streaming perception. This analysis used the Argoverse-HD dataset Li et al. [2020], a benchmark in the field of streaming perception. The primary goal was to isolate the most critical factor affecting streaming perception performance. As elaborated in the main text, our analysis identified the speed of the environment as the predominant factor, and DyRoNet is therefore tailored to address this specific aspect. The analysis focuses on four primary elements: weather conditions, object quantity, small object proportion, and environmental speed. We examined each of these factors in turn to evaluate its impact on streaming perception within autonomous driving.

Figure 4: Illustrative Examples of Varied Weather Conditions and Times of Day: (a) Sunny during Daytime, (b) Cloudy during Daytime, (c) Rainy during Daytime, (d) Rainy during Nighttime, (e) Sunny during Nighttime.

A.1 Impact of Weather Conditions

The Argoverse-HD dataset, comprising testing, training, and validation sets, includes a diverse range of weather conditions. Specifically, the dataset contains 24, 65, and 24 video segments in the testing, training, and validation sets, respectively, with frame counts ranging from 400 to 900 per segment. Tab. 7 details the distribution of various weather types across these subsets. Fig. 4 provides visual examples of different weather conditions captured in the dataset. A clear variation in visual clarity and perception difficulty is observable under different conditions, with scenarios like Rainy + Night appearing visually more challenging than Sunny + Day or Cloudy + Day.

To evaluate the impact of weather conditions on streaming perception, we conducted tests using a range of pre-trained models from StreamYOLO Yang et al. [2022], LongShortNet Li et al. [2023], and DAMO-StreamNet He et al. [2023], employing various scales and settings. The results, presented in Tab. 8, indicate that performance is generally better during Day conditions than at Night, although the validation set contains only a single Night clip. This suggests that weather conditions influence streaming perception.

However, even within the same weather conditions, model performance varies significantly, with accuracy ranging from below 10% to above 70%. Fig. 5 illustrates this point by comparing frames from two video segments (Clip ids: 00c561 and 395560) recorded under identical weather conditions, where the performance difference of the same model on these segments is as high as 32.1%. This observation suggests that other critical environmental factors affect streaming perception: weather, while influential, is not the sole determinant of model performance.

| Weather | test | train | val |
|---|---|---|---|
| Sunny + Day | 8 | 34 | 8 |
| Cloudy + Day | 13 | 27 | 15 |
| Rainy + Day | 1 | 1 | 0 |
| Rainy + Night | 1 | 0 | 0 |
| Sunny + Night | 1 | 3 | 1 |

Table 7: Distribution of Weather Conditions in Testing, Training, and Validation Sets: This table lists the number of video segments recorded under each weather condition in the testing, training, and validation sets of the Argoverse-HD dataset, providing an overview of the environmental variability within each subset.
Figure 5: Rapid Fluctuations in Performance Under Identical Weather Conditions: (a) Clip id: 00c561 shows a Streaming Average Precision (sAP) of 16.2% using the StreamYOLO-s model, (b) Clip id: 395560 demonstrates a significantly higher sAP of 48.3% under the same model and weather condition, illustrating the variability in model performance even under consistent environmental factors.
| Clip ID | Weather | StreamYOLO s 1x | StreamYOLO m 1x | StreamYOLO l 1x | StreamYOLO l 2x | StreamYOLO l still | LongShortNet s 1x | LongShortNet m 1x | LongShortNet l 1x | LongShortNet l high | DAMO-StreamNet s 1x | DAMO-StreamNet m 1x | DAMO-StreamNet l 1x | DAMO-StreamNet l high |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1d6767 | Cloudy + Day | 20.9 | 22.8 | 24.9 | 7.0 | 26.7 | 20.9 | 23.4 | 25.0 | 36.4 | 21.3 | 24.6 | 26.0 | 34.2 |
| 5ab269 | Cloudy + Day | 25.6 | 30.0 | 31.6 | 6.9 | 33.3 | 25.2 | 29.5 | 31.4 | 40.1 | 26.9 | 29.0 | 31.7 | 41.2 |
| 70d2ae | Cloudy + Day | 26.3 | 31.4 | 37.9 | 9.4 | 41.0 | 25.2 | 31.0 | 37.5 | 44.7 | 27.7 | 34.8 | 34.3 | 44.9 |
| 337375 | Cloudy + Day | 24.8 | 24.8 | 33.4 | 17.1 | 35.3 | 27.2 | 27.9 | 34.7 | 38.0 | 26.4 | 37.5 | 28.8 | 39.1 |
| 7d37fc | Cloudy + Day | 32.5 | 36.4 | 41.5 | 15.5 | 42.1 | 33.6 | 37.7 | 40.8 | 45.8 | 35.2 | 40.1 | 39.4 | 45.7 |
| f1008c | Cloudy + Day | 38.6 | 42.0 | 44.4 | 11.3 | 46.2 | 40.0 | 40.4 | 45.3 | 50.3 | 39.1 | 42.4 | 45.8 | 54.1 |
| f9fa39 | Cloudy + Day | 35.7 | 39.5 | 41.8 | 9.9 | 48.1 | 33.2 | 39.8 | 42.9 | 50.1 | 38.8 | 44.1 | 44.3 | 51.4 |
| cd6473 | Cloudy + Day | 40.0 | 45.7 | 44.0 | 11.3 | 52.7 | 36.6 | 47.3 | 47.3 | 54.0 | 40.2 | 44.6 | 47.9 | 54.7 |
| cb762b | Cloudy + Day | 36.4 | 41.3 | 44.3 | 10.8 | 44.8 | 36.9 | 41.4 | 44.4 | 57.7 | 40.9 | 44.8 | 43.7 | 57.6 |
| aeb73d | Cloudy + Day | 39.6 | 44.6 | 45.2 | 12.5 | 46.7 | 39.2 | 46.7 | 45.9 | 52.3 | 42.6 | 46.4 | 47.5 | 51.3 |
| cb0cba | Cloudy + Day | 48.3 | 47.5 | 52.1 | 13.8 | 50.9 | 46.0 | 47.5 | 50.4 | 55.5 | 47.1 | 47.7 | 51.5 | 59.4 |
| e9a962 | Cloudy + Day | 45.6 | 53.8 | 55.4 | 15.8 | 58.8 | 44.0 | 52.8 | 55.6 | 60.7 | 45.1 | 50.2 | 52.9 | 56.2 |
| 2d12da | Cloudy + Day | 50.8 | 56.5 | 56.2 | 11.9 | 58.8 | 48.5 | 54.6 | 56.6 | 59.1 | 53.1 | 54.8 | 57.5 | 63.8 |
| 85bc13 | Cloudy + Day | 56.2 | 56.8 | 60.1 | 19.5 | 62.1 | 55.3 | 58.2 | 59.2 | 63.5 | 54.9 | 58.3 | 59.6 | 67.3 |
| 00c561 | Sunny + Day | 16.2 | 19.0 | 20.5 | 5.1 | 22.2 | 17.6 | 20.1 | 20.2 | 26.4 | 17.9 | 19.3 | 21.5 | 25.2 |
| c9d6eb | Sunny + Day | 22.5 | 28.9 | 32.5 | 7.5 | 35.3 | 22.6 | 28.8 | 32.9 | 39.1 | 24.5 | 26.0 | 28.4 | 38.6 |
| cd5bb9 | Sunny + Day | 23.3 | 24.9 | 25.8 | 6.2 | 27.2 | 23.4 | 25.2 | 25.8 | 30.4 | 23.4 | 25.7 | 26.2 | 31.5 |
| 6db21f | Sunny + Day | 24.1 | 26.4 | 27.0 | 6.7 | 28.9 | 23.3 | 27.0 | 27.0 | 34.7 | 25.1 | 28.0 | 28.7 | 37.0 |
| 647240 | Sunny + Day | 27.1 | 29.3 | 31.2 | 7.8 | 34.1 | 26.5 | 30.1 | 31.5 | 38.8 | 26.9 | 32.0 | 32.0 | 38.4 |
| da734d | Sunny + Day | 30.2 | 33.4 | 37.0 | 8.8 | 39.9 | 29.2 | 34.4 | 37.5 | 42.6 | 34.2 | 35.7 | 38.2 | 43.1 |
| 5f317f | Sunny + Day | 31.9 | 42.3 | 45.9 | 8.9 | 50.1 | 32.8 | 42.0 | 46.1 | 51.2 | 40.0 | 44.6 | 47.0 | 54.0 |
| 395560 | Sunny + Day | 49.3 | 61.2 | 60.6 | 11.3 | 72.1 | 51.7 | 60.7 | 58.5 | 65.4 | 58.9 | 63.4 | 57.8 | 59.6 |
| b1ca08 | Sunny + Day | 60.0 | 62.1 | 68.4 | 22.4 | 67.9 | 61.7 | 61.4 | 67.7 | 70.6 | 59.6 | 65.0 | 67.7 | 68.6 |
| 033669 | Sunny + Night | 18.0 | 23.5 | 25.7 | 6.6 | 27.4 | 18.5 | 23.6 | 25.1 | 27.6 | 21.8 | 22.7 | 23.8 | 27.5 |
| Overall | | 29.8 | 33.7 | 36.9 | 34.6 | 39.4 | 29.8 | 34.1 | 37.1 | 42.7 | 31.8 | 35.5 | 37.8 | 43.3 |

Table 8: Offline Evaluation Results on the Argoverse-HD Validation Dataset: It records the sAP scores across the 0.50 to 0.95 range for each clip. The optimal and worst results are highlighted in green and red font under the same weather conditions. The notation “l high” is used as an abbreviation for the resolution 1200x1920, providing a concise representation of the data.
Figure 6: Histograms Depicting Object Quantity in the Argoverse-HD Dataset: This figure presents two histograms, (a) representing the distribution of the number of objects per frame in the training set of Argoverse-HD, and (b) showing the same distribution in the validation set. These histograms provide a visual analysis of object frequency and variability within different sets of the dataset.

A.2 Analysis of Object Quantity Impact

To assess the impact of the number of objects on streaming perception, we conducted a statistical analysis of object counts per frame in the Argoverse-HD dataset, encompassing both training and validation sets. The results of this analysis are depicted in Fig. 6, which shows histograms of the number of objects in individual frames. The variance of the distribution is notable, with values of 74.66 for the training set and 75.39 for the validation set, indicating significant fluctuation in the number of objects across frames. Additionally, as shown in Tab. 9, there is considerable variability in object counts across different video segments. This observation led us to further investigate the potential correlation between object quantity and model performance fluctuations.

To explore this correlation, we calculated the average number of objects per frame for each segment within the Argoverse-HD validation set. The findings, detailed in Tab. 9, include the average object counts alongside Spearman correlation coefficients, which measure the relationship between object quantity and model performance. The absolute values of these coefficients are all on the order of 1e-2 (0.020 to 0.052). Coefficients this close to zero suggest that the number of objects present in the environment has no strong or significant correlation with the performance of streaming perception models. In other words, our analysis indicates that the sheer quantity of objects within the environment is not a predominant factor influencing the efficacy of streaming perception.
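As a rough illustration of the correlation measure used above, the following minimal sketch computes a Spearman coefficient in plain Python by ranking both variables (ties share their average rank) and taking the Pearson correlation of the ranks. The helper names `average_ranks`, `pearson`, and `spearman` are our own, and the sample values are only the first six rows of Tab. 9 (Mean Obj and sYOLO columns), so the result is not expected to match the coefficient reported for the full validation set.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative subset only: first six rows of Tab. 9.
mean_obj = [35.30, 30.89, 25.16, 23.75, 23.37, 23.31]
syolo_sap = [20.9, 32.5, 30.2, 40.0, 25.6, 36.4]
print(f"Spearman coefficient on the subset: {spearman(mean_obj, syolo_sap):.3f}")
```

A coefficient near zero, as reported in Tab. 9, means that ordering the clips by object count says little about how they are ordered by sAP.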

| Clip ID | Mean Obj ↑ | sYOLO | LSN | DAMO |
|---|---|---|---|---|
| 1d6767 | 35.30 | 20.9 | 20.9 | 21.3 |
| 7d37fc | 30.89 | 32.5 | 33.6 | 35.2 |
| da734d | 25.16 | 30.2 | 29.2 | 34.2 |
| cd6473 | 23.75 | 40.0 | 36.6 | 40.2 |
| 5ab269 | 23.37 | 25.6 | 25.2 | 26.9 |
| cb762b | 23.31 | 36.4 | 36.9 | 40.9 |
| f1008c | 23.08 | 38.6 | 40.0 | 39.1 |
| e9a962 | 21.58 | 45.6 | 44.0 | 45.1 |
| 70d2ae | 21.38 | 26.3 | 25.2 | 27.7 |
| 2d12da | 19.33 | 50.8 | 48.5 | 53.1 |
| 337375 | 18.19 | 24.8 | 27.2 | 26.4 |
| f9fa39 | 17.46 | 35.7 | 33.2 | 38.8 |
| aeb73d | 16.82 | 39.6 | 39.2 | 42.6 |
| 6db21f | 16.30 | 24.1 | 23.3 | 25.1 |
| 647240 | 14.18 | 27.1 | 26.5 | 26.9 |
| b1ca08 | 14.08 | 60.0 | 61.7 | 59.6 |
| 85bc13 | 12.06 | 56.2 | 55.3 | 54.9 |
| 033669 | 11.89 | 18.0 | 18.5 | 21.8 |
| 00c561 | 10.06 | 16.2 | 17.6 | 17.9 |
| cb0cba | 10.04 | 48.3 | 46.0 | 47.1 |
| 395560 | 10.00 | 49.3 | 51.7 | 58.9 |
| cd5bb9 | 8.95 | 23.3 | 23.4 | 23.4 |
| c9d6eb | 7.88 | 22.5 | 22.6 | 24.5 |
| 5f317f | 6.92 | 31.9 | 32.8 | 40.0 |
| Coefficient | | 0.052 | 0.035 | -0.020 |

Table 9: Average number of objects per frame for each segment in the Argoverse-HD validation set, along with the Spearman correlation coefficients between object quantity and the performance of streaming perception models. The absolute values of the coefficients are all on the order of 1e-2, indicating a weak correlation. This suggests that the total number of objects in the environment does not significantly affect the performance of streaming perception models, and that object quantity is not a primary factor in the efficacy of streaming perception tasks.

A.3 Analysis of the Proportion of Small Objects

The influence of small objects on perception models, particularly in autonomous driving scenarios, has been underscored in studies like Mahaur and Mishra [2023] and Yang et al. [2022]. In such scenarios, even minor shifts in viewing angles can cause notable relative displacement of small objects, posing a challenge for perception models in processing streaming data effectively. This observation prompted us to closely examine the proportion of small objects in the environment.

To begin, we analyzed the area ratios of objects in both the training and validation sets of the Argoverse-HD dataset. This involved calculating the ratio of the pixel area covered by an object’s bounding box to the total pixel area of the frame. We visualized these ratios in histograms shown in Fig. 7. The analysis revealed that the mean object area ratio is below 1e-2, indicating a substantial presence of small objects in the dataset. For simplicity in subsequent discussions, we define objects with an area ratio less than 1% as ‘small objects’.
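The area-ratio computation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are our own, boxes are assumed to be `(x1, y1, x2, y2)` pixel coordinates, and the frame size of 1920 by 1200 pixels is an assumption about the Argoverse-HD frames. Only the 1% threshold comes from the definition of "small objects" given in the text.

```python
# Assumed frame size in pixels (width, height); adjust to the actual data.
FRAME_W, FRAME_H = 1920, 1200

# Threshold from the text: objects covering less than 1% of the frame are "small".
SMALL_AREA_RATIO = 0.01


def area_ratio(box, frame_w=FRAME_W, frame_h=FRAME_H):
    """Ratio of the bounding-box pixel area to the total frame area."""
    x1, y1, x2, y2 = box
    width = max(0, x2 - x1)
    height = max(0, y2 - y1)
    return (width * height) / (frame_w * frame_h)


def small_object_proportion(boxes, threshold=SMALL_AREA_RATIO):
    """Fraction of boxes whose area ratio falls below the threshold."""
    if not boxes:
        return 0.0
    small = sum(1 for box in boxes if area_ratio(box) < threshold)
    return small / len(boxes)


# Hypothetical boxes, made up purely to exercise the functions.
example_boxes = [(0, 0, 100, 100), (0, 0, 960, 600), (10, 10, 60, 40)]
print(small_object_proportion(example_boxes))
```

Applying `small_object_proportion` to the annotations of each video segment would yield a per-segment proportion comparable to the last column of Tab. 10.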

Tab. 10 presents our findings on the proportion of small objects within the Argoverse-HD validation set. Despite some variability in the overall number of objects and small objects, the proportion of small objects remains relatively stable, as reflected in the low variance of this proportion (0.0026). This stability suggests that small objects are a consistent and prominent feature across various video segments, representing a persistent challenge for streaming perception.

Figure 7: Histograms of Object Area Proportions in Argoverse-HD Dataset: This figure showcases two histograms depicting the proportion of area occupied by objects relative to the entire frame, for (a) the training set and (b) the validation set of the Argoverse-HD dataset. These histograms provide insights into the spatial distribution and size variation of objects within the frames of the dataset.
| sid | # obj ↑ | # small obj | proportion |
|---|---|---|---|
| 12 | 27829 | 24033 | 86% |
| 3 | 16557 | 15937 | 96% |
| 14 | 15058 | 14260 | 95% |
| 15 | 12685 | 10229 | 81% |
| 9 | 12618 | 11216 | 89% |
| 5 | 12189 | 9509 | 78% |
| 21 | 11801 | 10259 | 87% |
| 18 | 11073 | 9856 | 89% |
| 20 | 11068 | 10203 | 92% |
| 7 | 10962 | 9707 | 89% |
| 23 | 10961 | 9839 | 90% |
| 2 | 10717 | 9700 | 91% |
| 10 | 10706 | 9001 | 84% |
| 22 | 10122 | 8846 | 87% |
| 11 | 9965 | 8976 | 90% |
| 4 | 9180 | 7989 | 87% |
| 1 | 9068 | 8153 | 90% |
| 24 | 8293 | 7830 | 94% |
| 19 | 8068 | 6552 | 81% |
| 17 | 4709 | 4230 | 90% |
| 6 | 4420 | 3708 | 84% |
| 16 | 7001 | 6508 | 93% |
| 13 | 5654 | 5251 | 93% |
| 8 | 3237 | 2449 | 76% |
| mean | 10580 | 9343 | 87.96% |
| var | | | 0.0026 |

Table 10: Distribution of Small Objects in the Argoverse-HD Validation Set: This table lists the number of objects in each video segment of the Argoverse-HD validation set, together with the number and proportion of objects whose area proportion is less than 1%. It provides a detailed view of the prevalence and distribution of smaller-sized objects across different video segments in the dataset.

A.4 Impact of Environmental Speed

In Sec. A.3, we highlighted how motion within the observer’s viewpoint can affect the perception of small objects. This observation leads us to consider that the speed of the environment could interact with the proportion of small objects.

To investigate the relationship between the environmental speed and the performance variability of streaming perception models, we categorized the validation dataset into three distinct environmental states: stop, straight, and turning. We then manually divided the dataset based on these states. In this reorganized dataset, clips whose ID begins with the digit 0 exclusively represent the stop state, while clips whose ID begins with 1 or 2 correspond to the straight and turning states, respectively.

Fig. 8 shows the performance of StreamYOLO, LongShortNet, and DAMO-StreamNet on each of these segments, together with the mean performance under each motion state. The data reveals a consistent pattern across all three models: performance is best in the stop state, lower in the straight state, and lowest in the turning state. This trend indicates an association between the state of environmental motion and fluctuations in model performance.

Consequently, based on this analysis, we infer that the speed of the environment, particularly when considering the substantial proportion of small objects and their sensitivity to environmental dynamics, emerges as the most influential environmental factor in the context of streaming perception.
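The per-state averaging described above can be sketched as follows. This is only an illustration under stated assumptions: the mapping from the first digit of a clip ID to a motion state follows the text, but the function name, the clip IDs, and the sAP values below are hypothetical placeholders chosen to exercise the code, not results from the paper.

```python
from collections import defaultdict

# First digit of the reorganized clip ID -> motion state (as described in the text).
STATE_BY_PREFIX = {"0": "stop", "1": "straight", "2": "turning"}


def mean_sap_by_state(clip_sap):
    """Group per-clip sAP values by motion state and average each group."""
    grouped = defaultdict(list)
    for clip_id, sap in clip_sap.items():
        state = STATE_BY_PREFIX[clip_id[0]]
        grouped[state].append(sap)
    return {state: sum(vals) / len(vals) for state, vals in grouped.items()}


# Hypothetical placeholder IDs and scores, NOT measurements from the paper.
placeholder_sap = {
    "001": 40.0, "002": 50.0,   # stop
    "101": 30.0, "102": 40.0,   # straight
    "201": 20.0, "202": 30.0,   # turning
}

for state, mean_sap in mean_sap_by_state(placeholder_sap).items():
    print(f"{state}: mean sAP = {mean_sap:.1f}")
```

With real per-clip scores in place of the placeholders, comparing the three means gives the per-state summary that Fig. 8 reports for each model.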

Figure 8: Performance Analysis by Environmental Speed in Validation Segments: This figure displays the performance outcomes of three different models, (a) StreamYOLO, (b) LongShortNet, and (c) DAMO-StreamNet, across various segments of the Argoverse-HD validation set, categorized by environmental speed. The charts provide a comparative view of how each model responds to different speeds in the environment, highlighting their effectiveness in varying dynamic conditions.

Appendix B More Experimental Results

B.1 Inference Time Analysis

This subsection supplements Section 4.4 of the main paper, where we previously discussed the performance of DyRoNet but did not extensively delve into its inference time characteristics. To address this, Tab. 11 presents a detailed comparison of the inference times for each independent branch used in our model. It is important to note that the inference times reported here may show variations when compared to those published by the original authors of the models. This discrepancy is primarily due to differences in the hardware platforms used and the specific configurations of the corresponding models in our experiments.

An interesting observation from the results is that there are instances where DyRoNet exhibits a slower inference time compared to either the random selection method or branch 1. This slowdown is attributed to the incorporation of the speed router in our sample routing mechanism. Despite this, it is evident from the overall results that DyRoNet, employing the router strategy, still retains real-time processing capabilities across the various branches in the model bank. Moreover, in certain scenarios, DyRoNet demonstrates even faster inference speeds than when using individual branches independently. This detailed analysis underlines the dynamic and adaptive nature of DyRoNet in balancing between inference speed and accuracy, highlighting its capability to optimize streaming perception tasks in real-time scenarios.

| Branches | branch 0 | branch 1 | random | DyRoNet |
|---|---|---|---|---|
| $\text{DAMO}_{\text{S + M}}$ | 29.26 | 33.65 | 36.61 | 33.22 |
| $\text{DAMO}_{\text{S + L}}$ | 29.26 | 36.63 | 35.12 | 39.60 |
| $\text{DAMO}_{\text{M + L}}$ | 33.65 | 36.63 | 37.30 | 37.61 |
| $\text{LSN}_{\text{S + M}}$ | 22.08 | 25.88 | 24.79 | 21.47 |
| $\text{LSN}_{\text{S + L}}$ | 22.08 | 31.24 | 21.49 | 30.48 |
| $\text{LSN}_{\text{M + L}}$ | 25.88 | 31.24 | 24.75 | 29.05 |
| $\text{sYOLO}_{\text{S + M}}$ | 18.76 | 23.01 | 39.16 | 26.25 |
| $\text{sYOLO}_{\text{S + L}}$ | 18.76 | 27.85 | 24.04 | 29.35 |
| $\text{sYOLO}_{\text{M + L}}$ | 23.01 | 27.85 | 24.69 | 23.51 |

Table 11: In-Depth Analysis of DyRoNet’s Inference Time: This table presents a detailed comparison of inference times between the random selection method and DyRoNet. For ease of analysis, the optimal values in each comparison are highlighted in green font. This highlighting assists in quickly identifying which method (random or DyRoNet) achieves superior performance in terms of inference speed under various conditions.

Appendix C More Details of DyRoNet

| Model | Scale | # of params |
|---|---|---|
| StreamYOLO | S | 9,137,319 |
| StreamYOLO | M | 25,717,863 |
| StreamYOLO | L | 54,914,343 |
| LongShortNet | S | 9,282,103 |
| LongShortNet | M | 25,847,783 |
| LongShortNet | L | 55,376,515 |
| DAMO-StreamNet | S | 18,656,357 |
| DAMO-StreamNet | M | 50,129,333 |
| DAMO-StreamNet | L | 94,156,945 |

Table 12: Parameter Count of Selected Pre-trained Models: This table lists the number of parameters for each pre-trained model chosen for our analysis. It provides a quantitative overview of the complexity and size of the models, facilitating a comparison of their computational requirements.

C.1 Pre-trained Model Selection

As outlined in the main paper, our implementation of DyRoNet incorporates three existing models as branches within the Model Bank $\mathcal{P}$: StreamYOLO Yang et al. [2022], LongShortNet Li et al. [2023], and DAMO-StreamNet He et al. [2023]. These models were selected due to their specialized features and proven effectiveness in streaming perception tasks. StreamYOLO is unique for its two additional pre-trained weight variants, each tailored for different streaming processing speeds. This feature allows for adaptable performance depending on the speed requirements of the streaming task. In contrast, LongShortNet and DAMO-StreamNet are equipped with pre-trained weights optimized for high-resolution image processing, making them suitable for scenarios where image clarity is paramount.

To ensure a diverse and versatile range of options within the Model Bank, our implementation of DyRoNet selectively utilizes the Small (S), Medium (M), and Large (L) variants of the pre-trained weights from each model. This choice enables a balanced mix of processing speeds and resolution handling capabilities, catering to a wide range of streaming perception scenarios. The specific details regarding the number of parameters for these pre-trained models can be found in Table 12, which provides a comparative overview to help in understanding the computational complexity for different tasks.

C.2 Setting of Hyperparameters

For all our experiments, we maintained consistent training hyperparameters to ensure comparability and reproducibility of results. The experiments were executed on four RTX 3090 GPUs. Considering the need for selecting the optimal branch model for each sample during the routing process, we established a batch size of 4, effectively allocating one sample to each GPU for parallel computation.

In alignment with the configuration used in StreamYOLO, we employed Stochastic Gradient Descent (SGD) as our optimization technique. The learning rate was set to $0.001 \times \text{BatchSize} / 64$, adapting to the batch size proportionally. Additionally, we incorporated a cosine annealing schedule for the learning rate, integrated with a warm-up phase lasting one epoch to stabilize the initial training process.
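To make the learning-rate setting concrete, the sketch below evaluates the scaling rule for a batch size of 4 and pairs it with a cosine annealing schedule preceded by warm-up. Only the scaling rule, the cosine annealing, and the one-epoch warm-up come from the text; the linear shape of the warm-up, the minimum learning rate of zero, the step counts, and the function names are assumptions made for illustration.

```python
import math


def scaled_lr(batch_size, lr_per_64=0.001):
    """Learning rate rule from the text: 0.001 * BatchSize / 64."""
    return lr_per_64 * batch_size / 64


def lr_at_step(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Warm-up followed by cosine annealing.

    Assumptions: the warm-up is linear and the schedule decays to min_lr;
    the text does not specify either detail.
    """
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warm-up step.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


peak = scaled_lr(batch_size=4)  # 0.001 * 4 / 64 = 6.25e-05
print(f"peak learning rate: {peak:.2e}")

# Hypothetical schedule length: 10 epochs of 100 steps, one warm-up epoch.
steps_per_epoch, epochs = 100, 10
total = steps_per_epoch * epochs
for step in (0, steps_per_epoch - 1, total // 2, total - 1):
    print(step, f"{lr_at_step(step, total, steps_per_epoch, peak):.3e}")
```

Under these assumptions the learning rate rises to 6.25e-05 over the first epoch and then decays smoothly toward zero over the remaining steps.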

Regarding data preprocessing, we ensured uniformity by resizing all input frames to $600 \times 960$ pixels. This standardization was crucial for maintaining consistency across different datasets and ensuring that our model could generalize well across various input dimensions.

Acknowledgments

Zhi-Qi Cheng’s research in this project was supported by the US Department of Transportation, Office of the Assistant Secretary for Research and Technology, under the University Transportation Center Program (Federal Grant Number 69A3551747111), and Intel and IBM Fellowships.

Contribution Statement

Xiang Huang, Jun-Yan He, and Zhi-Qi Cheng worked together on the research idea and were fully involved in the experimental work. Jun-Yan He and Zhi-Qi Cheng made revisions to the manuscript, offering valuable insights and recommendations for the experimental procedures. The manuscript was then reviewed by Chenyang Li, Wangmeng Xiang, Baigui Sun, and Xiao Wu. Zhi-Qi Cheng served as the corresponding author and led the entire project.