DyRoNet: Dynamic Routing and Low-Rank Adapters for Autonomous Driving Streaming Perception

Xiang Huang$^{1,2}$ (internship at CMU), Zhi-Qi Cheng$^{3}$ (corresponding author), Jun-Yan He$^{4}$, Chenyang Li$^{4}$, Wangmeng Xiang$^{4,5}$, Baigui Sun$^{4}$, Xiao Wu$^{1,2}$

$^{1}$Southwest Jiaotong University
$^{2}$Engineering Research Center of Sustainable Urban Intelligent Transportation, China
$^{3}$Language Technologies Institute, Carnegie Mellon University
$^{4}$Alibaba Group
$^{5}$The Hong Kong Polytechnic University

[email protected], [email protected], {junyanhe1989, marquezxm, wuxiaohk}@gmail.com,
[email protected], [email protected]

Abstract

The advancement of autonomous driving systems hinges on achieving low-latency, high-accuracy perception. To address this need, this paper introduces the Dynamic Routing Network (DyRoNet), a low-rank enhanced dynamic routing framework designed for streaming perception in autonomous driving systems. DyRoNet integrates a suite of pre-trained branch networks, each fine-tuned to function under distinct environmental conditions. At its core, the framework offers a speed router module that assesses input data and routes it to the most suitable branch for processing. This approach addresses the inherent limitations of conventional models in adapting to diverse driving conditions while balancing performance and efficiency. Extensive experimental evaluations demonstrate the adaptability of DyRoNet to diverse branch selection strategies, resulting in significant performance enhancements across different scenarios. This work establishes a new benchmark for streaming perception and provides valuable engineering insights for future work. Project: https://tastevision.github.io/DyRoNet/

1 Introduction

In autonomous driving systems, it is crucial to achieve low-latency and high-precision perception. Traditional object detection algorithms Zou et al. (2023), while effective in various contexts, often confront the challenge of latency due to inherent computational delays. This lag between algorithmic processing and real-world states can lead to notable discrepancies between predicted and actual object locations. Such latency issues have been extensively reported and are known to significantly impact the decision-making process in autonomous driving systems Chen et al. (2023).

To address these challenges, the concept of streaming perception has been introduced Li et al. (2020). This perception task aims to predict “future” results by accounting for the delays incurred during the frame processing stage. Unlike traditional methods that primarily focus on detection at a given moment, streaming perception anticipates future environmental states, aligning perceptual outputs more closely with real-time dynamics. This paradigm is key to closing the gap between real-time processing and real-world changes, thereby enhancing the safety and reliability of autonomous driving systems Muhammad et al. (2020).

Figure 1: Illustration of DyRoNet’s dynamic selection mechanism in streaming perception. This diagram showcases DyRoNet’s capability to adaptively choose the most suitable perception strategy, contrasting with the static approach of traditional methods in complex environments [Best viewed in color and at an expanded scale].

Although existing streaming approaches are promising, they still face difficulties in real-world scenarios. These difficulties primarily stem from the diverse and unpredictable nature of driving environments. Factors such as camera motion, weather conditions, lighting variations, and the presence of small objects seriously impact the performance of perception models, leading to fluctuations that challenge their robustness and reliability (see Sec. 3.1). This complexity underscores the limitations of a single, uniform model, which often struggles to adapt to the varied demands of different driving conditions Guo et al. (2019). In general, the challenges of streaming perception mainly include:

(1) Diverse Scenario Distribution: Autonomous driving environments are inherently complex and dynamic, showing a myriad of scenarios that a single perception model may not adequately address (see Fig. 1). The need to customize perception algorithms to specific environmental conditions, while ensuring that these models operate cohesively, poses a significant challenge. As discussed in Sec. 3.1, adapting models to various scenarios without compromising their core functionality is a crucial aspect of streaming perception.

(2) Performance-Efficiency Balance: Integrating both large- and small-scale models is essential to handle the varying complexities encountered in different driving scenes. Large models, while potentially more accurate, may suffer from increased latency, whereas smaller models may offer faster inference at the cost of reduced accuracy. Balancing performance and efficiency therefore becomes a challenging task. In Sec. 3.1, we explore strategies for optimizing this balance and how different model architectures can be effectively utilized to enhance streaming perception.

These challenges highlight the need for adaptive streaming perception. As we study in Sec. 3.1, addressing the diverse scenario distribution and achieving an optimal balance between performance and efficiency are key to advancing the state of the art in autonomous driving. To address the challenges presented by real-world streaming perception, we introduce DyRoNet, a low-rank enhanced dynamic routing framework crafted for the requirements of streaming perception in autonomous driving systems. It encapsulates a suite of pre-trained branch networks, each fine-tuned to function under distinct environmental conditions. A key component of DyRoNet is the speed router module, which assesses input data and routes it to the most appropriate branch, as detailed in Sec. 3.2. In summary, our contributions are as follows:

• We emphasize the impact of environmental speed as a key determinant of streaming perception. Through analysis of various environmental factors, our research highlights the need for adaptive perception that responds to dynamic conditions.

• Building on a variety of existing streaming perception techniques, DyRoNet introduces the speed router as its central component. This component dynamically determines the best route for handling each input, ensuring efficiency and accuracy in perception. This dynamic route-choosing mechanism demonstrates the framework's adaptability and versatility.

• Extensive experimental evaluations demonstrate that DyRoNet is capable of adapting to diverse branch selection strategies, resulting in a substantial enhancement of performance across various branch structures. This not only validates the framework’s wide-ranging applicability but also confirms its effectiveness in handling different real-world scenarios.

In summary, by addressing the challenges of environmental adaptability and dynamic branch selection, DyRoNet sets new benchmarks for low-latency, high-accuracy perception in autonomous driving.

2 Related Work

This section revisits developments in streaming perception and dynamic neural networks, highlighting differences from our proposed DyRoNet framework. While existing methods have made progress, limitations persist in addressing real-world autonomous driving complexity.

2.1 Streaming Perception

Existing streaming perception methods fall into three main categories. (1) Initial methods focused on single-frame detection, with models like YOLOv5 Jocher et al. (2021) and YOLOX Ge et al. (2021) achieving real-time performance. However, because they do not capture motion trends, they struggle in dynamic scenarios. (2) More recent approaches incorporate current and historical frames. StreamYOLO Yang et al. (2022) builds on YOLOX with dual-flow fusion, LongShortNet Li et al. (2023) uses longer histories and diverse fusion, and DAMO-StreamNet He et al. (2023) adds asymmetric distillation and deformable convolutions to improve large object perception. (3) Recognizing the limitations of single models, current methods explore dynamic multi-model systems. One approach Ghosh et al. (2021) adapts models to environments via reinforcement learning. DaDe Jo et al. (2022) extends StreamYOLO by calculating delays to determine frame steps. A later version Huang and Chen (2023) added multi-branch prediction heads. Beyond 2D detection, streaming perception expands into optical flow, tracking, and 3D detection, with innovations in metrics and benchmarks Wang et al. (2023c); Sela et al. (2022); Wang et al. (2023b). Distinct from these existing approaches, our proposed method, DyRoNet, introduces a low-rank enhanced dynamic routing mechanism specifically designed for streaming perception. DyRoNet integrates a suite of advanced branch networks, each fine-tuned for specific environmental conditions. Its key innovation lies in the speed router module, which not only routes input data efficiently but also dynamically adapts to the diverse and unpredictable nature of real-world driving scenarios.

2.2 Dynamic Neural Networks

Dynamic Neural Networks (DNNs) feature adaptive network selection, outperforming static models in efficiency and performance (Han et al. (2021); Lan et al. (2023); Zhang et al. (2023)). The existing research primarily focuses on structural design for core deep learning tasks like image classification (Huang et al. (2018); Wang et al. (2020, 2018)). DNNs follow two approaches: (1) Multi-branch models (Bejnordi et al. (2019); Cai et al. (2021); Shazeer et al. (2017); Wang et al. (2023a); Qiao et al. (2022)) rely on a lightweight router assessing inputs to direct them to appropriate branches, enabling tailored computation. (2) By generating new weights based on inputs (Yang et al. (2019); Chen et al. (2020); Su et al. (2019); Zhu et al. (2019)), these models dynamically alter computations to match diverse needs. DNN applications expand beyond conventional tasks. In object detection, DynamicDet (Lin et al. (2023)) categorizes inputs and processes them through distinct branches. This illustrates DNNs’ broader applicability and efficiency, promising contributions particularly for complex, dynamic environments.

3 Proposed Method

This section outlines the framework of our proposed DyRoNet. Beginning with its underlying motivation and the critical factors driving its design, we subsequently provide an overview of its architecture and training process.

3.1 Motivation for DyRoNet

Autonomous driving faces variability from weather, scene complexity, and vehicle velocity. By strategically analyzing key factors and routing logic, this section details the rationale behind the proposed DyRoNet.

Analysis of Influential Factors. Our statistical analysis of the Argoverse-HD dataset Li et al. (2020) underscores the profound influence of environmental dynamics on the effectiveness of streaming perception. While weather inconsistently impacts accuracy, suggesting the presence of other influential factors (see Appendix A.1), fluctuations in the object count show limited correlation with performance degradation (see Appendix A.2). Conversely, the presence of small objects across various scenes poses a significant challenge for detection, especially under varying motion states (see Appendix A.3). Notably, disparities in performance are most pronounced across different environmental motion states (see Appendix A.4), thereby motivating the need for a dynamic, velocity-aware routing mechanism in DyRoNet.

Rationale for Dynamic Routing. Analysis reveals that StreamYOLO’s reliance on a single historical frame falters at high velocities, in contrast to multi-frame models, highlighting a clear connection between velocity and detection performance (see Tab. 1). Dynamic adaptation of frame history, based on vehicular speed changes, enables DyRoNet to strike a balance between accuracy and latency (see Sec. 4.3). Through first-order differences, the system efficiently switches models to align with environmental motions. Specifically, the dynamic routing is designed to select the optimal architecture based on the vehicle’s speed profile, ensuring precision at lower velocities for detailed perception and efficiency at higher speeds for swift response. Such adaptable routing, informed by comprehensive speed analysis, positions DyRoNet as a robust solution for reliable perception across diverse autonomous driving scenarios. Next, we introduce DyRoNet in detail.

3.2 Architecture of DyRoNet

Overview of DyRoNet. As depicted in Fig. 2, DyRoNet adopts a multi-branch structure. Each branch within the DyRoNet framework functions as an independent streaming perception model, capable of processing both the current and historical frames. This joint processing of current and historical frames is central to DyRoNet’s capability, facilitating a nuanced understanding of temporal dynamics. Such a design is key to balancing latency and accuracy, both crucial for real-time autonomous driving.

Mathematically, at the core of DyRoNet lies the processing of a frame sequence, $\mathcal{S}=\{I_{t},\cdots,I_{t-N\delta t}\}$, where $N$ indicates the number of frames and $\delta t$ the interval between successive frames. The process of the framework is formalized as:

$$\mathcal{T}=\mathcal{F}(\mathcal{S},\mathcal{P},\mathcal{W}),$$

where $\mathcal{P}=\{P_{0},\cdots,P_{K-1}\}$ denotes a collection of streaming perception models, with each $P_{i}$ denoting an individual model within this suite. The architecture is further enhanced by incorporating a feature extractor $\mathcal{G}_{i}$ and a perception head $\mathcal{H}_{i}$ for each model. The Router Network, $\mathcal{R}$ , is instrumental in selecting the most suitable streaming perception model for each specific scenario.

Correspondingly, the weights of DyRoNet are denoted by $\mathcal{W}=\{W^{d},W^{l},W^{r}\}$ , where $W^{d}$ indicates the weights of the streaming perception model, $W^{l}$ relates to the Low-Rank Adaptation (LoRA) weights within each model, and $W^{r}$ pertains to the Router Network. The culmination of this process is the final output, $\mathcal{T}$ , a compilation of feature maps. These maps can be further decoded through $Decode(\mathcal{T})$ , revealing essential details like objects, categories, and locations. Below we introduce each module in detail.

Router Network. The Router Network in DyRoNet plays a crucial role in understanding and classifying the dynamics of the environment. This module is designed for both environmental classification and branch decision-making. To effectively and rapidly capture environmental speed, frame differences are employed as the input to the Router Network. As shown in Fig. 3, frame differences exhibit a high discriminative advantage for different environmental speeds.

Specifically, for frames at times $t$ and $t-1$ , represented as $I_{t}$ and $I_{t-1}$ respectively, the frame difference is computed as $\Delta I_{t}=I_{t}-I_{t-1}$ . The architecture of the Router Network, $\mathcal{R}$ , is simple yet efficient. It consists of a single convolutional layer followed by a linear layer. The network’s output, denoted as $f^{r}\in\mathbb{R}^{K}$ , captures the essence of the environmental dynamics. Based on this output, the index $\sigma$ of the optimal branch for processing the current input frame $I_{t}$ is determined through the following equation:

$$\sigma=\mathop{\arg\max}_{k}\,\mathcal{R}(\Delta I_{t};W^{r}),\quad\sigma\in\{0,\cdots,K-1\}, \tag{1}$$

where $\sigma$ is the index of the branch deemed most suitable for the current environmental context. Once $\sigma$ is determined, the input frame $I_{t}$ is automatically routed to the corresponding branch by a dispatcher.

In particular, this strategy of using frame differences to gauge environmental speed is efficient. It offers a faster alternative to traditional methods such as optical flow fields. Moreover, it focuses on frame-level variations rather than the speed of individual objects, providing a more generalized representation of environmental dynamics. The sparsity of $\Delta I_{t}$ also contributes to the robustness of this method, reducing computational complexity and making the Router Network’s operations nearly negligible in the context of the overall model’s performance.
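The routing step of Eq. (1) can be illustrated with a minimal, dependency-free sketch. It assumes single-channel frames stored as nested lists, a single 3x3 kernel, and hand-picked weights; these values, and all function names, are hypothetical and chosen only so the example is easy to follow. It is not the authors' implementation.

```python
def frame_difference(curr, prev):
    """Per-pixel difference Delta I_t = I_t - I_{t-1} for 2-D grayscale frames."""
    return [[c - p for c, p in zip(row_c, row_p)]
            for row_c, row_p in zip(curr, prev)]


def conv2d_valid(image, kernel):
    """Single-channel 'valid' cross-correlation, as computed by a conv layer."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def router_logits(delta, kernel, weight, bias):
    """One conv layer, flatten, then one linear layer giving K scores (f^r)."""
    feat = [v for row in conv2d_valid(delta, kernel) for v in row]
    return [sum(w * x for w, x in zip(w_row, feat)) + b
            for w_row, b in zip(weight, bias)]


def select_branch(logits):
    """Eq. (1): index of the highest-scoring branch."""
    return max(range(len(logits)), key=lambda k: logits[k])


# Hypothetical toy parameters: K = 2 branches, 4x4 frames, 3x3 kernel.
KERNEL = [[1.0] * 3 for _ in range(3)]
WEIGHT = [[0.0] * 4, [0.25] * 4]   # branch 1 responds to large frame changes
BIAS = [0.5, 0.0]                  # branch 0 is preferred for static scenes

still = [[0.0] * 4 for _ in range(4)]
moved = [[1.0] * 4 for _ in range(4)]

static_choice = select_branch(
    router_logits(frame_difference(still, still), KERNEL, WEIGHT, BIAS))
moving_choice = select_branch(
    router_logits(frame_difference(moved, still), KERNEL, WEIGHT, BIAS))
```

With these toy weights, an unchanged scene is sent to branch 0 and a scene with a large frame difference to branch 1; in the actual framework the kernel and linear weights $W^{r}$ are learned.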

Model Bank & Dispatcher. The core of the DyRoNet framework is its model bank, which consists of an array of streaming perception models, denoted as $\mathcal{P}=\{{P}_{0},\cdots,{P}_{K-1}\}$. The selection of the most suitable model for processing a given input is managed by the Router Network. This process is formalized as $P_{\sigma}=\text{Disp}(\mathcal{R}(\Delta I_{t}),\mathcal{P})$, where Disp acts as a dispatcher, facilitating the dynamic selection of models from $\mathcal{P}$ based on the input. The operational flow of DyRoNet can be mathematically defined as:

$$\begin{aligned}\mathcal{T}&=\mathcal{F}(\mathcal{S},\mathcal{P},\mathcal{W})\\&=\text{Disp}(\mathcal{R}(\Delta I_{t}),\mathcal{P})(I_{t};W^{d}_{\sigma},W^{l}_{\sigma}),\end{aligned}$$

where $\mathcal{R}$ symbolizes the Router Network, and $\Delta I_{t}$ refers to the frame difference, a key input for model selection. The weights $W^{d}_{\sigma}$ and $W^{l}_{\sigma}$ correspond to the selected streaming perception model and its Low-Rank Adaptation (LoRA) parameters, respectively.
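As a rough illustration of the dispatcher, the sketch below treats each branch as a plain callable and returns the one with the highest router score, so that only the selected branch is executed. The branch functions and their string outputs are hypothetical placeholders for real streaming perception models.

```python
from typing import Callable, List, Sequence

# Hypothetical branch signature: a list of frames in, a result out.
Branch = Callable[[List[str]], str]


def dispatch(router_scores: Sequence[float], model_bank: Sequence[Branch]) -> Branch:
    """Disp(R(Delta I_t), P): return the branch P_sigma with the top router score."""
    if len(router_scores) != len(model_bank):
        raise ValueError("router must emit exactly one score per branch")
    sigma = max(range(len(model_bank)), key=lambda k: router_scores[k])
    return model_bank[sigma]


def small_branch(frames: List[str]) -> str:
    # Placeholder for a lightweight model such as an S-scale detector.
    return f"small:{len(frames)}"


def large_branch(frames: List[str]) -> str:
    # Placeholder for a heavier model such as an L-scale detector.
    return f"large:{len(frames)}"


def run(frames: List[str], router_scores: Sequence[float],
        bank: Sequence[Branch]) -> str:
    """Route the input to one branch and run only that branch."""
    return dispatch(router_scores, bank)(frames)


BANK = [small_branch, large_branch]
result = run(["I_t", "I_t-1"], [0.2, 0.9], BANK)
```

Because the unselected branches are never called, the per-frame cost is that of the router plus one branch, which is the property the model bank design relies on.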

Note that the versatility of DyRoNet is further highlighted by its compatibility with a wide range of Streaming Perception models, even ones that rely solely on detectors Ge et al. (2021). To demonstrate the efficacy of DyRoNet, it has been evaluated using three contemporary streaming perception models: StreamYOLO Yang et al. (2022), LongShortNet Li et al. (2023), and DAMO-StreamNet He et al. (2023) (see Sec. 4.3). This Model Bank & Dispatcher strategy illustrates the adaptability and robustness of DyRoNet across different streaming perception scenarios.

Low-Rank Adaptation. A key challenge arises when fully fine-tuning individual branches, especially under the direction of Router Network. This strategy can lead to biases in the distribution of training data and inefficiencies in the learning process. Specifically, lighter branches may become predisposed to simpler cases, while more complex ones might be tailored to handle intricate scenarios, thereby heightening the risk of overfitting. Our experimental results, detailed in Sec. 4.3, support this observation.

To address these challenges, we incorporate the Low-Rank Adapter Hu et al. (2021) into our streaming perception models. Within each model $P_{i}$, initially pre-trained on a dataset, the key components are the convolution kernel and bias matrices, symbolized as $W^{d}_{i}$. The rank of the Low-Rank Adaptation (LoRA) module is defined as $r$, a value significantly smaller than the dimensionality of $W^{d}_{i}$, to ensure efficient adaptation. The update to the weight matrix adheres to a low-rank decomposition form, represented as $W^{d}_{i}+\Delta W=W^{d}_{i}+BA$. Here, $B$ is a matrix in $\mathbb{R}^{d\times r}$ and $A$ is in $\mathbb{R}^{r\times k}$, with the rank $r$ much smaller than $d$. This adaptation strategy allows the original weights $W^{d}_{i}$ to remain fixed, while the low-rank components $BA$ are trained and adjusted. The adaptation process is executed through the following projection:

$$W^{d}_{i}x+\Delta Wx=W^{d}_{i}x+W^{l}_{i}x, \tag{2}$$

where $x$ represents the input image or feature map, and $\Delta W=W^{l}_{i}=BA$. The matrices $A$ and $B$ start from an initialized state and are fine-tuned during the adaptation process. This approach maintains the general applicability of the model by fixing $W^{d}_{i}$, while also enabling specialization within specific sub-domains, as determined by the Router Network.

Particularly, in DyRoNet, we employ a rank $r$ of $32$ for the LoRA module, though this can be adjusted based on specific requirements of the scenarios in question. This low-rank adaptation mechanism not only enhances the flexibility of the DyRoNet framework but also significantly mitigates the risk of overfitting, ensuring that each branch remains efficient and effective in its designated role.
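To make Eq. (2) concrete, the following sketch applies a low-rank update to a small dense matrix. It is a simplification under stated assumptions: the paper adapts convolution kernels, whereas this example uses a plain matrix-vector product, and starting $B$ at zero follows the convention of Hu et al. (2021) rather than anything stated in this section. All names and numbers are illustrative.

```python
def matvec(m, v):
    """Matrix-vector product for nested-list matrices."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def lora_forward(w_frozen, b_mat, a_mat, x):
    """Eq. (2): W x + (B A) x, with W frozen and only B and A trainable."""
    base = matvec(w_frozen, x)
    # Apply A first, then B, so the full d x k matrix BA is never built.
    update = matvec(b_mat, matvec(a_mat, x))
    return [p + q for p, q in zip(base, update)]


def param_counts(d, k, r):
    """Trainable parameters: full fine-tuning (d*k) versus LoRA (r*(d+k))."""
    return d * k, r * (d + k)


# Toy shapes: d = 2, k = 3, r = 1.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 1.0, 1.0]]        # r x k
B_ZERO = [[0.0], [0.0]]      # d x r, assumed zero at the start of adaptation
B_TUNED = [[0.5], [-1.0]]    # hypothetical values after some training
x = [1.0, 2.0, 3.0]

before = lora_forward(W, B_ZERO, A, x)
after = lora_forward(W, B_TUNED, A, x)
full, low_rank = param_counts(1024, 1024, 32)
```

With $B=0$ the adapted branch reproduces the frozen model exactly, and for a hypothetical $1024\times 1024$ layer at $r=32$ the adapter trains 65,536 parameters instead of 1,048,576.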

3.3 Training Details of DyRoNet

The training process of DyRoNet focuses on two primary goals: (1) improving the performance of individual branches within the streaming perception model and (2) achieving an optimal balance between accuracy and computational efficiency. This dual-objective framework is represented by the overall loss function:

$$\mathcal{L}=\mathcal{L}^{sp}+\mathcal{L}^{E^{2}}, \tag{3}$$

where $\mathcal{L}^{sp}$ represents the streaming perception loss, and $\mathcal{L}^{E^{2}}$ denotes the effective and efficient (E$^{2}$) loss, which supervises branch selection.

Streaming Perception (SP) Loss. Each branch in DyRoNet is fine-tuned using its original loss function to maintain effectiveness, while the router network is trained to select the optimal branch under efficiency supervision. Let $\mathcal{T}_{i}=\{F_{i}^{cls},F_{i}^{reg},F_{i}^{obj}\}$ denote the logits produced by the $i$-th branch and $\mathcal{T}_{gt}=\{F_{gt}^{cls},F_{gt}^{reg},F_{gt}^{obj}\}$ the corresponding ground truth, where $F_{\cdot}^{cls}$, $F_{\cdot}^{reg}$, and $F_{\cdot}^{obj}$ are the classification, regression, and objectness logits, respectively. The streaming perception loss for each branch, $\mathcal{L}_{i}^{sp}$, is defined as follows:

$$\begin{aligned}\mathcal{L}_{i}^{sp}(\mathcal{T}_{i},\mathcal{T}_{gt})=\;&\mathcal{L}_{cls}(F_{i}^{cls},F_{gt}^{cls})+\mathcal{L}_{obj}(F_{i}^{obj},F_{gt}^{obj})\\&+\mathcal{L}_{reg}(F_{i}^{reg},F_{gt}^{reg}),\end{aligned} \tag{4}$$

where $\mathcal{L}_{cls}(\cdot)$ and $\mathcal{L}_{obj}(\cdot)$ are defined as Mean Square Error (MSE) loss functions, while $\mathcal{L}_{reg}(\cdot)$ is represented by the Generalized Intersection over Union (GIoU) loss.
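The three terms of Eq. (4) can be sketched for a single predicted box as follows. This is a simplified illustration: boxes are assumed to be axis-aligned `(x1, y1, x2, y2)` tuples with positive area, the GIoU loss is taken as $1-\mathrm{GIoU}$, and label assignment and averaging over many boxes are omitted. Function names are hypothetical.

```python
def giou(box_a, box_b):
    """Generalized IoU for (x1, y1, x2, y2) boxes with x1 < x2 and y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull


def giou_loss(pred_box, gt_box):
    """Regression term, assumed here to be 1 - GIoU."""
    return 1.0 - giou(pred_box, gt_box)


def mse(pred, target):
    """Mean squared error, used for the classification and objectness terms."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def sp_loss(cls, cls_gt, obj, obj_gt, box, box_gt):
    """Eq. (4) for one box: L_cls + L_obj + L_reg."""
    return mse(cls, cls_gt) + mse(obj, obj_gt) + giou_loss(box, box_gt)


perfect = sp_loss([1.0, 0.0], [1.0, 0.0], [1.0], [1.0],
                  (0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0))
disjoint = giou_loss((0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0))
```

A perfect prediction gives zero loss, while two non-overlapping boxes still receive a finite, distance-aware penalty (here $4/3$), which is the reason GIoU is preferred over plain IoU for regression.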

Effective and Efficient (E$^{2}$) Loss. During training, the streaming perception loss values from all branches are compiled into a vector $v^{sp}\in\mathbb{R}^{K}$, and the inference time costs are aggregated into $v^{time}\in\mathbb{R}^{K}$, where $K$ is the total number of branches in DyRoNet. To reduce the influence of hardware discrepancies, the inference times are normalized with the softmax function, $\hat{v}^{time}=\mathrm{softmax}(v^{time})$. The representation for effective and efficient (E$^{2}$) decision-making is defined as:

$$f^{E^{2}}=\mathcal{O}_{K}\big(\mathop{\arg\min}_{k}(\mathrm{softmax}(v^{time})\odot v^{sp})\big), \tag{5}$$

where $\odot$ is the element-wise product, so the $\arg\min$ runs over the $K$ time-weighted branch losses, and $\mathcal{O}_{K}$ denotes one-hot encoding, producing a boolean vector of length $K$ with the value $1$ at the index of the estimated optimal branch at that moment. The E$^{2}$ loss is then formulated as:

$$\mathcal{L}^{E^{2}}=\mathrm{KL}(f^{E^{2}},f^{r}), \tag{6}$$

where $f^{r}=\mathcal{R}(\Delta I_{t})$ is the Router Network output and $\mathrm{KL}$ is the Kullback-Leibler divergence, used to pull the router's distribution toward the E$^{2}$ target.
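A small worked example may help in reading Eqs. (5) and (6). The sketch below assumes that the product inside the $\arg\min$ is element-wise, that the router output is softmax-normalized before the KL term, and that the divergence is taken from the one-hot target to the router distribution. These are interpretive assumptions, and the latency and loss values are made up for illustration.

```python
import math


def softmax(v):
    """Numerically stable softmax over a list of floats."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def e2_target(v_sp, v_time):
    """Eq. (5): one-hot vector at argmin_k softmax(v_time)_k * v_sp_k."""
    cost = [t * l for t, l in zip(softmax(v_time), v_sp)]
    best = min(range(len(cost)), key=lambda k: cost[k])
    return [1.0 if k == best else 0.0 for k in range(len(cost))]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; entries with p_k = 0 add nothing."""
    return sum(pk * math.log(pk / max(qk, eps))
               for pk, qk in zip(p, q) if pk > 0)


def e2_loss(v_sp, v_time, router_logits):
    """Eq. (6), assuming the router logits are softmax-normalized first."""
    return kl_divergence(e2_target(v_sp, v_time), softmax(router_logits))


# Hypothetical K = 2 bank: a fast branch (14.2 ms) and a slow one (18.2 ms).
times = [14.2, 18.2]
close_losses = [2.0, 1.0]    # slow branch only slightly better
wide_losses = [5.0, 0.05]    # slow branch far better

prefer_fast = e2_target(close_losses, times)
prefer_slow = e2_target(wide_losses, times)
undecided_router_loss = e2_loss(close_losses, times, [0.0, 0.0])
```

In this toy setting the fast branch is the target unless the slow branch's loss is much lower, and a router that is undecided between two branches incurs a loss of $\ln 2$. Note that a softmax over raw millisecond values is quite peaked, so the time term dominates unless the loss gap is large.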

Overall, training DyRoNet involves balancing the SP loss, which ensures the efficacy of each branch, against the E$^{2}$ loss, which optimizes efficiency. The primary objective is a model that not only delivers high accuracy in perception tasks but also operates within acceptable latency constraints, a critical requirement for real-time applications. This balanced approach enables DyRoNet to adapt dynamically to varying computational resources and environmental conditions, thereby maintaining strong performance in diverse streaming perception scenarios.

4 Experiments

Methods	Latency (ms)	sAP $\uparrow$	sAP ${}_{50}$ $\uparrow$	sAP ${}_{75}$ $\uparrow$	sAP ${}_{s}$ $\uparrow$	sAP ${}_{m}$ $\uparrow$	sAP ${}_{l}$ $\uparrow$
Non-real-time detector-based methods
Adaptive Streamer Ghosh et al. (2021)	-	21.3	37.3	21.1	4.4	18.7	47.1
Streamer (S=600) Li et al. (2020)	-	20.4	35.6	20.8	3.6	18.0	47.2
Streamer (S=900) Li et al. (2020)	-	18.2	35.3	16.8	4.7	14.4	34.6
Streamer+AdaScale Ghosh et al. (2021)	-	13.8	23.4	14.2	0.2	9.0	39.9
Real-time detector-based methods
DAMO-StreamNet-L He et al. (2023)	26.6	37.8	59.1	38.6	16.1	39.0	64.6
LongShortNet-L Li et al. (2023)	20.1	37.1	57.8	37.7	15.2	37.3	63.8
StreamYOLO-L Yang et al. (2022)	18.2	36.1	57.6	35.6	13.8	37.1	63.3
DAMO-StreamNet-M He et al. (2023)	24.3	35.7	56.7	35.9	14.5	36.3	63.3
LongShortNet-M Li et al. (2023)	17.5	34.1	54.8	34.6	13.3	35.3	58.1
StreamYOLO-M Yang et al. (2022)	18.2	32.9	54.0	32.5	12.4	34.8	58.1
DAMO-StreamNet-S He et al. (2023)	21.3	31.8	52.3	31.0	11.4	32.9	58.7
LongShortNet-S Li et al. (2023)	14.6	29.8	50.4	29.5	11.0	30.6	52.8
StreamYOLO-S Yang et al. (2022)	14.2	28.8	50.3	27.6	9.7	30.7	53.1
DyRoNet
DyRoNet ( $\text{DAMO}_{\text{M + L}}$ )	37.61	37.8 (+2.1)	58.8 (+2.1)	38.8 (+2.9)	16.1 (+1.6)	39.0 (+2.7)	64.0 (+0.7)
DyRoNet ( $\text{LSN}_{\text{M + L}}$ )	29.05	36.9 (+2.8)	58.2 (+3.4)	37.4 (+2.8)	14.9 (+1.6)	37.5 (+2.2)	63.3 (+5.2)
DyRoNet ( $\text{sYOLO}_{\text{M + L}}$ )	23.51	35.0 (+2.1)	55.7 (+1.7)	35.5 (+3.0)	13.7 (+1.3)	36.2 (+1.4)	61.1 (+3.0)

Table 1: Comparison of DyRoNet with SOTA methods. The optimal values are highlighted in green font, and online evaluation latencies that reach real time are shown in red font.

4.1 Dataset and Metric

Dataset. For the evaluation of our methods, we use the Argoverse-HD dataset Li et al. (2020), specifically designed for streaming perception in autonomous driving scenarios. This dataset comprises high-resolution RGB images captured from urban city street drives, offering a realistic representation of diverse driving conditions. The dataset is structured into two main segments: a training set consisting of 65 video clips and a validation set comprising 24 video clips. Each video clip spans, on average, over 600 frames, giving a training set of approximately 39k frames and a validation set of around 15k frames. Notably, the Argoverse-HD dataset provides high-frame-rate (30fps) 2D object detection annotations, ensuring accuracy and reliability without relying on interpolated data.

Evaluation Metric. We adopt the streaming Average Precision (sAP) as the primary metric for performance evaluation. The sAP metric, widely recognized for its effectiveness in streaming perception tasks Li et al. (2020), offers a comprehensive assessment by calculating the mean Average Precision (mAP) across various Intersection over Union (IoU) thresholds, ranging from 0.5 to 0.95. This metric allows us to evaluate detection performance across different object sizes, including large, medium, and small objects, providing a robust measure of our model’s capability in real-world streaming perception scenarios.
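Two ingredients of this metric can be sketched briefly. The first is the IoU threshold sweep from 0.5 to 0.95, where the step of 0.05 is assumed from the usual COCO-style convention. The second is the streaming pairing, in which each ground-truth timestamp is compared against the most recent prediction that had finished by then; this reflects our reading of the benchmark of Li et al. (2020) and is stated as an assumption. Function names and timings are hypothetical.

```python
from bisect import bisect_right

# IoU thresholds 0.50, 0.55, ..., 0.95 (step of 0.05 is an assumption).
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def iou(box_a, box_b):
    """Intersection over Union for (x1, y1, x2, y2) boxes with positive area."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union


def thresholds_passed(pred_box, gt_box):
    """Number of IoU thresholds at which the pair would count as a match."""
    score = iou(pred_box, gt_box)
    return sum(1 for t in IOU_THRESHOLDS if score >= t)


def latest_available(finish_times, query_time):
    """Index of the newest prediction finished by query_time, else None.

    finish_times must be sorted in ascending order.
    """
    idx = bisect_right(finish_times, query_time) - 1
    return idx if idx >= 0 else None


# Hypothetical finish times (seconds) of three successive predictions.
finished = [0.03, 0.07, 0.12]
paired = latest_available(finished, 0.10)
too_early = latest_available(finished, 0.02)
```

The pairing step is what penalizes latency: a slow model's output is matched against a later ground-truth frame, so objects that have moved in the meantime lower the IoU even if the detection was accurate for the frame it was computed on.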

4.2 Implementation Details

We tested three state-of-the-art streaming perception models: StreamYOLO Yang et al. (2022), LongShortNet Li et al. (2023), and DAMO-StreamNet He et al. (2023). These models, integral to the DyRoNet architecture, come with pre-trained parameters across three distinct scales: small (S), medium (M), and large (L), catering to a variety of processing requirements. In constructing the model bank $\mathcal{P}$ for DyRoNet, we selected different model configurations to evaluate performance across diverse scenarios. For instance, the notation DyRoNet ($\text{DAMO}_{\text{S + M}}$) represents a configuration where DyRoNet employs the small (S) and medium (M) scales of DAMO-StreamNet as its two branches. Similar notations are used for other model combinations, allowing for a systematic exploration of the framework’s adaptability and performance under varying computational constraints. All experiments were conducted on a computing platform equipped with four Nvidia 3090Ti GPUs, which provided a consistent and controlled environment for evaluating DyRoNet across different model configurations. For more implementation details, please refer to Appendix C.

4.3 Comparison with SOTA Methods

We compare DyRoNet with state-of-the-art methods on the Argoverse-HD dataset Li et al. (2020). The results of competing methods are taken directly from their original papers. As an overview, DyRoNet with a model bank from the DAMO-StreamNet series achieves 37.8% sAP at 39.60 ms latency, outperforming the current state-of-the-art methods in latency by a significant margin. For the StreamYOLO and LongShortNet model banks, DyRoNet attains 36.9% and 37.1% sAP at 29.35 ms and 30.48 ms latency, respectively, surpassing the original models dramatically. This demonstrates the effectiveness of the systematic improvements in DyRoNet.

Model Bank	Random	LoRA + Router
$\text{StreamYOLO}_{\text{S + M}}$	39.16	26.25
$\text{StreamYOLO}_{\text{S + L}}$	24.04	29.35
$\text{StreamYOLO}_{\text{M + L}}$	24.69	23.51
$\text{LongShortNet}_{\text{S + M}}$	24.79	21.47
$\text{LongShortNet}_{\text{S + L}}$	21.49	30.48
$\text{LongShortNet}_{\text{M + L}}$	24.75	29.05
$\text{DAMO-StreamNet}_{\text{S + M}}$	36.61	33.22
$\text{DAMO-StreamNet}_{\text{S + L}}$	35.12	39.60
$\text{DAMO-StreamNet}_{\text{M + L}}$	37.30	37.61

Table 2: Comparison of inference time (ms) on a single RTX 3090. For each model bank, the better inference time between random selection and the trained LoRA + Router is highlighted in green font.

Model Bank	Full	LoRA
$\text{StreamYOLO}_{\text{S + M}}$	32.9	33.7
$\text{StreamYOLO}_{\text{S + L}}$	36.1	36.9
$\text{StreamYOLO}_{\text{M + L}}$	36.2	35.0
$\text{LongShortNet}_{\text{S + M}}$	29.0	30.5
$\text{LongShortNet}_{\text{S + L}}$	36.2	37.1
$\text{LongShortNet}_{\text{M + L}}$	36.3	36.9
$\text{DAMO-StreamNet}_{\text{S + M}}$	34.8	35.5
$\text{DAMO-StreamNet}_{\text{S + L}}$	31.1	37.8
$\text{DAMO-StreamNet}_{\text{M + L}}$	37.4	37.8

Table 3: Comparison of LoRA fine-tuning and full fine-tuning (sAP). Full denotes full fine-tuning and LoRA denotes LoRA fine-tuning. The better value between Full and LoRA is shown in red font.

	Model Bank	$b_{0}$	$b_{1}$	$b_{2}$	Random	sAP
$K=2$ same model	$\text{DAMO}_{\text{S + M}}$	31.8	35.5	-	33.5	35.5
	$\text{DAMO}_{\text{S + L}}$	31.8	37.8	-	34.5	37.8
	$\text{DAMO}_{\text{M + L}}$	35.5	37.8	-	36.5	37.8
	$\text{LSN}_{\text{S + M}}$	29.8	34.1	-	31.8	30.5
	$\text{LSN}_{\text{S + L}}$	29.8	37.1	-	33.4	37.1
	$\text{LSN}_{\text{M + L}}$	34.1	37.1	-	35.6	36.9
	$\text{sYOLO}_{\text{S + M}}$	29.5	33.7	-	31.5	33.7
	$\text{sYOLO}_{\text{S + L}}$	29.5	36.9	-	33.2	36.9
	$\text{sYOLO}_{\text{M + L}}$	33.7	36.9	-	35.4	35.0
$K=2$ different model	$\text{DAMO}_{\text{S}}$ + $\text{LSN}_{\text{S}}$	31.8	29.8	-	30.7	30.5
	$\text{DAMO}_{\text{S}}$ + $\text{LSN}_{\text{M}}$	31.8	34.1	-	32.6	34.1
	$\text{DAMO}_{\text{S}}$ + $\text{LSN}_{\text{L}}$	31.8	37.1	-	34.3	31.8
	$\text{DAMO}_{\text{M}}$ + $\text{LSN}_{\text{S}}$	35.5	29.8	-	32.6	29.8
	$\text{DAMO}_{\text{L}}$ + $\text{LSN}_{\text{S}}$	37.8	29.8	-	33.8	29.8
$K=3$ same model	$\text{DAMO}_{\text{S + M + L}}$	31.8	35.5	37.8	34.8	37.7
	$\text{LSN}_{\text{S + M + L}}$	29.8	34.1	37.1	33.5	36.1
	$\text{sYOLO}_{\text{S + M + L}}$	29.5	33.7	36.9	33.4	36.6

Table 4: Ablation of model bank settings. $K$ denotes the number of models in the bank $\mathcal{P}$.

4.4 Inference Time

We analyzed the trade-off between DyRoNet’s inference time and performance under different model bank selection strategies. Table 2 presents the findings, with the better time in each comparison marked in green. Compared with random branch selection, DyRoNet maintains competitive inference speed while improving accuracy. This balance allows real-time requirements to be met without compromising perception quality, which is critical for autonomous driving, where both factors are paramount. The results support the practicality and efficiency of DyRoNet’s dynamic model selection.

4.5 Ablation Study

Router Network. To validate the effectiveness of the Router Network based on frame difference, we conducted comparative experiments using the frame difference $\Delta I_{t}$, the current frame $I_{t}$, and the concatenation of the current frame with the previous frame $[I_{t}+I_{t-1}]$ as the input modality of the Router Network. The experimental results are presented in Tab. 5. To control variables, we froze the model bank during training and trained only the Router Network, and only the three StreamYOLO configurations were used as model banks. It can be observed that using the frame difference as input performs better than the other two input modalities for two of the three model banks (35.0 for $\text{StreamYOLO}_{\text{S + L}}$ and 34.6 for $\text{StreamYOLO}_{\text{M + L}}$). This indicates that frame differences offer clear advantages in characterizing environmental speed. Conversely, it suggests that a single frame or concatenated frames give the lightweight Router Network a weaker signal for branch selection.

Model Bank	Input Modality	LoRA
$\text{StreamYOLO}_{\text{S + M}}$	$I_{t}$	33.7
	$[I_{t}+I_{t-1}]$	33.7
	$\Delta I_{t}$	32.6
$\text{StreamYOLO}_{\text{S + L}}$	$I_{t}$	34.1
	$[I_{t}+I_{t-1}]$	30.2
	$\Delta I_{t}$	35.0
$\text{StreamYOLO}_{\text{M + L}}$	$I_{t}$	33.7
	$[I_{t}+I_{t-1}]$	33.7
	$\Delta I_{t}$	34.6

Table 5: Ablation of router network input. The optimal results are marked in red font under the same model bank setting.
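The role of the frame difference as a routing signal can be illustrated with a minimal sketch. It is not the learned Router Network evaluated above: the hand-set thresholds, the grayscale nested-list frame format, and the names `frame_difference`, `motion_score`, and `route` are all assumptions made for illustration, and the mapping from the returned index to a branch is left open.

```python
from typing import List, Sequence

# Assumed format: a grayscale frame as rows of pixel intensities in [0, 1].
Frame = List[List[float]]


def frame_difference(curr: Frame, prev: Frame) -> Frame:
    """Element-wise difference Delta I_t = I_t - I_{t-1}."""
    return [
        [c - p for c, p in zip(row_c, row_p)]
        for row_c, row_p in zip(curr, prev)
    ]


def motion_score(delta: Frame) -> float:
    """Mean absolute frame difference, used here as a crude speed proxy."""
    values = [abs(v) for row in delta for v in row]
    return sum(values) / len(values)


def route(curr: Frame, prev: Frame, thresholds: Sequence[float]) -> int:
    """Return a branch index in [0, len(thresholds)].

    Hypothetical stand-in for the learned router: the index is the number
    of thresholds that the motion score meets or exceeds.
    """
    score = motion_score(frame_difference(curr, prev))
    return sum(score >= t for t in thresholds)


static_prev = [[0.2, 0.2], [0.2, 0.2]]
static_curr = [[0.2, 0.2], [0.2, 0.2]]
moving_curr = [[0.9, 0.9], [0.9, 0.9]]

slow_branch = route(static_curr, static_prev, thresholds=[0.1])
fast_branch = route(moving_curr, static_prev, thresholds=[0.1])
```

A single frame carries no such temporal signal, which is one plausible reading of why $I_{t}$ alone is a weaker router input in Tab. 5.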

Branch Selection. Our experiments on streaming perception models show that combining models of different scales in the model bank affects overall performance. We found that combining large and small models strikes a favorable balance, resulting in significant speed improvements. This conclusion is supported by the empirical evidence presented in Tab. 4, which shows that the large-small model pairing performs at least as well as the small-medium and medium-large combinations. Our findings highlight the importance of strategic model scaling in streaming perception and provide a framework for future model optimization in similar domains.

Fine-tuning Scheme. We compared direct (full) fine-tuning with the Low-Rank Adapter (LoRA) fine-tuning strategy Zhu et al. (2023) for streaming perception models. Results are listed in Tab. 3. LoRA fine-tuning surpasses direct fine-tuning in most model bank configurations, with the DAMO-StreamNet-based model bank configuration realizing an absolute gain of over 1.6%. This supports LoRA’s ability to avoid the forgetting and data distribution bias inherent to direct fine-tuning, and indicates that it can mitigate the overfitting that may arise during model bank fine-tuning, leading to a stable overall performance improvement.

LoRA Rank. To assess the impact of different LoRA ranks in DyRoNet, we conducted experiments with $\mathrm{rank}=32$, $16$, and $8$. Each experiment was trained for 5 epochs, alternating between Router Network training and model bank fine-tuning. The results are presented in Tab. 6. Performance is better with $\mathrm{rank}=32$ than with $\mathrm{rank}=16$ or $\mathrm{rank}=8$, while the LoRA parameters occupy only about 10% to 14% of the total model parameters. Therefore, $\mathrm{rank}=32$ was selected as the default setting for our experiments. Although a smaller LoRA rank occupies fewer parameters, it leads to a rapid performance decay. The experimental results show that, with LoRA fine-tuning, it is possible to achieve performance competitive with a single model while utilizing a smaller parameter footprint.

Model Bank	Rank	branch 0	branch 1	after train	Param.(%)
$\text{DAMO}_{\text{S + L}}$	32	31.8	37.8	37.8	14.35
$\text{DAMO}_{\text{S + L}}$	16	31.8	37.8	35.9	7.73
$\text{DAMO}_{\text{S + L}}$	8	31.8	37.8	35.9	4.02
$\text{LSN}_{\text{S + L}}$	32	29.8	37.1	36.9	10.39
$\text{LSN}_{\text{S + L}}$	16	29.8	37.1	30.6	5.48
$\text{LSN}_{\text{S + L}}$	8	29.8	37.1	30.6	5.48
$\text{sYOLO}_{\text{S + L}}$	32	29.5	36.9	36.6	10.21
$\text{sYOLO}_{\text{S + L}}$	16	29.5	36.9	35.0	5.38
$\text{sYOLO}_{\text{S + L}}$	8	29.5	36.9	35.0	2.7

Table 6: Ablation of LoRA rank. The Param. column reports only the proportion of parameters occupied by LoRA relative to the entire model. The best performance under the same model bank setting is highlighted in red font.
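The near-linear relation between rank and the Param. column in Tab. 6 follows from how LoRA adds parameters: for a weight of shape $d_{out}\times d_{in}$, a rank-$r$ adapter adds $r\,(d_{in}+d_{out})$ parameters. The sketch below illustrates this count on a toy stack of layers. The layer shapes are hypothetical, the treatment of every layer as a dense matrix is a simplification, and defining the fraction as adapter parameters over base plus adapter parameters is an assumption about how the column was computed.

```python
from typing import List, Tuple


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA pair: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_fraction(layers: List[Tuple[int, int]], rank: int) -> float:
    """Share of adapter parameters in the adapted model (assumed definition)."""
    base = sum(d_in * d_out for d_in, d_out in layers)
    extra = sum(lora_params(d_in, d_out, rank) for d_in, d_out in layers)
    return extra / (base + extra)


# Hypothetical toy model: four 256 x 256 dense layers.
toy_layers = [(256, 256)] * 4
fractions = {rank: lora_fraction(toy_layers, rank) for rank in (32, 16, 8)}
```

Halving the rank halves the number of adapter parameters, which matches the roughly halved percentages reported for most rows of Tab. 6.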

5 Conclusion

In conclusion, we present the Dynamic Routering Network (DyRoNet), a system that dynamically selects specialized detectors for varied environmental conditions with minimal computational overhead. Our increase-boosting fine-tuning, featuring a Low-Rank Adapter, mitigates distribution bias and overfitting, enhancing scene-specific performance. Experimental results validate DyRoNet’s state-of-the-art performance, offering a benchmark for streaming perception and insights for future research. In the future, DyRoNet’s principles may inform the development of more advanced, reliable systems.

References

Bejnordi et al. [2019] Babak Ehteshami Bejnordi, Tijmen Blankevoort, and Max Welling. Batch-shaping for learning conditional channel gated networks. In International Conference on Learning Representations, 2019.
Cai et al. [2021] Shaofeng Cai, Yao Shu, and Wei Wang. Dynamic routing networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3588–3597, January 2021.
Chen et al. [2020] Yinpeng Chen, Xiyang Dai, Mengchen Liu, Dongdong Chen, Lu Yuan, and Zicheng Liu. Dynamic convolution: Attention over convolution kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2020.
Chen et al. [2023] Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers, 2023.
Ge et al. [2021] Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430, 2021.
Ghosh et al. [2021] Anurag Ghosh, Akshay Nambi, Aditya Singh, Harish Yvs, and Tanuja Ganu. Adaptive streaming perception using deep reinforcement learning. arXiv preprint arXiv:2106.05665, 2021.
Guo et al. [2019] Junyao Guo, Unmesh Kurup, and Mohak Shah. Is it safe to drive? an overview of factors, metrics, and datasets for driveability assessment in autonomous driving. IEEE Transactions on Intelligent Transportation Systems, 21(8):3135–3151, 2019.
Han et al. [2021] Yizeng Han, Gao Huang, Shiji Song, Le Yang, Honghui Wang, and Yulin Wang. Dynamic neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7436–7456, 2021.
He et al. [2023] Jun-Yan He, Zhi-Qi Cheng, Chenyang Li, Wangmeng Xiang, Binghui Chen, Bin Luo, Yifeng Geng, and Xuansong Xie. Damo-streamnet: Optimizing streaming perception in autonomous driving. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23, pages 810–818. International Joint Conferences on Artificial Intelligence Organization, 8 2023.
Hu et al. [2021] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
Huang and Chen [2023] Yihui Huang and Ningjiang Chen. Mtd: Multi-timestep detector for delayed streaming perception. In Chinese Conference on Pattern Recognition and Computer Vision, pages 337–349. Springer, 2023.
Huang et al. [2018] Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens van der Maaten, and Kilian Weinberger. Multi-scale dense networks for resource efficient image classification. In International Conference on Learning Representations, 2018.
Jo et al. [2022] Wonwoo Jo, Kyungshin Lee, Jaewon Baik, Sangsun Lee, Dongho Choi, and Hyunkyoo Park. Dade: Delay-adaptive detector for streaming perception. arXiv preprint arXiv:2212.11558, 2022.
Jocher et al. [2021] Glenn Jocher, Alex Stoken, Jirka Borovec, Ayush Chaurasia, Liu Changyu, Adam Hogan, Jan Hajek, Laurentiu Diaconu, Yonghye Kwon, Yann Defretin, et al. ultralytics/yolov5: v5. 0-yolov5-p6 1280 models, aws, supervise. ly and youtube integrations. Zenodo, 2021.
Lan et al. [2023] Jin-Peng Lan, Zhi-Qi Cheng, Jun-Yan He, Chenyang Li, Bin Luo, Xu Bao, Wangmeng Xiang, Yifeng Geng, and Xuansong Xie. Procontext: Exploring progressive context transformer for tracking. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1–5. IEEE, 2023.
Li et al. [2020] Mengtian Li, Yu-Xiong Wang, and Deva Ramanan. Towards streaming perception. In Proceedings of the European Conference on Computer Vision, pages 473–488. Springer, 2020.
Li et al. [2023] Chenyang Li, Zhi-Qi Cheng, Jun-Yan He, Pengyu Li, Bin Luo, Hanyuan Chen, Yifeng Geng, Jin-Peng Lan, and Xuansong Xie. Longshortnet: Exploring temporal and semantic features fusion in streaming perception. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1–5. IEEE, 2023.
Lin et al. [2023] Zhihao Lin, Yongtao Wang, Jinhe Zhang, and Xiaojie Chu. Dynamicdet: A unified dynamic architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6282–6291, June 2023.
Mahaur and Mishra [2023] Bharat Mahaur and KK Mishra. Small-object detection based on yolov5 in autonomous driving systems. Pattern Recognition Letters, 168:115–122, 2023.
Muhammad et al. [2020] Khan Muhammad, Amin Ullah, Jaime Lloret, Javier Del Ser, and Victor Hugo C de Albuquerque. Deep learning for safe autonomous driving: Current challenges and future directions. IEEE Transactions on Intelligent Transportation Systems, 22(7):4316–4336, 2020.
Qiao et al. [2022] Jian-Jun Qiao, Zhi-Qi Cheng, Xiao Wu, Wei Li, and Ji Zhang. Real-time semantic segmentation with parallel multiple views feature augmentation. In ACM International Conference on Multimedia, pages 6300–6308, 2022.
Sela et al. [2022] Gur-Eyal Sela, Ionel Gog, Justin Wong, Kumar Krishna Agrawal, Xiangxi Mo, Sukrit Kalra, Peter Schafhalter, Eric Leong, Xin Wang, Bharathan Balaji, et al. Context-aware streaming perception in dynamic environments. In Proceedings of the European Conference on Computer Vision, pages 621–638. Springer, 2022.
Shazeer et al. [2017] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
Su et al. [2019] Hang Su, Varun Jampani, Deqing Sun, Orazio Gallo, Erik Learned-Miller, and Jan Kautz. Pixel-adaptive convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2019.
Wang et al. [2018] Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, and Joseph E. Gonzalez. Skipnet: Learning dynamic routing in convolutional networks. In Proceedings of the European Conference on Computer Vision, September 2018.
Wang et al. [2020] Yulin Wang, Kangchen Lv, Rui Huang, Shiji Song, Le Yang, and Gao Huang. Glance and focus: a dynamic approach to reducing spatial redundancy in image classification. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 2432–2444. Curran Associates, Inc., 2020.
Wang et al. [2023a] Hao Wang, Zhi-Qi Cheng, Jingdong Sun, Xin Yang, Xiao Wu, Hongyang Chen, and Yan Yang. Debunking free fusion myth: Online multi-view anomaly detection with disentangled product-of-experts modeling. In ACM International Conference on Multimedia, pages 3277–3286, 2023.
Wang et al. [2023b] Xiaofeng Wang, Zheng Zhu, Yunpeng Zhang, Guan Huang, Yun Ye, Wenbo Xu, Ziwei Chen, and Xingang Wang. Are we ready for vision-centric driving streaming perception? the asap benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9600–9610, 2023.
Wang et al. [2023c] Zixiao Wang, Weiwei Zhang, and Bo Zhao. Estimating optical flow with streaming perception and changing trend aiming to complex scenarios. Applied Sciences, 13(6):3907, 2023.
Yang et al. [2019] Brandon Yang, Gabriel Bender, Quoc V Le, and Jiquan Ngiam. Condconv: Conditionally parameterized convolutions for efficient inference. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
Yang et al. [2022] Jinrong Yang, Songtao Liu, Zeming Li, Xiaoping Li, and Jian Sun. Real-time object detection for streaming perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5385–5395, June 2022.
Zhang et al. [2023] Ji Zhang, Xiao Wu, Zhi-Qi Cheng, Qi He, and Wei Li. Improving anomaly segmentation with multi-granularity cross-domain alignment. In ACM International Conference on Multimedia, pages 8515–8524, 2023.
Zhu et al. [2019] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2019.
Zhu et al. [2023] Jiawen Zhu, Zhi-Qi Cheng, Jun-Yan He, Chenyang Li, Bin Luo, Huchuan Lu, Yifeng Geng, and Xuansong Xie. Tracking with human-intent reasoning. arXiv preprint arXiv:2312.17448, 2023.
Zou et al. [2023] Zhengxia Zou, Keyan Chen, Zhenwei Shi, Yuhong Guo, and Jieping Ye. Object detection in 20 years: A survey. Proceedings of the IEEE, 2023.

DyRoNet: A Low-Rank Adapter Enhanced Dynamic Routing Network for Streaming Perception
(Supplementary Material)

The appendix completes the main paper by providing in-depth research details and extended experimental results. The structure of the appendix is organized as follows:

1. Analysis of Environmental Factors Affecting Streaming Perception: Sec. A
   - Impact of Weather Conditions: Sec. A.1
   - Quantitative Analysis of Objects: Sec. A.2
   - Proportion of Small Objects: Sec. A.3
   - Environmental Speed Dynamics: Sec. A.4
2. Expanded Experimental Results: Sec. B
   - Inference Time Analysis: Sec. B.1
3. Detailed Description of DyRoNet: Sec. C
   - Selection of Pre-trained Model: Sec. C.1
   - Hyperparameter Settings: Sec. C.2

Appendix A Factor Analysis in Streaming Perception

In developing DyRoNet, we undertook an extensive survey and analysis to identify the key factors in autonomous driving scenarios that could affect streaming perception. This analysis used the Argoverse-HD dataset Li et al. [2020], a benchmark in the field of streaming perception. The primary goal was to isolate the most critical factor affecting streaming perception performance. As elaborated in the main text, our analysis identified the speed of the environment as the predominant factor, and DyRoNet is therefore tailored to address this specific aspect. Our analysis focuses on four primary elements: weather conditions, object quantity, small object proportion, and environmental speed. We examined each of these factors in turn to evaluate its impact on streaming perception within autonomous driving.

A.1 Impact of Weather Conditions

The Argoverse-HD dataset, comprising testing, training, and validation sets, includes a diverse range of weather conditions. Specifically, the dataset contains 24, 65, and 24 video segments in the testing, training, and validation sets, respectively, with frame counts ranging from 400 to 900 per segment. Tab. 7 details the distribution of various weather types across these subsets. Fig. 4 provides visual examples of different weather conditions captured in the dataset. A clear variation in visual clarity and perception difficulty is observable under different conditions, with scenarios like Rainy + Night appearing visually more challenging than Sunny + Day or Cloudy + Day.

To evaluate the impact of weather conditions on streaming perception, we conducted tests using a range of pre-trained models from StreamYOLO Yang et al. [2022], LongShortNet Li et al. [2023], and DAMO-StreamNet He et al. [2023], employing various scales and settings. The results, presented in Tab. 8, indicate that performance is generally better during Day conditions compared to Night. This confirms that weather conditions indeed influence streaming perception.

However, it’s noteworthy that even within the same weather conditions, model performance varies significantly, with accuracy ranging from below 10% to above 70%. Fig. 5 illustrates this point by comparing frames from two video segments (Clip ids: 00c561 and 395560) under identical weather conditions, where the performance difference of the same model on these segments is as high as 32.1%. This observation suggests the presence of other critical environmental factors that affect streaming perception, indicating that weather, while influential, is not the sole determinant of model performance.

	test	train	val
Sunny + Day	8	34	8
Cloudy + Day	13	27	15
Rainy + Day	1	1	0
Rainy + Night	1	0	0
Sunny + Night	1	3	1

Table 7: Distribution of Weather Conditions in Testing, Training, and Validation Sets: This table lists the number of video segments recorded under each weather condition in the testing, training, and validation sets of the Argoverse-HD dataset, providing an overview of the environmental variability within each dataset subset.

		StreamYOLO					LongShortNet				DAMO-StreamNet
Clip ID	Weather	s 1x	m 1x	l 1x	l 2x	l still	s 1x	m 1x	l 1x	l high	s 1x	m 1x	l 1x	l high
1d6767	Cloudy + Day	20.9	22.8	24.9	7.0	26.7	20.9	23.4	25.0	36.4	21.3	24.6	26.0	34.2
5ab269	Cloudy + Day	25.6	30.0	31.6	6.9	33.3	25.2	29.5	31.4	40.1	26.9	29.0	31.7	41.2
70d2ae	Cloudy + Day	26.3	31.4	37.9	9.4	41.0	25.2	31.0	37.5	44.7	27.7	34.8	34.3	44.9
337375	Cloudy + Day	24.8	24.8	33.4	17.1	35.3	27.2	27.9	34.7	38.0	26.4	37.5	28.8	39.1
7d37fc	Cloudy + Day	32.5	36.4	41.5	15.5	42.1	33.6	37.7	40.8	45.8	35.2	40.1	39.4	45.7
f1008c	Cloudy + Day	38.6	42.0	44.4	11.3	46.2	40.0	40.4	45.3	50.3	39.1	42.4	45.8	54.1
f9fa39	Cloudy + Day	35.7	39.5	41.8	9.9	48.1	33.2	39.8	42.9	50.1	38.8	44.1	44.3	51.4
cd6473	Cloudy + Day	40.0	45.7	44.0	11.3	52.7	36.6	47.3	47.3	54.0	40.2	44.6	47.9	54.7
cb762b	Cloudy + Day	36.4	41.3	44.3	10.8	44.8	36.9	41.4	44.4	57.7	40.9	44.8	43.7	57.6
aeb73d	Cloudy + Day	39.6	44.6	45.2	12.5	46.7	39.2	46.7	45.9	52.3	42.6	46.4	47.5	51.3
cb0cba	Cloudy + Day	48.3	47.5	52.1	13.8	50.9	46.0	47.5	50.4	55.5	47.1	47.7	51.5	59.4
e9a962	Cloudy + Day	45.6	53.8	55.4	15.8	58.8	44.0	52.8	55.6	60.7	45.1	50.2	52.9	56.2
2d12da	Cloudy + Day	50.8	56.5	56.2	11.9	58.8	48.5	54.6	56.6	59.1	53.1	54.8	57.5	63.8
85bc13	Cloudy + Day	56.2	56.8	60.1	19.5	62.1	55.3	58.2	59.2	63.5	54.9	58.3	59.6	67.3
00c561	Sunny + Day	16.2	19.0	20.5	5.1	22.2	17.6	20.1	20.2	26.4	17.9	19.3	21.5	25.2
c9d6eb	Sunny + Day	22.5	28.9	32.5	7.5	35.3	22.6	28.8	32.9	39.1	24.5	26.0	28.4	38.6
cd5bb9	Sunny + Day	23.3	24.9	25.8	6.2	27.2	23.4	25.2	25.8	30.4	23.4	25.7	26.2	31.5
6db21f	Sunny + Day	24.1	26.4	27.0	6.7	28.9	23.3	27.0	27.0	34.7	25.1	28.0	28.7	37.0
647240	Sunny + Day	27.1	29.3	31.2	7.8	34.1	26.5	30.1	31.5	38.8	26.9	32.0	32.0	38.4
da734d	Sunny + Day	30.2	33.4	37.0	8.8	39.9	29.2	34.4	37.5	42.6	34.2	35.7	38.2	43.1
5f317f	Sunny + Day	31.9	42.3	45.9	8.9	50.1	32.8	42.0	46.1	51.2	40.0	44.6	47.0	54.0
395560	Sunny + Day	49.3	61.2	60.6	11.3	72.1	51.7	60.7	58.5	65.4	58.9	63.4	57.8	59.6
b1ca08	Sunny + Day	60.0	62.1	68.4	22.4	67.9	61.7	61.4	67.7	70.6	59.6	65.0	67.7	68.6
033669	Sunny + Night	18.0	23.5	25.7	6.6	27.4	18.5	23.6	25.1	27.6	21.8	22.7	23.8	27.5
Overall	–	29.8	33.7	36.9	34.6	39.4	29.8	34.1	37.1	42.7	31.8	35.5	37.8	43.3

Table 8: Offline Evaluation Results on the Argoverse-HD Validation Dataset: It records the sAP scores across the 0.50 to 0.95 range for each clip. The optimal and worst results are highlighted in green and red font under the same weather conditions. The notation “l high” is used as an abbreviation for the resolution 1200x1920, providing a concise representation of the data.

A.2 Analysis of Object Quantity Impact

To assess the impact of the number of objects on streaming perception, we conducted a statistical analysis of object counts per frame in the Argoverse-HD dataset, covering both the training and validation sets. The results are depicted in Fig. 6, a histogram of the number of objects in individual frames. The variance of the distribution is notable, with values of $74.66$ for the training set and $75.39$ for the validation set, indicating significant fluctuation in the number of objects across frames. Additionally, as shown in Tab. 9, object counts vary considerably across video segments. This observation led us to investigate the potential correlation between object quantity and model performance fluctuations.

To explore this correlation, we calculated the average number of objects per frame for each segment within the Argoverse-HD validation set. The findings, detailed in Tab. 9, include the average object counts alongside Spearman correlation coefficients, which measure the monotonic relationship between object quantity and model performance. The absolute values of these coefficients lie between 1e-2 and 1e-1, which suggests that the number of objects present in the environment has no strong or significant correlation with the performance of streaming perception models. In other words, the sheer quantity of objects within the environment is not a predominant factor influencing the efficacy of streaming perception.

Clip ID	Mean Obj $\uparrow$	sYOLO	LSN	DAMO
1d6767	35.30	20.9	20.9	21.3
7d37fc	30.89	32.5	33.6	35.2
da734d	25.16	30.2	29.2	34.2
cd6473	23.75	40.0	36.6	40.2
5ab269	23.37	25.6	25.2	26.9
cb762b	23.31	36.4	36.9	40.9
f1008c	23.08	38.6	40.0	39.1
e9a962	21.58	45.6	44.0	45.1
70d2ae	21.38	26.3	25.2	27.7
2d12da	19.33	50.8	48.5	53.1
337375	18.19	24.8	27.2	26.4
f9fa39	17.46	35.7	33.2	38.8
aeb73d	16.82	39.6	39.2	42.6
6db21f	16.30	24.1	23.3	25.1
647240	14.18	27.1	26.5	26.9
b1ca08	14.08	60.0	61.7	59.6
85bc13	12.06	56.2	55.3	54.9
033669	11.89	18.0	18.5	21.8
00c561	10.06	16.2	17.6	17.9
cb0cba	10.04	48.3	46.0	47.1
395560	10.00	49.3	51.7	58.9
cd5bb9	8.95	23.3	23.4	23.4
c9d6eb	7.88	22.5	22.6	24.5
5f317f	6.92	31.9	32.8	40.0
Coefficient	–	0.052	0.035	-0.020

Table 9: Average number of objects per frame for each segment in the Argoverse-HD validation set, along with the Spearman correlation coefficients between object quantity and the performance of streaming perception models. The absolute values of the coefficients lie between 1e-2 and 1e-1, indicating a weak correlation. This suggests that object quantity is not a primary factor affecting the efficacy of streaming perception.
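For readers who want to reproduce this kind of analysis, the following is a minimal pure-Python sketch of the Spearman coefficient (the Pearson correlation of average ranks). The helper names are illustrative, and the sample uses only the first five rows of Tab. 9 for the sYOLO column, so its result is not expected to match the coefficient reported for the full table.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# First five rows of Tab. 9: mean objects per frame and sYOLO sAP.
mean_obj = [35.30, 30.89, 25.16, 23.75, 23.37]
syolo_sap = [20.9, 32.5, 30.2, 40.0, 25.6]
rho_subset = spearman(mean_obj, syolo_sap)
```

Because the coefficient depends only on ranks, it is insensitive to the scale of either quantity, which makes it a reasonable choice for comparing object counts with sAP scores.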

A.3 Analysis of the Proportion of Small Objects

The influence of small objects on perception models, particularly in autonomous driving scenarios, has been underscored in studies like Mahaur and Mishra [2023] and Yang et al. [2022]. In such scenarios, even minor shifts in viewing angles can cause notable relative displacement of small objects, posing a challenge for perception models in processing streaming data effectively. This observation prompted us to closely examine the proportion of small objects in the environment.

To begin, we analyzed the area ratios of objects in both the training and validation sets of the Argoverse-HD dataset. This involved calculating the ratio of the pixel area covered by an object’s bounding box to the total pixel area of the frame. We visualized these ratios in histograms shown in Fig. 7. The analysis revealed that the mean object area ratio is below 1e-2, indicating a substantial presence of small objects in the dataset. For simplicity in subsequent discussions, we define objects with an area ratio less than 1% as ‘small objects’.
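The area-ratio definition above translates directly into code. The sketch below is illustrative only: the `(x1, y1, x2, y2)` box format, the 1920 by 1200 frame size used in the example, the sample boxes, and the function names are assumptions, not details taken from the dataset tooling.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2) in pixels


def area_ratio(box: Box, frame_w: int, frame_h: int) -> float:
    """Bounding-box pixel area divided by the total frame pixel area."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1) / (frame_w * frame_h)


def small_object_proportion(
    boxes: Sequence[Box], frame_w: int, frame_h: int, threshold: float = 0.01
) -> float:
    """Fraction of boxes whose area ratio is below the threshold (1% by default)."""
    if not boxes:
        return 0.0
    small = sum(area_ratio(b, frame_w, frame_h) < threshold for b in boxes)
    return small / len(boxes)


# Hypothetical boxes in an assumed 1920 x 1200 frame.
boxes = [(0, 0, 100, 100), (10, 10, 60, 40), (0, 0, 960, 600)]
proportion = small_object_proportion(boxes, frame_w=1920, frame_h=1200)
```

In this toy example the first two boxes fall below the 1% threshold and the third, which covers a quarter of the frame, does not.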

Tab. 10 presents our findings on the proportion of small objects within the Argoverse-HD validation set. Despite some variability in the overall number of objects and small objects, the proportion of small objects remains relatively stable, as reflected in the low variance of the proportion. This stability suggests that small objects are a consistent and prominent feature across video segments, representing a persistent challenge for streaming perception.

sid	# obj $\uparrow$	# small obj	proportion
12	27829	24033	86%
3	16557	15937	96%
14	15058	14260	95%
15	12685	10229	81%
9	12618	11216	89%
5	12189	9509	78%
21	11801	10259	87%
18	11073	9856	89%
20	11068	10203	92%
7	10962	9707	89%
23	10961	9839	90%
2	10717	9700	91%
10	10706	9001	84%
22	10122	8846	87%
11	9965	8976	90%
4	9180	7989	87%
1	9068	8153	90%
24	8293	7830	94%
19	8068	6552	81%
17	4709	4230	90%
6	4420	3708	84%
16	7001	6508	93%
13	5654	5251	93%
8	3237	2449	76%
mean	10580	9343	87.96%
var	–	–	0.0026

Table 10: Distribution of Small Objects in the Argoverse-HD Validation Set: This table lists, for each video segment (sid), the total number of objects, the number of objects with an area proportion less than 1%, and the resulting proportion of small objects, giving a view of how prevalent small objects are across the video segments in the dataset.
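The proportion column of Tab. 10 is the ratio of the two count columns, and the stability claim rests on the variance of that ratio. The sketch below recomputes both for three rows copied from the table. Whether the reported variance is a population or a sample variance is not stated, so the use of `pvariance` is an assumption, and a three-row subset will not reproduce the reported value.

```python
from statistics import pvariance
from typing import Dict, List, Tuple

# (sid, number of objects, number of small objects): three rows from Tab. 10.
ROWS: List[Tuple[int, int, int]] = [
    (12, 27829, 24033),
    (3, 16557, 15937),
    (8, 3237, 2449),
]


def small_proportions(rows: List[Tuple[int, int, int]]) -> Dict[int, float]:
    """Map each segment id to its share of small objects."""
    return {sid: small / total for sid, total, small in rows}


props = small_proportions(ROWS)
percentages = {sid: round(100 * p) for sid, p in props.items()}
spread = pvariance(list(props.values()))  # population variance (assumed choice)
```

The rounded percentages for these rows agree with the proportion column of the table.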

A.4 Impact of Environmental Speed

In Sec. A.3, we highlighted how motion within the observer’s viewpoint can affect the perception of small objects. This observation leads us to consider that the speed of the environment could interact with the proportion of small objects.

To investigate the relationship between environmental speed and the performance variability of streaming perception models, we categorized the validation dataset into three distinct environmental states: stop, straight, and turning. We then manually divided the dataset based on these states. In this reorganized dataset, clips whose ID begins with 0 exclusively represent the stop state, while IDs beginning with 1 and 2 correspond to the straight and turning states, respectively.

Fig. 8 shows the performance of StreamYOLO, LongShortNet, and DAMO-StreamNet on each of these segments, together with the mean performance under each motion state. The data reveal a consistent pattern across all three models: performance in the stop state is better than in the straight state, which in turn is better than in the turning state. This trend indicates an association between the state of environmental motion and fluctuations in model performance.

Consequently, based on this analysis, we infer that the speed of the environment, particularly when considering the substantial proportion of small objects and their sensitivity to environmental dynamics, emerges as the most influential environmental factor in the context of streaming perception.
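The per-state averaging described in this section can be sketched as follows. The grouping rule (first digit of the clip ID) comes from the text above, while the clip IDs and sAP values in the example are made-up placeholders for illustration and are not results from Fig. 8.

```python
from collections import defaultdict
from typing import Dict, List

# Mapping stated in the text: first digit of the clip ID gives the motion state.
STATE_BY_DIGIT = {"0": "stop", "1": "straight", "2": "turning"}


def mean_sap_by_state(results: Dict[str, float]) -> Dict[str, float]:
    """Average sAP over the clips that share a motion state."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for clip_id, sap in results.items():
        buckets[STATE_BY_DIGIT[clip_id[0]]].append(sap)
    return {state: sum(v) / len(v) for state, v in buckets.items()}


# Placeholder IDs and scores, NOT measurements from the paper.
example_results = {"001": 40.0, "002": 50.0, "101": 30.0, "201": 20.0}
state_means = mean_sap_by_state(example_results)
```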

Appendix B More Experiment Results

B.1 Inference Time Analysis

This subsection supplements Section 4.4 of the main paper, where we previously discussed the performance of DyRoNet but did not extensively delve into its inference time characteristics. To address this, Tab. 11 presents a detailed comparison of the inference times for each independent branch used in our model. It is important to note that the inference times reported here may show variations when compared to those published by the original authors of the models. This discrepancy is primarily due to differences in the hardware platforms used and the specific configurations of the corresponding models in our experiments.

An interesting observation from the results is that there are instances where DyRoNet exhibits a slower inference time compared to either the random selection method or branch 1. This slowdown is attributed to the incorporation of the speed router in our sample routing mechanism. Despite this, it is evident from the overall results that DyRoNet, employing the router strategy, still retains real-time processing capabilities across the various branches in the model bank. Moreover, in certain scenarios, DyRoNet demonstrates even faster inference speeds than when using individual branches independently. This detailed analysis underlines the dynamic and adaptive nature of DyRoNet in balancing between inference speed and accuracy, highlighting its capability to optimize streaming perception tasks in real-time scenarios.

Branches	branch 0	branch 1	random	DyRoNet
$\text{DAMO}_{\text{S + M}}$	29.26	33.65	36.61	33.22
$\text{DAMO}_{\text{S + L}}$	29.26	36.63	35.12	39.60
$\text{DAMO}_{\text{M + L}}$	33.65	36.63	37.30	37.61
$\text{LSN}_{\text{S + M}}$	22.08	25.88	24.79	21.47
$\text{LSN}_{\text{S + L}}$	22.08	31.24	21.49	30.48
$\text{LSN}_{\text{M + L}}$	25.88	31.24	24.75	29.05
$\text{sYOLO}_{\text{S + M}}$	18.76	23.01	39.16	26.25
$\text{sYOLO}_{\text{S + L}}$	18.76	27.85	24.04	29.35
$\text{sYOLO}_{\text{M + L}}$	23.01	27.85	24.69	23.51

Table 11: In-Depth Analysis of DyRoNet’s Inference Time: This table compares the inference times of the individual branches, the random selection method, and DyRoNet. For ease of analysis, the optimal value in each comparison is highlighted in green font, which makes it easy to see whether random selection or DyRoNet achieves the faster inference speed under each condition.
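The random versus DyRoNet comparison in Tab. 11 can be checked mechanically. The sketch below copies those two columns from the table and lists the model banks for which the routed model is faster; the dictionary layout and the function name are illustrative choices.

```python
from typing import Dict, List, Tuple

# (random, DyRoNet) inference times copied from Tab. 11.
TABLE_11: Dict[str, Tuple[float, float]] = {
    "DAMO S+M": (36.61, 33.22),
    "DAMO S+L": (35.12, 39.60),
    "DAMO M+L": (37.30, 37.61),
    "LSN S+M": (24.79, 21.47),
    "LSN S+L": (21.49, 30.48),
    "LSN M+L": (24.75, 29.05),
    "sYOLO S+M": (39.16, 26.25),
    "sYOLO S+L": (24.04, 29.35),
    "sYOLO M+L": (24.69, 23.51),
}


def faster_with_router(table: Dict[str, Tuple[float, float]]) -> List[str]:
    """Model banks where DyRoNet has a lower inference time than random."""
    return [name for name, (rand, dyro) in table.items() if dyro < rand]


router_wins = faster_with_router(TABLE_11)
```

This makes the mixed picture described above explicit: the router adds overhead in some configurations and saves time in others.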

Appendix C More Details of DyRoNet

Model	Scale	# of params
StreamYOLO	S	9,137,319
	M	25,717,863
	L	54,914,343
LongShortNet	S	9,282,103
	M	25,847,783
	L	55,376,515
DAMO-StreamNet	S	18,656,357
	M	50,129,333
	L	94,156,945

Table 12: Parameter Count of Selected Pre-trained Models: This table lists the number of parameters for each pre-trained model chosen for our analysis. It provides a quantitative overview of the complexity and size of the models, facilitating a comparison of their computational requirements.

C.1 Pre-trained Model Selection

As outlined in the main paper, our implementation of DyRoNet incorporates three existing models as branches within the Model Bank $\mathcal{P}$: StreamYOLO Yang et al. [2022], LongShortNet Li et al. [2023], and DAMO-StreamNet He et al. [2023]. These models were selected due to their specialized features and proven effectiveness in streaming perception tasks. StreamYOLO is unique for its two additional pre-trained weight variants, each tailored for different streaming processing speeds. This feature allows for adaptable performance depending on the speed requirements of the streaming task. In contrast, LongShortNet and DAMO-StreamNet are equipped with pre-trained weights optimized for high-resolution image processing, making them suitable for scenarios where image clarity is paramount.

To ensure a diverse and versatile range of options within the Model Bank, our implementation of DyRoNet selectively utilizes the Small (S), Medium (M), and Large (L) variants of the pre-trained weights from each model. This choice enables a balanced mix of processing speeds and resolution handling capabilities, catering to a wide range of streaming perception scenarios. The specific details regarding the number of parameters for these pre-trained models can be found in Table 12, which provides a comparative overview to help in understanding the computational complexity for different tasks.
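As a rough illustration of how the counts in Table 12 combine, the following sketch sums the parameters of every two-branch pairing (S + M, S + L, M + L) of a given model. The dictionary name `PARAMS` and the helper `pair_totals` are hypothetical names introduced here. The totals cover only the pre-trained branch weights; they exclude the router and the low-rank adapters, whose sizes are not listed in Table 12.

```python
from itertools import combinations

# Parameter counts copied from Table 12.
PARAMS = {
    "StreamYOLO": {"S": 9_137_319, "M": 25_717_863, "L": 54_914_343},
    "LongShortNet": {"S": 9_282_103, "M": 25_847_783, "L": 55_376_515},
    "DAMO-StreamNet": {"S": 18_656_357, "M": 50_129_333, "L": 94_156_945},
}


def pair_totals(model):
    """Sum branch parameters for each two-scale pairing of one model.

    Assumption: the footprint of a two-branch model bank is approximated
    by the sum of its branches (router and adapter weights are ignored).
    """
    scales = PARAMS[model]
    return {
        f"{a} + {b}": scales[a] + scales[b]
        for a, b in combinations(scales, 2)
    }


if __name__ == "__main__":
    for model in PARAMS:
        for pair, total in pair_totals(model).items():
            print(f"{model:15s} {pair:6s} {total:>12,d}")
```

Under this approximation, a StreamYOLO S + M bank holds about 34.9M branch parameters, noticeably fewer than the single L variant (about 54.9M).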

C.2 Setting of Hyperparameters

For all our experiments, we maintained consistent training hyperparameters to ensure comparability and reproducibility of results. The experiments were executed on four RTX 3090 GPUs. Because the routing process selects the optimal branch model for each sample individually, we set the batch size to $4$, allocating one sample to each GPU for parallel computation.

In alignment with the configuration used in StreamYOLO, we employed Stochastic Gradient Descent (SGD) as our optimization technique. The learning rate was set to $0.001\times\text{BatchSize}/64$, scaling proportionally with the batch size; with a batch size of $4$ this gives $6.25\times 10^{-5}$. Additionally, we used a cosine annealing schedule for the learning rate, preceded by a one-epoch warm-up phase to stabilize the initial training process.
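The schedule described above can be sketched as a plain function of the training step. This is a minimal illustration rather than the training code used in the experiments: the function name `learning_rate`, the linear shape of the warm-up, and the final learning rate of zero (`min_lr_ratio=0.0`) are assumptions, since the text specifies only the peak value, the cosine annealing, and the one-epoch warm-up.

```python
import math


def learning_rate(step, steps_per_epoch, total_epochs, batch_size=4,
                  base_lr_per_64=0.001, warmup_epochs=1, min_lr_ratio=0.0):
    """Learning rate at a given optimizer step (0-indexed).

    Peak value follows the text: 0.001 * BatchSize / 64.
    Assumptions: linear warm-up, cosine decay down to min_lr_ratio * peak.
    """
    peak_lr = base_lr_per_64 * batch_size / 64
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = total_epochs * steps_per_epoch

    if step < warmup_steps:
        # Linear ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps

    # Cosine annealing over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    min_lr = peak_lr * min_lr_ratio
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Hypothetical run length: 100 steps per epoch, 5 epochs.
    for s in (0, 50, 99, 100, 300, 500):
        print(s, learning_rate(s, steps_per_epoch=100, total_epochs=5))
```

With a batch size of 4 the peak learning rate is $6.25\times 10^{-5}$, reached at the end of the first epoch and annealed afterwards.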

Regarding data preprocessing, we ensured uniformity by resizing all input frames to $600\times 960$ pixels. This standardization was crucial for maintaining consistency across different datasets and ensuring that our model could generalize well across various input dimensions.

Acknowledgments

Zhi-Qi Cheng’s research in this project was supported by the US Department of Transportation, Office of the Assistant Secretary for Research and Technology, under the University Transportation Center Program (Federal Grant Number 69A3551747111), and Intel and IBM Fellowships.

Contribution Statement

Xiang Huang, Jun-Yan He, and Zhi-Qi Cheng developed the research idea together and were fully involved in the experimental work. Jun-Yan He and Zhi-Qi Cheng revised the manuscript and offered valuable insights and recommendations for the experimental procedures. The manuscript was then reviewed by Chenyang Li, Wangmeng Xiang, Baigui Sun, and Xiao Wu. Zhi-Qi Cheng served as the corresponding author and led the entire project.