Edge Computing Enabled Real-Time Video Analysis via Adaptive Spatial-Temporal Semantic Filtering
Abstract
This paper proposes a novel edge computing enabled real-time video analysis system for intelligent visual devices. The proposed system consists of a tracking-assisted object detection module (TAODM) and a region of interest module (ROIM). TAODM adaptively decides whether to process each video frame locally with a tracking algorithm or to offload it to the edge server for inference by an object detection model. ROIM determines each offloaded frame’s resolution and detection model configuration to ensure that the analysis results return in time. TAODM and ROIM interact jointly to filter the repetitive spatial-temporal semantic information, maximizing the processing rate while ensuring high video analysis accuracy. Unlike most existing works, this paper investigates real-time video analysis systems in which the intelligent visual device connects to the edge server through a wireless network with fluctuating network conditions. We decompose the real-time video analysis problem into the offloading decision and configurations selection sub-problems. To solve these two sub-problems, we introduce a double deep Q network (DDQN) based offloading approach and a contextual multi-armed bandit (CMAB) based adaptive configurations selection approach, respectively. A DDQN-CMAB reinforcement learning (DCRL) training framework is further developed to integrate these two approaches and improve the overall video analysis performance. Extensive simulations are conducted to evaluate the performance of the proposed solution and demonstrate its superiority over counterparts.
I Introduction
With the rapid growth of the Internet of Things (IoT), intelligent visual devices (e.g., smartphones, self-driving vehicles and unmanned aerial vehicles) have encouraged the emergence of innovative mobile applications, ranging from autonomous systems to extended reality[1]. These intelligent visual devices capture their surroundings and generate rapidly growing amounts of video data that need to be processed in real time. For example, self-driving applications leverage intelligent cameras to accurately detect lanes, cars, and pedestrians to avoid collisions. However, processing such massive computation-intensive and delay-sensitive video analysis tasks locally is significantly challenging due to the limited resources of intelligent visual devices[2]. Although an alternative is to offload tasks to powerful cloud servers, doing so incurs unacceptable delays because these servers are typically deployed far away from the devices.
Fortunately, the paradigm of edge computing[3], which deploys powerful edge servers at the edge of networks, provides pervasive, reliable and fast-responsive computing services for the devices, making the offloading of video analysis tasks to edge servers a promising approach. However, due to the limited uplink bandwidth and volatility of the cellular network, it is difficult to offload all video frames, which are large in size, to edge servers in real time[4]. Besides, the vast amount of repetitive spatial-temporal semantic information among video frames generates large transmission delays and severely impacts real-time analysis performance. Moreover, due to the huge amount of video traffic, users have to pay high data traffic costs. These factors motivate us to design an adaptive spatial-temporal semantic filtering based video analysis offloading strategy with low bandwidth cost and high detection accuracy under volatile network conditions.
Nevertheless, designing such a strategy is significantly challenging for the following reasons: i) real-time video generates massive traffic per unit of time, so the visual device cannot transmit all images to the edge server (ES) in a timely manner, and ii) a constantly growing number of intelligent visual devices with frequent video analysis task requests would quickly drain the resources of edge servers. In addition, computing power differs considerably among intelligent visual devices, so the performance of the same algorithm varies widely.
Aiming to solve the above problems, we study a long-term semantic-filtering offloading optimization problem to maximize the processing rate, i.e., the number of frames of detection results returned in time, and the detection accuracy under the time-varying network conditions. It is observed that this problem involves two coupled intricate sub-problems, an offloading decision problem and a configurations selection problem. On this basis, we propose a double deep Q network (DDQN) based offloading decision scheme and a contextual multi-armed bandit (CMAB) based adaptive configurations selection scheme for these two sub-problems, respectively. To obtain the optimal solution, we further propose a two-layer training framework, namely DDQN-CMAB reinforcement learning (DCRL) framework, for jointly solving these two sub-problems effectively. The main contributions are summarized as follows.
- A novel edge computing enabled real-time video analysis system based on spatial-temporal semantic filtering is proposed, which realizes real-time object detection with high accuracy under volatile and fluctuating network conditions.
- A DCRL framework is designed for solving the overall semantic-filtering offloading optimization problem under unpredictable network conditions.
- Simulations are conducted to examine the performance of the proposed DCRL framework on the multi-camera pedestrian video dataset[5].
The rest of this paper is organized as follows: Section II presents the system model and problem formulation. In Section III, two specific approaches are proposed to solve the decomposed sub-problems, and a DCRL framework is proposed to integrate these two approaches. Simulation results are given in Section IV, followed by conclusions in Section V.
II System Model and Problem Formulation
II-A System Model
![Refer to caption](x1.png)
Consider a collaborative edge-device real-time video analysis system, as illustrated in Fig. 1, which consists of a tracking-assisted object detection module (TAODM) and a region of interest module (ROIM) deployed on a mobile intelligent visual device to process a road condition video stream, and a dedicated ES with various advanced object detection models deployed at the base station. Besides, a frame-based time-slotted operation framework is studied to depict time-varying uncertainties of the video analysis system, where denotes the index of time slots and each time slot represents a decision interval. In the proposed edge computing enabled real-time video analysis system, an intelligent visual device with computing power connects to an ES with computing power through the cellular network with bandwidth . The intelligent visual device captures the real-time video stream and generates a real-time frame in each time slot by sampling. The real-time frame is denoted as , which is assumed to be generated at the beginning of time slot and will be processed by TAODM. TAODM adaptively determines the offloading mode, i.e., whether to process the frame locally with a tracking algorithm or to offload it to the ES. In the latter case, the offloaded frame is further processed by ROIM, which determines its resolution and detection model configuration according to the information density and bandwidth to ensure that the analysis results return in time. Since the tracking algorithm requires the detection results from the ES to initialize tracking targets, TAODM and ROIM are highly coupled. TAODM and ROIM interact jointly to maximize the processing rate while ensuring high video analysis accuracy.
Specifically, TAODM makes three decisions: i) the offloading decision , which indicates whether the frame is processed locally or offloaded to the ES; ii) the ROI decision , which indicates whether to adopt ROI extraction; and iii) the tracking mode to be adopted, chosen from Skip, KCF[6] and CSRT[7]. The Skip mode directly reuses the detection result of the last frame when the current frame carries little new semantic information compared to the last one, while KCF and CSRT are well-known trackers that work well on resource-limited IoT devices. When , the frame will be processed by TAODM with tracking mode , and we use to represent the processing accuracy of tracking mode . Additionally, the processing time of tracking mode can be calculated as , where indicates the computing intensity of tracking mode and represents the data size of frame . Note that, when . When , the ROI extraction decision will be made according to the bandwidth during time slot , and then the frame is further processed by ROIM.
In ROIM, the frame first goes through rough pre-detection by a local lightweight deep neural network (DNN) to obtain the bounding boxes and their number, which indicate the information density , with inference time . Then, if , all bounding boxes will be merged into several ROI blocks with extraction time , where represents the computing intensity of ROI . After that, only those ROI blocks are transmitted to the edge server for further processing. When , the whole frame will be transmitted instead. Considering that the inference requests of a frame can be served by different detection models, we use the set to denote the detection models held on the ES. YOLOv5 is a widely used object detection algorithm that provides models of various sizes to suit different scenarios. Besides, we use a set to denote the offloading resolutions, which are used to accelerate the model inference and reduce the transmission delay. Thus, to ensure real-time performance, ROIM makes two decisions according to the information density and bandwidth before transmitting the frame to the edge server, i.e., i) the model selected for object detection in the edge server and ii) the offloading resolution for each frame or block. Then, based on the decision of offloading resolution , the frame or all blocks are resized to the offloading resolution, and the size of the offloading data can be obtained as , where represents the number of blocks, is the data size of one pixel and is the pixel number of block in resolution . For simplification, transmitting a whole frame is treated as transmitting a single block. Finally, the frame or all ROI blocks with the corresponding model configurations incur the transmission delay and are then detected by the advanced detection model.
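The offloading data size and transmission delay described above can be illustrated with a minimal sketch. It assumes that the data size is the per-pixel size multiplied by the total pixel count of all blocks, and that the transmission delay is the data size divided by the available bandwidth. The function names, the (width, height) block representation and the MB and MB/s units are illustrative assumptions, not notation from this paper.

```python
from typing import Sequence, Tuple


def offload_size_mb(block_shapes: Sequence[Tuple[int, int]],
                    mb_per_pixel: float) -> float:
    """Sum the pixel counts of all blocks and scale by the per-pixel size.

    block_shapes holds one (width, height) pair per ROI block after resizing;
    a whole frame is passed as a single block.
    """
    total_pixels = sum(width * height for width, height in block_shapes)
    return total_pixels * mb_per_pixel


def transmission_delay_s(size_mb: float, bandwidth_mb_per_s: float) -> float:
    """Assumed delay model: data size divided by the current bandwidth."""
    if bandwidth_mb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_mb / bandwidth_mb_per_s


# Two ROI blocks versus one full frame at the same (hypothetical) pixel size.
roi_size = offload_size_mb([(100, 50), (10, 10)], mb_per_pixel=0.001)
full_size = offload_size_mb([(640, 360)], mb_per_pixel=0.001)
```

Under this model, sending only the ROI blocks shrinks the payload in proportion to the pixels that are dropped, which is where the saving in transmission delay comes from.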
Based on the above formulation, the completion time of frame can be expressed as
(1) |
where is the inference delay in the ES and represents the computing intensity of model in resolution . The size of inference results is commonly negligible compared to that of the offloading frame, so we ignore the delay of inference results transmitting back to the intelligent visual device from the ES. Since the video frames need to be analyzed in real-time, each frame should be processed within a time duration , which can be formulated as
(2) |
where is the length of a time slot. Besides, the detection accuracy of video analysis system can be mathematically expressed as
(3) |
where represents the number of blocks, indicates whether the tracking mode is selected ( = 1 signifies the mode is selected, otherwise = 0). Similarly, and indicate whether the detection model configuration and offloading resolution are selected.
Note that different tracking modes have different processing times. The Skip mode has almost no latency, but its detection accuracy degrades quickly when the semantic information changes. The KCF mode is faster but less accurate than the CSRT mode. In addition, different configurations of detection models and offloading resolutions have different precision and inference delays. Adopting an advanced object detection model and a high offloading resolution yields better accuracy but leads to longer inference and transmission delays, which damages the real-time performance. Meanwhile, offloading the entire frame can improve accuracy, because ROI extraction may miss some key information, but it also incurs a higher transmission delay, which decreases the real-time video analysis performance.
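To make the timing trade-off concrete, the following sketch computes a completion time for the two processing paths and checks it against a deadline. It assumes that the local tracking time equals the computing intensity multiplied by the frame size and divided by the device computing power, and that the offloading path sums pre-detection, ROI extraction, transmission and edge inference times. The function names, argument units and the exact deadline form are assumptions for illustration, since the precise expressions are given by (1) and (2).

```python
def local_processing_time(intensity: float, frame_size: float,
                          device_power: float) -> float:
    """Assumed local tracking time: intensity * data size / computing power."""
    return intensity * frame_size / device_power


def offloading_completion_time(predetect_s: float, roi_extract_s: float,
                               transmit_s: float, edge_infer_s: float) -> float:
    """Assumed offloading time: the sum of the four sequential stages.

    roi_extract_s is zero when the whole frame is offloaded without ROI
    extraction.
    """
    return predetect_s + roi_extract_s + transmit_s + edge_infer_s


def returns_in_time(completion_s: float, deadline_s: float) -> bool:
    """A frame counts as processed in time if it finishes by the deadline."""
    return completion_s <= deadline_s


# Hypothetical numbers: full-frame offloading under a 0.1 s deadline.
t_offload = offloading_completion_time(0.01, 0.0, 0.02, 0.03)
ok = returns_in_time(t_offload, deadline_s=0.1)
```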
II-B Problem Formulation
Since the quality of the proposed video analysis system highly depends on the frame processing rate (i.e., the frame number of detection results returned in time) and detection accuracy, in this work, we aim to maximize the system utility, which is a weighted sum of these two factors. Thus, the optimization problem of video analysis can be mathematically expressed as
(4) | |||
(5) | |||
(6) | |||
(7) | |||
(8) | |||
(9) | |||
(10) | |||
(11) | |||
(12) |
where parameter 0 is used to depict the weight of detection accuracy, and denotes whether frame is processed successfully in the time slot . Specifically, if , it means that frame is not processed completely within and will be abandoned, , and must be established to make constraint (6) true. If , means that the detection accuracy of frame is lower than a threshold thus is not successfully processed, while means that frame is really successfully processed. Constraint (10) means a tracking mode is adopted when . Otherwise, the configurations of models and resolutions will be chosen for offloading, represented by Constraint (11) and Constraint (12).
Observe that the formulated optimization problem is a mixed-integer nonlinear programming problem, which is NP-hard [8]. To address this problem, we partition it into two sub-problems, i.e., deciding the optimal offloading strategy for the frame, and selecting the object detection model and offloading resolution configuration for each block. In the next section, a real-time DDQN-CMAB reinforcement learning framework is adopted to connect a DDQN-based offloading approach and a CMAB-based configuration selection approach to generate the optimal strategy for real-time video analysis.
III DDQN-CMAB Reinforcement Learning Strategy
III-A DDQN-based Offloading Decision
In this subsection, we reformulate the first sub-problem, the offloading decision problem, into a Markov decision process (MDP), then propose a DDQN-based offloading scheme to dynamically generate the appropriate strategy to determine , and tracking mode .
For each time slot , the offloading strategy derived by TAODM only depends on the states consisting of hash similarity, bandwidth, tracking complexity and continuous tracking time of the last time slot , which means that the state transition satisfies the Markov property[9][10]. Thus, the offloading decision problem can be formulated as an MDP and defined by a tuple , which represents the state space, action space and reward space, respectively.
1) State: For TAODM, its state in time slot can be represented as , where is the hash similarity between and the last processed frame , indicates the bandwidth at time slot , stands for tracking complexity that is the number of objects detected by the advanced DNN model in the last offloading frame and is the number of continuous time slots of processing frames with the same tracking mode.
2) Action: At time slot , the action of TAODM is frame processing mode selection. The action space is formulated as the set of five possible decisions {Skip, KCF, CSRT, Offload-full, Offload-ROI}, where Offload-full indicates offloading the full frame and Offload-ROI indicates offloading ROI blocks. In pursuit of solution efficiency, we merge the ROI decision into TAODM.
3) Reward: The reward for TAODM is defined as the weighted sum of actual accuracy and completion time , which can be expressed as , where is the actual accuracy of frame given by environment, which is used to replace the sum of and , and is the parameter that controls the weight to balance completion time and detection accuracy.
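The hash similarity in the state can be illustrated with a short sketch. The specific hash function is not stated here, so the example assumes a simple average hash over a small grayscale grid, with similarity defined as the fraction of matching hash bits. Both choices, and the function names, are assumptions for illustration only.

```python
from typing import List, Sequence


def average_hash(gray: Sequence[Sequence[float]]) -> List[int]:
    """Return one bit per pixel: 1 if the pixel is brighter than the mean."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]


def hash_similarity(frame_a: Sequence[Sequence[float]],
                    frame_b: Sequence[Sequence[float]]) -> float:
    """Fraction of matching hash bits, between 0.0 and 1.0."""
    hash_a, hash_b = average_hash(frame_a), average_hash(frame_b)
    if len(hash_a) != len(hash_b):
        raise ValueError("frames must have the same number of pixels")
    matches = sum(a == b for a, b in zip(hash_a, hash_b))
    return matches / len(hash_a)


# Tiny 2x2 grayscale grids standing in for downsampled frames.
current = [[0, 255], [255, 0]]
inverted = [[255, 0], [0, 255]]
```

A similarity close to 1 suggests that the current frame carries little new semantic information, which is the situation where reusing or tracking the last detection result is attractive.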
For TAODM’s MDP, the deep Q network (DQN) could be applied to train the agent for discrete action space, which leverages the deep neural network to analyze states and actions to optimize the value and has a low computation time. In the DQN training process, TAODM first observes the state from the environment and selects the best action for maximizing the action-value function, and can be expressed as , where is the network parameters matrix and is updated by a back propagation training process, whose loss function is , where is the target value for time slot . However, the DQN-based algorithm may cause a large deviation in its model due to overestimating the value of Q-target, which indicates the quality of a strategy. To avoid such overestimation, we present a DDQN-based algorithm, which decouples the action selection using DQN and evaluation of Q-target using target network. In the DDQN-based algorithm, the value of Q-target is calculated by , where is the target network parameters matrix and is updated periodically.
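The difference between the DQN and DDQN targets can be sketched in a few lines. The example follows the standard double Q-learning rule, in which the online network selects the next action and the target network evaluates it. The discount factor `gamma` and the list-based Q-value representation are assumptions for illustration.

```python
from typing import Sequence


def dqn_target(reward: float, next_q_target: Sequence[float],
               gamma: float) -> float:
    """Standard DQN target: the target network both selects and evaluates."""
    return reward + gamma * max(next_q_target)


def ddqn_target(reward: float, next_q_online: Sequence[float],
                next_q_target: Sequence[float], gamma: float) -> float:
    """Double DQN target: online network selects, target network evaluates."""
    best_action = max(range(len(next_q_online)),
                      key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]


# Hypothetical Q-values for three actions in the next state.
online_q = [1.0, 3.0, 2.0]
target_q = [5.0, 0.5, 4.0]
y_dqn = dqn_target(1.0, target_q, gamma=0.9)
y_ddqn = ddqn_target(1.0, online_q, target_q, gamma=0.9)
```

In this toy case the DQN target follows the largest target-network value, whereas the DDQN target uses the value of the action preferred by the online network, which is how the overestimation is damped.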
An overview of the DDQN-based offloading approach is given in Algorithm 1. The hyperparameters, network parameters and replay buffer are initialized first. Then, in each training step , TAODM observes the state from the environment and selects the action using the ε-greedy method. Next, TAODM executes the action and calculates the reward . A replay buffer is adopted to store the tuple (, , ) so that the Q-network can be learned better. Finally, TAODM calculates the Q-value and loss to update the network parameters matrix , while the target network parameters matrix is updated only every steps.
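The action selection and experience storage steps can be sketched as follows. This is a minimal illustration rather than Algorithm 1 itself: the transition layout with an explicit next state, the buffer capacity and the helper names are assumptions.

```python
import random
from collections import deque
from typing import Any, List, Sequence, Tuple

ACTIONS = ("Skip", "KCF", "CSRT", "Offload-full", "Offload-ROI")


def epsilon_greedy(q_values: Sequence[float], epsilon: float,
                   rng: random.Random) -> int:
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


class ReplayBuffer:
    """Fixed-size buffer; the oldest transitions are dropped when full."""

    def __init__(self, capacity: int) -> None:
        self._buffer: deque = deque(maxlen=capacity)

    def push(self, state: Any, action: int, reward: float,
             next_state: Any) -> None:
        self._buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int, rng: random.Random) -> List[Tuple]:
        """Sample without replacement, capped at the current buffer size."""
        size = min(batch_size, len(self._buffer))
        return rng.sample(list(self._buffer), size)

    def __len__(self) -> int:
        return len(self._buffer)


def should_sync_target(step: int, period: int) -> bool:
    """Copy online weights to the target network every `period` steps."""
    return step % period == 0


rng = random.Random(0)
buffer = ReplayBuffer(capacity=2)
for step in range(3):
    action = epsilon_greedy([0.1, 0.9, 0.3, 0.0, 0.2], epsilon=0.0, rng=rng)
    buffer.push(state=step, action=action, reward=1.0, next_state=step + 1)
```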
III-B CMAB-based Adaptive Configurations Selection
In this subsection, we reformulate the second sub-problem, adaptive configurations selection, into a contextual MAB problem and employ several multi-armed bandit models to dynamically decide the configurations , where represents the detection model and represents the offloading resolution for each block .
Based on whether the information density of block and bandwidth at time slot are greater than the average information density and average bandwidth , we maintain MABs for four different contexts: i) high information density with high bandwidth, ii) low information density with high bandwidth, iii) high information density with low bandwidth, and iv) low information density with low bandwidth. The motivation behind these four contexts is that if the information density is lower than the average information density , an inferior DNN model is more likely to achieve performance similar to that of the advanced one, because only a few large objects appear in the block. However, in high information density blocks, multiple small targets tend to appear, and utilizing the advanced model will obtain better accuracy. Similarly, if the bandwidth is lower than the average bandwidth , a high resolution yields high accuracy but may cause processing failure resulting from an unacceptable transmission delay.
We use the exponential moving averages method to update the average information density and average bandwidth as follows , , where stand for the weight factors for the most recent information density and bandwidth observation , respectively. The above equations give higher weights to the latest information density and bandwidth , allowing the model to quickly respond to recent changes.
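The running averages and the resulting context can be sketched as follows. The example assumes the common exponential moving average form, in which the new average is a convex combination of the old average and the latest observation. The function names and the "high"/"low" labels are illustrative assumptions.

```python
from typing import Tuple


def ema_update(average: float, observation: float, weight: float) -> float:
    """Exponential moving average; `weight` is applied to the newest value."""
    return (1.0 - weight) * average + weight * observation


def select_context(density: float, bandwidth: float,
                   avg_density: float, avg_bandwidth: float) -> Tuple[str, str]:
    """Map an observation to one of the four (density, bandwidth) contexts."""
    density_level = "high" if density > avg_density else "low"
    bandwidth_level = "high" if bandwidth > avg_bandwidth else "low"
    return density_level, bandwidth_level


# Hypothetical observations: 5 objects in the block, 1 MB/s of bandwidth.
context = select_context(5.0, 1.0, avg_density=3.0, avg_bandwidth=2.0)
new_avg_density = ema_update(3.0, 5.0, weight=0.5)
```

A larger weight makes the averages follow recent observations more closely, at the cost of noisier context boundaries.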
Considering that the configurations selection strategy should differ across contexts, we maintain four independent MAB models denoted as , , and to decide the configurations of detection models and offloading resolutions. We use to denote the context set and use to denote the configurations set. For each context and configuration , we define the MAB reward as , where is the same reward as that in TAODM and is the contextual indicator function. Thus, each MAB model receives the reward only for its own decisions, which allows independent training. Besides, the reward estimate is updated as follows , where is the decay parameter. Thereby, each reward estimate is updated by the corresponding reward . For all MAB models, an ε-greedy based method is used to make decisions when training the MAB models, as follows
(13) |
where ε is the probability of making a random choice. Since precise average estimates of the information density and bandwidth are already available after training, we set ε = 0 so that only exploitation is used in the testing process.
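The per-context bandits can be illustrated with a minimal sketch. It assumes that each arm is a (model, resolution) pair and that the reward estimate is updated as a decayed average of the old estimate and the new reward. The model names, resolutions, decay value and class name are hypothetical placeholders, not the configurations used in the simulations.

```python
import random
from itertools import product
from typing import Dict, Hashable, List, Optional, Sequence


class EpsilonGreedyBandit:
    """One independent multi-armed bandit with decayed reward estimates."""

    def __init__(self, arms: Sequence[Hashable], decay: float,
                 rng: Optional[random.Random] = None) -> None:
        self.arms: List[Hashable] = list(arms)
        self.decay = decay
        self.estimates: Dict[Hashable, float] = {arm: 0.0 for arm in self.arms}
        self.rng = rng or random.Random()

    def choose(self, epsilon: float) -> Hashable:
        """Explore with probability epsilon, otherwise exploit."""
        if self.rng.random() < epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=self.estimates.__getitem__)

    def update(self, arm: Hashable, reward: float) -> None:
        """Assumed update rule: blend the old estimate with the new reward."""
        old = self.estimates[arm]
        self.estimates[arm] = (1.0 - self.decay) * old + self.decay * reward


MODELS = ("model-small", "model-large")  # hypothetical detection models
RESOLUTIONS = (360, 720)                 # hypothetical offloading resolutions
ARMS = list(product(MODELS, RESOLUTIONS))
CONTEXTS = [("high", "high"), ("low", "high"), ("high", "low"), ("low", "low")]

# One bandit per context, so only the active context learns from a reward.
bandits = {ctx: EpsilonGreedyBandit(ARMS, decay=0.1, rng=random.Random(0))
           for ctx in CONTEXTS}

active = bandits[("high", "low")]
active.update(("model-small", 360), reward=1.0)
best = active.choose(epsilon=0.0)  # pure exploitation, as in testing
```

Keeping a separate table of estimates for each context is what lets the same arm be rated differently under low and high bandwidth.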
III-C Real-time Tracking-assisted Video Analysis Framework
![Refer to caption](x2.png)
TAODM and ROIM are highly coupled: ROIM is used to decide the configurations for each block only when TAODM decides to offload the frame to the ES. Based on these insights, we propose a double-layer DDQN-CMAB reinforcement learning (DCRL) framework to jointly train the DDQN-based offloading agent and the CMAB-based configurations selection agent.
As shown in Fig. 2, the proposed DCRL framework consists of two layers: i) in the upper layer, we assign TAODM as the first controller, which observes the hash similarity , bandwidth , tracking complexity and continuous tracking time from the environment, then makes the offloading decision. If TAODM decides to process the frame locally, the frame results will be obtained with a tracking mode . Otherwise, the frame will be sent to the lower layer; ii) in the lower layer, ROIM observes the information density and bandwidth , and determines the configuration of detection models and offloading resolution for each block.
In each training step of DCRL, we first offload the entire frame of to initialize the last detection results for tracking in the future. Then, at each time slot , TAODM determines by DDQN. If the Skip mode is adopted, the last detection results are directly reused as the detection results of . If the KCF or CSRT mode is adopted, TAODM executes the tracking algorithm corresponding to to get detection results. Otherwise, the frame is processed by ROIM, where CMAB is performed to obtain the configuration of each block, and all blocks are then offloaded to the ES. The MAB reward is calculated to update the estimates once the results of all blocks are received. Meanwhile, TAODM receives the detection results, and the rewards of DDQN are calculated for updating its DNN parameters matrix and target Q-network parameters matrix . The detailed steps of the DCRL training framework are summarized in Algorithm 2.
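The per-slot control flow of the two layers can be sketched as a simple dispatch. This is an illustration of the branching described above, not Algorithm 2: the `track` and `offload` callables stand in for the tracker and for ROIM plus the ES, and their signatures are assumptions.

```python
from typing import Any, Callable

TRACKING_MODES = ("KCF", "CSRT")
OFFLOAD_MODES = ("Offload-full", "Offload-ROI")


def process_slot(action: str, frame: Any, last_result: Any,
                 track: Callable[[str, Any, Any], Any],
                 offload: Callable[[Any, bool], Any]) -> Any:
    """Route one frame according to the action chosen by the upper layer."""
    if action == "Skip":
        # Reuse the previous detection result without any computation.
        return last_result
    if action in TRACKING_MODES:
        # Local tracking initialized from the last detection result.
        return track(action, frame, last_result)
    if action in OFFLOAD_MODES:
        # Lower layer: configuration selection and detection on the ES.
        return offload(frame, action == "Offload-ROI")
    raise ValueError(f"unknown action: {action}")


# Stub callables that only echo their inputs, for demonstration.
def demo_track(mode: str, frame: Any, last_result: Any) -> Any:
    return (mode, last_result)


def demo_offload(frame: Any, use_roi: bool) -> Any:
    return ("detected", use_roi)


result = process_slot("Offload-ROI", "frame-1", "prev", demo_track, demo_offload)
```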
IV Simulation Results
![Refer to caption](extracted/5439478/pic/cum_r.png)
![Refer to caption](extracted/5439478/pic/Mean_Average_Precision.png)
![Refer to caption](extracted/5439478/pic/Processing_Rate.png)
![Refer to caption](extracted/5439478/pic/Average_Lantency.png)
In this section, simulations are conducted to evaluate the performance of the proposed DCRL on an open dataset, i.e., Garden2 of the Multi-camera Pedestrian Video Dataset[5]. To ensure fairness, 80% of each video is used for training and the remaining 20% for testing. Furthermore, let MB/Pixel, , , , , and .
For comparison purposes, in addition to the proposed DCRL, the following schemes are also simulated as benchmarks. RAND-RAND (R-R): TAODM and ROIM randomly determine the offloading decision and the configurations of detection model and offloading resolution. RAND-CMAB (R-C): TAODM randomly determines the offloading decision, while ROIM determines the configurations of detection model and offloading resolution using the CMAB-based adaptive configurations selection algorithm. DDQN-RAND (D-R): ROIM randomly determines the configurations of detection model and offloading resolution, while TAODM determines the offloading decision using the DDQN-based offloading algorithm. FULL-HIGH (F-B): TAODM only adopts the Offload-full decision and ROIM adopts the highest offloading resolution with the most accurate model configuration.
Fig. 3(a) shows the cumulative rewards over time. It can be observed that the proposed DDQN-CMAB (D-C) outperforms the other benchmarks, as it filters the repeated spatial-temporal semantic information to ensure that the results from the ES return in time while obtaining high accuracy. In addition, F-B obtains the worst performance because offloading full frames generates a large transmission delay under fluctuating network conditions.
Fig. 3(b) illustrates the mean average precision (mAP) of the different methods, where D-C obtains the highest accuracy except for F-B. Fig. 3(c) compares the processing rate of the different methods, and it can be seen that D-C achieves the best processing rate. Specifically, DCRL improves the processing rate by up to 66.3% compared to F-B. Comparing Fig. 3(b) with Fig. 3(c), it can be observed that i) F-B realizes the best mAP performance but the lowest processing rate because of large transmission delays, and ii) R-H achieves higher mAP with a higher processing rate since the repeated spatial-temporal semantic information is filtered, which illustrates the effectiveness of ROI.
Fig. 3(d) compares the average latency of the different methods, and D-C achieves the lowest latency. Exploring the reasons, we find that TAODM adopts Offload-ROI to minimize the adverse effects of fluctuating network conditions. When network conditions are stressful, ROIM can effectively reduce the data volume to ensure that critical information is offloaded to the ES for accurate detection. Besides, dramatic network fluctuations are often temporary, and a local tracking algorithm is used to bridge such periods and maintain accuracy when network conditions are poor.
V Conclusion
In this paper, a DCRL framework integrating a DDQN-based optimal offloading decision approach and a CMAB-based adaptive configuration selection approach is proposed to address the challenge of balancing the frame processing rate and accuracy of edge-based real-time video analysis systems for intelligent visual devices. By decomposing the optimization problem of video analysis into two sub-problems and generating an optimal strategy for the offloading mode and the configurations of detection model and resolution, the proposed framework can effectively solve the overall optimization problem. Simulation results show that our approach outperforms counterparts in terms of ensuring a high processing rate with high detection accuracy.
References
- [1] J. Chen, C. Yi et al., “Networking architecture and key supporting technologies for human digital twin in personalized healthcare: A comprehensive survey,” IEEE Commun. Surv. Tutor., pp. 1–1, 2023.
- [2] H. Liu and G. Cao, “Deep learning video analytics through online learning based edge computing,” IEEE Trans. Wirel. Commun., vol. 21, no. 10, pp. 8193–8204, 2022.
- [3] Y. Shi, C. Yi, R. Wang et al., “Service migration or task rerouting: A two-timescale online resource optimization for MEC,” IEEE Trans. Wirel. Commun., pp. 1–1, 2023.
- [4] L. Dong, Z. Yang et al., “WAVE: Edge-device cooperated real-time object detection for open-air applications,” IEEE Trans. Mob. Comput., pp. 1–1, 2022.
- [5] Y. Xu, X. Liu et al., “Cross-view people tracking by scene-centered spatio-temporal parsing,” in Proc. AAAI, vol. 31, no. 1, 2017.
- [6] J. F. Henriques, R. Caseiro et al., “High-speed tracking with kernelized correlation filters,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 583–596, 2015.
- [7] A. Lukežič, T. Vojíř et al., “Discriminative correlation filter with channel and spatial reliability,” in Proc. IEEE CVPR, 2017, pp. 4847–4856.
- [8] K. Zhao, Z. Zhou et al., “EdgeAdaptor: Online configuration adaption, model selection and resource provisioning for edge DNN inference serving at scale,” IEEE Trans. Mob. Comput., pp. 1–16, 2022.
- [9] J. Chen, C. Yi et al., “Learning aided joint sensor activation and mobile charging vehicle scheduling for energy-efficient WRSN-based industrial IoT,” IEEE Trans. Veh. Technol., vol. 72, no. 4, pp. 5064–5078, 2023.
- [10] R. Chen, C. Yi, K. Zhu et al., “A three-party hierarchical game for physical layer security aware wireless communications with dynamic trilateral coalitions,” IEEE Trans. Wirel. Commun., pp. 1–1, 2023.