License: arXiv.org perpetual non-exclusive license
arXiv:2402.18927v1 [cs.CV] 29 Feb 2024

Edge Computing Enabled Real-Time Video Analysis via Adaptive Spatial-Temporal Semantic Filtering

Xiang Chen$^{\ast}$, Wenjie Zhu$^{\ast}$, Jiayuan Chen$^{\ast}$, Tong Zhang$^{\ast}$, Changyan Yi$^{\ast}$ and Jun Cai$^{\dagger}$
$^{\ast}$College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
$^{\dagger}$Department of Electrical and Computer Engineering, Concordia University, Montréal, QC, H3G 1M8, Canada
Email: {chenxiang, wenjie.zhu, jiayuan.chen, zhangt, changyan.yi}@nuaa.edu.cn,  [email protected]
Abstract

This paper proposes a novel edge computing enabled real-time video analysis system for intelligent visual devices. The proposed system consists of a tracking-assisted object detection module (TAODM) and a region of interest module (ROIM). TAODM adaptively decides whether to process each video frame locally with a tracking algorithm or to offload it to the edge server for inference by an object detection model. ROIM determines each offloaded frame's resolution and detection model configuration to ensure that the analysis results return in time. TAODM and ROIM jointly filter repetitive spatial-temporal semantic information to maximize the processing rate while ensuring high video analysis accuracy. Unlike most existing works, this paper investigates real-time video analysis systems in which the intelligent visual device connects to the edge server through a wireless network with fluctuating conditions. We decompose the real-time video analysis problem into offloading decision and configuration selection sub-problems. To solve these two sub-problems, we introduce a double deep Q network (DDQN) based offloading approach and a contextual multi-armed bandit (CMAB) based adaptive configuration selection approach, respectively. A DDQN-CMAB reinforcement learning (DCRL) training framework is further developed to integrate these two approaches and improve the overall video analysis performance. Extensive simulations are conducted to evaluate the performance of the proposed solution and demonstrate its superiority over counterparts.

I Introduction

With the rapid growth of the Internet of Things (IoT), intelligent visual devices (e.g., smartphones, autonomous vehicles, and unmanned aerial vehicles) have encouraged the emergence of far-reaching innovative mobile applications, ranging from autonomous systems to extended reality[1]. These devices capture their surroundings and generate rapidly growing amounts of video data that need to be processed in real time. For example, self-driving applications leverage intelligent cameras to accurately detect lanes, cars, and pedestrians to avoid collisions. However, processing such computation-intensive and delay-sensitive video analysis tasks locally is challenging due to the limited resources of intelligent visual devices[2]. Although an alternative is to offload tasks to powerful cloud servers, doing so introduces unacceptable delays because cloud servers are typically deployed far away from the devices.

Fortunately, the paradigm of edge computing[3], which deploys powerful edge servers at the edge of networks, provides pervasive, reliable and fast-responsive computing services for the devices, which makes offloading video analysis tasks to edge servers a promising approach. However, due to the limited uplink bandwidth and volatility of the cellular network, it is difficult to offload all video frames, which are large, to edge servers in real time[4]. Besides, the vast amount of repetitive spatial-temporal semantic information among video frames generates large transmission delays and severely impacts real-time analysis performance. Moreover, due to the huge amount of video traffic, users have to pay high data traffic costs. These factors motivate us to design an adaptive spatial-temporal semantic filtering based video analysis offloading strategy with low bandwidth cost and high detection accuracy under volatile network conditions.

Nevertheless, designing such a strategy is challenging for the following reasons: i) real-time video generates massive traffic per unit of time, so the visual device cannot transmit all the images to the edge server (ES) in time, and ii) the constantly growing number of intelligent visual devices with frequent video analysis task requests would quickly drain the resources of edge servers. In addition, computing power differs considerably among intelligent visual devices, so the performance of the same algorithm varies widely.

Aiming to solve the above problems, we study a long-term semantic-filtering offloading optimization problem to maximize the processing rate, i.e., the number of frames of detection results returned in time, and the detection accuracy under the time-varying network conditions. It is observed that this problem involves two coupled intricate sub-problems, an offloading decision problem and a configurations selection problem. On this basis, we propose a double deep Q network (DDQN) based offloading decision scheme and a contextual multi-armed bandit (CMAB) based adaptive configurations selection scheme for these two sub-problems, respectively. To obtain the optimal solution, we further propose a two-layer training framework, namely DDQN-CMAB reinforcement learning (DCRL) framework, for jointly solving these two sub-problems effectively. The main contributions are summarized as follows.

  • A novel edge computing enabled real-time video analysis system based on spatial-temporal semantic filtering is proposed, which realizes real-time object detection with high accuracy under volatile and fluctuating network conditions.

  • A DCRL framework is designed for solving the overall semantic-filtering offloading optimization problem under unpredictable network conditions.

  • Simulations are conducted to examine the performance of the proposed DCRL framework on the multi-camera pedestrian video dataset[5].

The rest of this paper is organized as follows: Section II presents the system model and problem formulation. In Section III, two specific approaches are proposed to solve the decomposed sub-problems, and a DCRL framework is proposed to integrate these two approaches. Simulation results are given in Section IV, followed by conclusions in Section V.

II System Model and Problem Formulation

II-A System Model

Figure 1: Overview of proposed real-time video analysis system.

Consider a collaborative edge-device real-time video analysis system, as illustrated in Fig. 1. It consists of a tracking-assisted object detection module (TAODM) and a region of interest module (ROIM), both deployed on a mobile intelligent visual device that processes a road-condition video stream, and a dedicated ES at the base station that hosts various advanced object detection models. A frame-based time-slotted operation framework is used to capture the time-varying uncertainties of the video analysis system, where $t\in\{1,2,\dots,T\}$ denotes the index of time slots and each time slot represents a decision interval. In the proposed system, an intelligent visual device with computing power $f^{device}$ connects to an ES with computing power $f^{edge}$ through the cellular network with bandwidth $b_t$. The device captures the real-time video stream and generates one frame in each time slot by sampling. This frame, denoted as $f_t$, is assumed to be generated at the beginning of time slot $t$ and is processed by TAODM. TAODM adaptively determines whether to process the frame locally with a tracking algorithm or to offload it to the ES. In the latter case, the offloaded frame is further processed by ROIM, which determines its resolution and detection model configuration according to the information density and bandwidth so that the analysis results return in time. Since the tracking algorithm requires the detection results from the ES to initialize tracking targets, TAODM and ROIM are highly coupled. They interact jointly to maximize the processing rate while ensuring high video analysis accuracy.

Specifically, three decisions are made by TAODM: i) the offloading decision $\alpha_t\in\{0,1\}$, which indicates whether frame $f_t$ is processed locally ($\alpha_t=0$) or offloaded to the ES ($\alpha_t=1$); ii) the ROI decision $\beta_t\in\{0,1\}$, which indicates whether ROI extraction is adopted; and iii) the tracking mode $k\in\mathcal{K}$ to be adopted, where $\mathcal{K}=\{\textit{Skip}, \textit{KCF}, \textit{CSRT}\}$. The Skip mode directly reuses the detection result of the last frame if the current frame contains little new semantic information compared to the last one, while KCF[6] and CSRT[7] are well-known trackers that work well on resource-limited IoT devices. When $\alpha_t=0$, the frame is processed by TAODM with tracking mode $k\in\mathcal{K}$, and we use $acc(k)$ to represent the processing accuracy of tracking mode $k$. Additionally, the processing time of tracking mode $k$ can be calculated as $l_t^{pro,loc}=u^{k}d_t/f^{device}$, where $u^{k}$ indicates the computing intensity of tracking mode $k$ and $d_t$ represents the data size of frame $f_t$. Note that $l_t^{pro,loc}=0$ when $k=\textit{Skip}$. When $\alpha_t=1$, the ROI extraction decision $\beta_t\in\{0,1\}$ is made according to the bandwidth during time slot $t$, and then the frame $f_t$ is further processed by ROIM.
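As a minimal illustration of the local processing time $l_t^{pro,loc}=u^{k}d_t/f^{device}$, the following Python sketch evaluates the delay of each tracking mode. The intensity values in `TRACKING_INTENSITY`, their unit (CPU cycles per byte), and the function name are assumptions made for illustration only; the paper does not report them.

```python
# Hypothetical computing intensities u^k in CPU cycles per byte.
# These are placeholder values, not measurements from the paper.
TRACKING_INTENSITY = {"skip": 0.0, "kcf": 50.0, "csrt": 200.0}


def local_tracking_delay(mode: str, frame_bytes: float, f_device: float) -> float:
    """Return l_t^{pro,loc} = u^k * d_t / f^device in seconds.

    mode        -- tracking mode k, one of "skip", "kcf", "csrt"
    frame_bytes -- data size d_t of the frame, in bytes
    f_device    -- device computing power f^device, in cycles per second
    """
    if f_device <= 0:
        raise ValueError("f_device must be positive")
    return TRACKING_INTENSITY[mode] * frame_bytes / f_device


if __name__ == "__main__":
    for k in TRACKING_INTENSITY:
        print(k, local_tracking_delay(k, frame_bytes=1_000_000, f_device=1e9))
```

With a zero intensity, the Skip mode yields zero delay, which matches the note that $l_t^{pro,loc}=0$ when $k=\textit{Skip}$.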

In ROIM, the frame $f_t$ first goes through a rough pre-detection by a local lightweight deep neural network (DNN) to obtain the bounding boxes and their number, which indicate the information density $n_t$, with inference time $l_t^{dnn,loc}$. Then, if $\beta_t=1$, all bounding boxes are merged into several ROI blocks with extraction time $l_t^{roi,loc}=u^{roi}d_t/f^{device}$, where $u^{roi}$ represents the computing intensity of ROI extraction. After that, only those ROI blocks are transmitted to the edge server for further processing. When $\beta_t=0$, the whole frame is transmitted instead. Considering that the inference request of a frame can be served by different detection models, we use the set $\mathcal{M}=\{Yolov5x, Yolov5l, Yolov5m\}$ to denote the detection models held on the ES. YOLOv5 is a widely used object detection algorithm that provides models of various sizes to suit different scenarios. Besides, we use a set $\mathcal{R}=\{640p, 480p, 320p\}$ to denote the offloading resolutions used to accelerate the model inference and reduce the transmission delay. Thus, before transmitting the frame to the edge server, ROIM makes two decisions according to the information density $n_{t,i}$ and bandwidth $b_t$ to guarantee real-time performance, i.e., i) the model $m\in\mathcal{M}$ selected for object detection in the edge server and ii) the offloading resolution $r\in\mathcal{R}$ for each frame or block. Then, based on the offloading resolution $r\in\mathcal{R}$, the frame or all blocks are resized to the offloading resolution, and the size of the offloaded data is obtained as $d_t=\sum_{i=1}^{s_t}\tau b(r_{t,i})$, where $s_t$ represents the number of blocks, $\tau$ is the data size of one pixel, and $b(r_{t,i})$ is the number of pixels of block $i$ at resolution $r$. For simplicity, transmitting a whole frame is treated as transmitting a single block. Finally, the frame or all ROI blocks, together with the corresponding model configuration, are transmitted with delay $l_t^{trans}=d_t/b_t$ and then detected by the advanced detection model.
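The offloaded data size $d_t=\sum_{i=1}^{s_t}\tau b(r_{t,i})$ and the transmission delay $l_t^{trans}=d_t/b_t$ can be sketched in Python as follows. This is only an illustration: the representation of each resized block as a (width, height) pair, the default of 3 bytes per pixel for $\tau$, and the function names are assumptions that do not come from the paper.

```python
from typing import Iterable, Tuple


def offload_size_bytes(block_shapes: Iterable[Tuple[int, int]],
                       bytes_per_pixel: float = 3.0) -> float:
    """Return d_t = sum_i tau * b(r_{t,i}).

    block_shapes    -- (width, height) of every block after resizing;
                       a whole frame is passed as a single block
    bytes_per_pixel -- tau, the data size of one pixel (assumed 3 bytes)
    """
    return sum(bytes_per_pixel * w * h for w, h in block_shapes)


def transmission_delay(data_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Return l_t^{trans} = d_t / b_t in seconds."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return data_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    roi_blocks = [(320, 240), (160, 120)]   # two hypothetical ROI blocks
    whole_frame = [(640, 480)]              # the whole frame as one block
    for name, blocks in (("ROI", roi_blocks), ("frame", whole_frame)):
        d_t = offload_size_bytes(blocks)
        print(name, d_t, transmission_delay(d_t, 1_000_000))
```

In this toy setting the ROI blocks carry fewer pixels than the whole frame, so their transmission delay is shorter for the same bandwidth.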

Based on the above formulation, the completion time of frame $f_t$ can be expressed as

$$l^{total}_t=\alpha_t\left(\beta_t l_t^{roi,loc}+l_t^{dnn,loc}+l_t^{trans}+l_t^{pro,edge}\right)+(1-\alpha_t)\,l_t^{pro,loc}, \tag{1}$$

where $l_t^{pro,edge}=u^{m,r}d_t/f^{edge}$ is the inference delay at the ES and $u^{m,r}$ represents the computing intensity of model $m$ at resolution $r$. The size of inference results is commonly negligible compared to that of the offloaded frame, so we ignore the delay of transmitting the inference results back to the intelligent visual device from the ES. Since the video frames need to be analyzed in real time, each frame should be processed within a time duration $l^{max}$, which can be formulated as

$$l^{total}_t\leq l^{max}, \tag{2}$$

where $l^{max}$ is the length of a time slot. Besides, the detection accuracy of the video analysis system, $acc_t$, can be mathematically expressed as

$$acc_t=\frac{\alpha_t}{s_t}\sum_{i=1}^{s_t}\sum_{m\in\mathcal{M}}\sum_{r\in\mathcal{R}}y_{t,i}^{m}z_{t,i}^{r}\,acc(m,r)+(1-\alpha_t)\sum_{k\in\mathcal{K}}x_t^{k}\,acc(k), \tag{3}$$

where $s_t$ represents the number of blocks, and $x_t^{k}\in\{0,1\}$ indicates whether the tracking mode $k$ is selected ($x_t^{k}=1$ signifies that mode $k$ is selected, otherwise $x_t^{k}=0$). Similarly, $y_{t,i}^{m}\in\{0,1\}$ and $z_{t,i}^{r}\in\{0,1\}$ indicate whether the detection model configuration $m$ and offloading resolution $r$ are selected.
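Because the indicators $x_t^{k}$, $y_{t,i}^{m}$ and $z_{t,i}^{r}$ select one option each, Eq. (3) reduces to an average of $acc(m,r)$ over the offloaded blocks when $\alpha_t=1$ and to $acc(k)$ when $\alpha_t=0$. The Python sketch below computes this under that reading. The lookup tables and their values are hypothetical placeholders, not accuracies reported in the paper.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical accuracy tables; values are placeholders for illustration.
ACC_DETECT: Dict[Tuple[str, str], float] = {
    ("yolov5x", "640p"): 0.90,
    ("yolov5m", "320p"): 0.60,
}
ACC_TRACK: Dict[str, float] = {"skip": 0.50, "kcf": 0.65, "csrt": 0.75}


def frame_accuracy(alpha: int,
                   block_configs: Optional[List[Tuple[str, str]]] = None,
                   tracking_mode: Optional[str] = None) -> float:
    """Evaluate acc_t of Eq. (3) for one frame.

    alpha         -- offloading decision (1: offload, 0: track locally)
    block_configs -- chosen (model, resolution) per block when alpha == 1
    tracking_mode -- chosen tracking mode k when alpha == 0
    """
    if alpha == 1:
        if not block_configs:
            raise ValueError("offloading needs at least one block")
        # (1 / s_t) * sum over blocks of acc(m, r) for the selected pair
        return sum(ACC_DETECT[c] for c in block_configs) / len(block_configs)
    if tracking_mode is None:
        raise ValueError("local processing needs a tracking mode")
    return ACC_TRACK[tracking_mode]


if __name__ == "__main__":
    print(frame_accuracy(1, [("yolov5x", "640p"), ("yolov5m", "320p")]))
    print(frame_accuracy(0, tracking_mode="kcf"))
```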

Note that different tracking modes have different processing times. The Skip mode has almost no latency, but its detection accuracy degrades quickly when the semantic information changes. The KCF mode is faster but less accurate than the CSRT mode. In addition, different configurations of detection models and offloading resolutions have different accuracy and inference delays. Adopting a more advanced object detection model and a higher offloading resolution yields better accuracy but leads to longer inference and transmission delays, which harms real-time performance. Meanwhile, offloading the entire frame can improve accuracy, because ROI extraction may miss some key information, but it also incurs a higher transmission delay, degrading the real-time video analysis performance.
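These trade-offs enter the model through the completion time in Eq. (1) and the deadline in Eq. (2). A minimal Python sketch of both is given below. The `SlotDelays` container, its field names, and the numeric delays in the example are assumptions introduced for illustration; only the formula itself follows the paper.

```python
from dataclasses import dataclass


@dataclass
class SlotDelays:
    """Per-slot delay components in seconds (hypothetical container)."""
    roi_extraction: float   # l_t^{roi,loc}
    pre_detection: float    # l_t^{dnn,loc}
    transmission: float     # l_t^{trans}
    edge_inference: float   # l_t^{pro,edge}
    local_tracking: float   # l_t^{pro,loc}


def completion_time(alpha: int, beta: int, d: SlotDelays) -> float:
    """Return l_t^{total} as defined in Eq. (1)."""
    offload = (beta * d.roi_extraction + d.pre_detection
               + d.transmission + d.edge_inference)
    return alpha * offload + (1 - alpha) * d.local_tracking


def meets_deadline(total: float, l_max: float) -> bool:
    """Check the real-time constraint of Eq. (2): l_t^{total} <= l^{max}."""
    return total <= l_max


if __name__ == "__main__":
    delays = SlotDelays(0.005, 0.010, 0.030, 0.020, 0.008)  # placeholder values
    for alpha, beta in ((1, 1), (1, 0), (0, 0)):
        total = completion_time(alpha, beta, delays)
        print(alpha, beta, total, meets_deadline(total, l_max=0.062))
```

Note that the sketch keeps the delay components fixed across the three cases for brevity; in the model, choosing $\beta_t$ also changes $d_t$ and therefore the transmission and inference delays.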

II-B Problem Formulation

Since the quality of the proposed video analysis system highly depends on the frame processing rate (i.e., the frame number of detection results returned in time) and detection accuracy, in this work, we aim to maximize the system utility, which is a weighted sum of these two factors. Thus, the optimization problem of video analysis can be mathematically expressed as

$$\max_{\{\alpha_t,\beta_t,x_t^{k},y_{t,i}^{m},z_{t,i}^{r}\}}\ \frac{1}{T}\sum_{t=1}^{T}q_t+\eta\,\frac{\sum_{t=1}^{T}(q_t\,acc_t)}{\sum_{t=1}^{T}q_t}, \tag{4}$$
$$\text{s.t.}\quad \alpha_t,\beta_t\in\{0,1\}, \tag{5}$$
$$(l^{total}_t-l^{max})\,q_t\leq 0,\quad q_t\in\{0,1\}, \tag{6}$$
$$\sum_{k\in\mathcal{K}}x^{k}_t\leq 1,\quad x^{k}_t\in\{0,1\},\ k\in\mathcal{K}, \tag{7}$$
$$\sum_{m\in\mathcal{M}}y^{m}_{t,i}\leq 1,\quad y^{m}_{t,i}\in\{0,1\},\ m\in\mathcal{M}, \tag{8}$$
$$\sum_{r\in\mathcal{R}}z^{r}_{t,i}\leq 1,\quad z^{r}_{t,i}\in\{0,1\},\ r\in\mathcal{R}, \tag{9}$$
$$\sum_{k\in\mathcal{K}}x^{k}_t=\alpha_t,\quad k\in\mathcal{K}, \tag{10}$$
$$\sum_{k\in\mathcal{K}}x^{k}_t+\sum_{m\in\mathcal{M}}y^{m}_{t,i}=1,\quad k\in\mathcal{K},\ m\in\mathcal{M}, \tag{11}$$
$$\sum_{k\in\mathcal{K}}x^{k}_t+\sum_{r\in\mathcal{R}}z^{r}_{t,i}=1,\quad k\in\mathcal{K},\ r\in\mathcal{R}, \tag{12}$$

where the parameter $\eta > 0$ weights the detection accuracy, and $q_t \in \{0,1\}$ denotes whether frame $f_t$ is processed successfully in time slot $t$. Specifically, if $l_t^{total} > l^{max}$, frame $f_t$ is not processed completely within $l^{max}$ and is abandoned, so $q_t = 0$ must hold for constraint (6) to be satisfied. If $l_t^{total} \leq l^{max}$, $q_t = 0$ means that the detection accuracy of frame $f_t$ is lower than a threshold and the frame is thus not successfully processed, while $q_t = 1$ means that frame $f_t$ is successfully processed. Constraint (10) means that a tracking mode is adopted when $\alpha_t = 1$. Otherwise, the configurations of models and resolutions are chosen for offloading, as represented by Constraint (11) and Constraint (12).
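As a minimal illustration of the two cases that determine $q_t$, the following Python sketch encodes the deadline check and the accuracy check. The function name `success_indicator` and the argument `acc_threshold` are hypothetical: the text states that a threshold exists but does not give its value.

```python
def success_indicator(l_total: float, l_max: float,
                      acc: float, acc_threshold: float) -> int:
    """Return q_t: 1 if the frame counts as successfully processed, else 0.

    acc_threshold is a hypothetical parameter; its value is not given in the text.
    """
    if l_total > l_max:
        # Deadline missed: the frame is abandoned, so q_t must be 0.
        return 0
    # Deadline met: success additionally requires the detection accuracy
    # not to fall below the threshold.
    return 1 if acc >= acc_threshold else 0
```

A frame that returns late is marked unsuccessful regardless of its accuracy, while a frame that returns in time is successful only if its accuracy reaches the threshold.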

Observe that the formulated optimization problem is a mixed-integer nonlinear programming problem, which is NP-hard [8]. To address it, we partition the problem into two sub-problems: deciding the optimal offloading strategy for each frame, and selecting the object detection model and offloading resolution configuration for each block. In the next section, a real-time DDQN-CMAB reinforcement learning framework is adopted to connect a DDQN-based offloading approach and a CMAB-based configuration selection approach to generate the optimal strategy for real-time video analysis.

III DDQN-CMAB Reinforcement Learning Strategy

Algorithm 1 DDQN-based Offloading Algorithm
Output: The network parameters $\theta^{*}$
Initialize: Randomly initialize the parameters $\theta$; initialize the target Q-network parameters $\theta' = \theta$; reset the experience replay buffer $B^{d}$; set the maximum number of training steps $LOOP$ and $loop = 0$.
for $loop \leq LOOP$ do
    Obtain the initial observation state $s_0$ and the replay buffer $B^{d}$;
    for $t = 1$ to $T$ do
        Observe the state $s_t$;
        if random $\geq \epsilon^{ddqn}$ then
            $a_t \leftarrow \arg\max_{a \in A_t} Q(s_t, a; \theta_t)$;
        else
            Randomly select $a_t$ from $A_t$;
        Execute $a_t$ and receive reward $R$;
        Store tuple $(s_t, a_t, R)$ into $B^{d}$;
        Sample a random batch of tuples from $B^{d}$;
        Compute the value of Q-target;
        Update the current Q-network parameters $\theta_t$ according to the loss function;
        if $t \bmod C = 0$ then
            Update the target Q-network parameters $\theta'_t \leftarrow \theta_t$;
        $s_t \leftarrow s_{t+1}$, $t \leftarrow t + 1$;
    $loop \leftarrow loop + 1$

III-A DDQN-based Offloading Decision

In this subsection, we reformulate the first sub-problem, the offloading decision problem, into a Markov decision process (MDP), then propose a DDQN-based offloading scheme to dynamically generate the appropriate strategy to determine $\alpha_t$, $\beta_t$ and tracking mode $k$.

For each time slot $t$, the offloading strategy derived by TAODM depends only on the state, which consists of the hash similarity, bandwidth, tracking complexity and continuous tracking time of the last time slot $t-1$. The state transition therefore satisfies the Markov property [9][10]. Thus, the offloading decision problem can be formulated as an MDP defined by a tuple $\{S_t, A_t, R_t\}$, representing the state space, action space and reward space, respectively.

1) State: For TAODM, the state $s_t \in S_t$ in time slot $t$ is represented as $s_t = \{h_t, b_t, c_t, p_t\}$, where $h_t$ is the hash similarity between $f_t$ and the last processed frame $f_{t-1}$, $b_t$ is the bandwidth at time slot $t$, $c_t$ is the tracking complexity, i.e., the number of objects detected by the advanced DNN model in the last offloaded frame, and $p_t$ is the number of consecutive time slots in which frames have been processed with the same tracking mode.

2) Action: At time slot $t$, the action of TAODM is the selection of a frame processing mode. The action space is the set of five possible decisions $A_t = \{\textit{Skip, KCF, CSRT, Offload-Full, Offload-ROI}\}$, where Offload-Full indicates offloading the full frame and Offload-ROI indicates offloading ROI blocks. In pursuit of solution efficiency, we merge the ROI decision into TAODM.

3) Reward: The reward $R$ for TAODM is defined as the weighted sum of the actual accuracy $acc_t$ and a term based on the completion time $l_t^{total}$, expressed as $R = acc_t + \lambda \max((l^{max} - l_t^{total})/l^{max}, 0)$, where $acc_t$ is the actual accuracy of frame $f_t$ given by the environment, which replaces the sum of $acc(m,r)$ and $acc(k)$, and $\lambda > 0$ is the weight that balances completion time against detection accuracy.
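The reward expression above can be sketched in a few lines of Python. The function name `taodm_reward` and the numeric values in the usage note are illustrative assumptions; only the formula itself comes from the text.

```python
def taodm_reward(acc_t: float, l_total: float, l_max: float, lam: float) -> float:
    """Compute R = acc_t + lam * max((l_max - l_total) / l_max, 0)."""
    if l_max <= 0 or lam <= 0:
        raise ValueError("l_max and lam must be positive")
    # Normalized slack before the deadline; clipped at 0 for late frames,
    # so a late frame earns no time bonus but keeps its accuracy term.
    time_bonus = max((l_max - l_total) / l_max, 0.0)
    return acc_t + lam * time_bonus
```

For instance, with `acc_t = 0.8`, `l_max = 0.1`, and `lam = 0.5`, a frame finishing at half the deadline receives a time bonus of 0.5, whereas a frame finishing after the deadline is rewarded for accuracy only.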

For TAODM's MDP, a deep Q network (DQN) can be applied to train the agent over the discrete action space; it leverages a deep neural network to analyze states and actions to optimize the $Q$ value and has a low computation time. In the DQN training process, TAODM first observes the state $s_t \in \mathcal{S}_t$ from the environment and selects the best action $a^{*}_t \in \mathcal{A}_t$ that maximizes the action-value function, expressed as $a^{*}_t = \arg\max_{a_t} \mathbb{E}[r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta_t)]$, where $\theta_t$ is the network parameters matrix, updated by a back-propagation training process whose loss function is $L(\theta_t) = \mathbb{E}[(y_t - Q(s_t, a_t; \theta_t))^2]$, where $y_t$ is the target value for time slot $t$. However, the DQN-based algorithm may cause a large deviation in its model because it overestimates the value of the Q-target, which indicates the quality of a strategy. To avoid such overestimation, we present a DDQN-based algorithm, which decouples action selection, performed by the current Q-network, from evaluation of the Q-target, performed by the target network. In the DDQN-based algorithm, the value of the Q-target is calculated as $y_t = r_t + \gamma Q\left(s_{t+1}, \arg\max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta_t); \theta'_t\right)$, where $\theta'_t$ is the target network parameters matrix, which is updated periodically.
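To make the decoupling concrete, the sketch below computes the DDQN target from two lists of next-state action values: one from the current Q-network ($\theta_t$) and one from the target network ($\theta'_t$). The function names and the plain-list representation are assumptions made for illustration; a real implementation would obtain these values from neural network forward passes.

```python
from typing import Sequence


def ddqn_target(r_t: float, gamma: float,
                q_online_next: Sequence[float],
                q_target_next: Sequence[float]) -> float:
    """y_t = r_t + gamma * Q(s_{t+1}, argmax_a Q(s_{t+1}, a; theta); theta')."""
    if not q_online_next or len(q_online_next) != len(q_target_next):
        raise ValueError("both value lists must be non-empty and equally long")
    # Action selection uses the current (online) Q-network estimates ...
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ... while evaluation uses the target network's value for that action.
    return r_t + gamma * q_target_next[best_action]


def max_target(r_t: float, gamma: float, q_next: Sequence[float]) -> float:
    """Single-network target for comparison: selection and evaluation share q_next."""
    return r_t + gamma * max(q_next)
```

With `q_online_next = [1.0, 2.0, 0.5]` and `q_target_next = [3.0, 1.5, 0.2]`, the DDQN target evaluates action 1 (chosen by the online values) with the target network, giving a smaller target than taking the maximum of a single value list, which is the overestimation the text refers to.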

An overview of the DDQN-based offloading approach is given in Algorithm 1. First, the hyperparameters, network parameters and replay buffer are initialized. Second, in each training step $loop$, TAODM observes the state $s_t \in \mathcal{S}_t$ from the environment and selects the action using the $\epsilon^{ddqn}$-greedy method. Then, TAODM executes the action $a_t$ and calculates the reward $R$. Third, the tuple ($s_t$, $a_t$, $R$) is stored in a replay buffer $B^{d}$ to improve the learning of the Q-network. Finally, TAODM calculates the Q-value and loss to update the network parameters matrix $\theta_t$, and updates the target network parameters matrix $\theta_t'$ once every $C$ time slots.
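The action-selection and target-synchronization steps of Algorithm 1 can be sketched as follows. This is a simplified illustration, not the authors' implementation: the Q-values are passed in as a plain list, and the names `select_action` and `should_sync_target` are hypothetical.

```python
import random
from typing import Optional, Sequence

# The five processing modes listed in the action space of TAODM.
ACTIONS = ("Skip", "KCF", "CSRT", "Offload-Full", "Offload-ROI")


def select_action(q_values: Sequence[float], epsilon: float,
                  rng: Optional[random.Random] = None) -> int:
    """Epsilon-greedy choice following Algorithm 1: exploit if random >= epsilon."""
    rng = rng or random.Random()
    if rng.random() >= epsilon:
        # Exploit: index of the action with the largest estimated Q-value.
        return max(range(len(q_values)), key=lambda a: q_values[a])
    # Explore: uniformly random action index.
    return rng.randrange(len(q_values))


def should_sync_target(t: int, c: int) -> bool:
    """The target network copies the current parameters when t mod C = 0."""
    return t % c == 0
```

Setting `epsilon = 0` makes the choice purely greedy, while `epsilon = 1` makes it purely random, since `rng.random()` always lies in [0, 1).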

III-B CMAB-based Adaptive Configurations Selection

Algorithm 2 DDQN-CMAB RL Framework
Input: Pre-trained MAB models
Output: DDQN parameters $\theta$, MAB models
Initialize: maximum number of training steps $LOOP'$ and $loop' = 0$
for $loop' \leq LOOP'$ do
    Offload $f_0$ to initialize the last detection results;
    for $t = 1$ to $T$ do
        Get new frame $f_t$, bandwidth $b_t \sim N(\rho, \sigma)$, tracking complexity $c_t$, continuous tracking time $p_t$;
        Calculate hash similarity $h_t$;
        $a_t \leftarrow$ TAODM($h_t, b_t, c_t, p_t$);
        if $a_t = Skip$ then
            Use the last detection results as the results of $f_t$;
        else if $a_t = KCF$ or $a_t = CSRT$ then
            TAODM executes the tracking algorithm corresponding to $a_t$;
            Use the tracking results as the results of $f_t$;
        else
            Detect objects with the lightweight DNN model;
            if $a_t =$ Offload-ROI then
                Extract ROI blocks;
            for each block $i$ do
                Get information density $n_{t,i}$;
                $g_{t,i} \leftarrow$ ROIM($n_{t,i}, b_t$);
                Offload block $i$ with $g_{t,i}$ to the ES;
            Merge the detection results of the blocks;
            Calculate MAB reward $R^{e,g}\ \forall e, g$;
            Update gain estimates $Q^{e,g}\ \forall e, g$;
        Perform Algorithm 1 to update $\theta_t$ and $\theta_t'$;
        $t \leftarrow t + 1$;
    $loop' \leftarrow loop' + 1$;

In this subsection, we reformulate the second sub-problem, adaptive configurations selection, into a contextual MAB problem and employ several multi-armed bandit models to dynamically decide the configurations $g_{t,i} = \{y_{t,i}^{m}, z_{t,i}^{r}\}$, where $y^{m}_{t,i}$ represents the detection model and $z^{r}_{t,i}$ represents the offloading resolution for each block $p_{t,i}$.

Based on whether the information density $n_{t,i}$ of block $p_{t,i}$ and the bandwidth $b_t$ at time slot $t$ are greater than the average information density $E^{i}$ and the average bandwidth $E^{b}$, we maintain MABs for four different contexts: i) high information density with high bandwidth, ii) low information density with high bandwidth, iii) high information density with low bandwidth, and iv) low information density with low bandwidth. The motivation behind these four contexts is that if the information density $n_{t,i}$ is lower than the average information density $E^{i}$, an inferior DNN model is more likely to achieve performance similar to that of the advanced one, because only a few large objects appear in the block. In high information density blocks, however, multiple small targets tend to appear, and the advanced model obtains better accuracy. Similarly, if the bandwidth $b_t$ is lower than the average bandwidth $E^{b}$, a high resolution yields high accuracy but may cause processing failure resulting from an unacceptable transmission delay.
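The four-way context split described above amounts to two threshold comparisons. The sketch below is one possible encoding; the function name `classify_context` and the string labels are assumptions, and treating a value exactly equal to the average as "low" follows the "greater than" wording but is otherwise an arbitrary choice.

```python
from typing import Tuple


def classify_context(n_ti: float, b_t: float,
                     avg_density: float, avg_bandwidth: float) -> Tuple[str, str]:
    """Map a block's information density and the bandwidth to one of four contexts."""
    # "HI"/"LI": information density above / not above its running average.
    density = "HI" if n_ti > avg_density else "LI"
    # "HB"/"LB": bandwidth above / not above its running average.
    bandwidth = "HB" if b_t > avg_bandwidth else "LB"
    return density, bandwidth
```

For example, a dense block observed under poor network conditions maps to `("HI", "LB")`, the context in which an accurate model is desirable but a high resolution risks missing the deadline.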

We use the exponential moving average method to update the average information density $E^{i}$ and average bandwidth $E^{b}$ as follows: $E^{i} = \xi_1 n_{t,i} + (1 - \xi_1) E^{i}$, $E^{b} = \xi_2 b_t + (1 - \xi_2) E^{b}$, where $\xi_1, \xi_2 \in [0,1]$ stand for the weight factors for the most recent information density $n_{t,i}$ and bandwidth observation $b_t$, respectively. The above equations give higher weights to the latest information density $n_{t,i}$ and bandwidth $b_t$, allowing the model to quickly respond to recent changes.
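Both averages follow the same update rule, so a single helper suffices. The sketch below is illustrative; the name `ema_update` and the sample numbers are assumptions.

```python
def ema_update(avg: float, observation: float, xi: float) -> float:
    """Exponential moving average: E <- xi * observation + (1 - xi) * E."""
    if not 0.0 <= xi <= 1.0:
        raise ValueError("xi must lie in [0, 1]")
    return xi * observation + (1.0 - xi) * avg
```

Starting from an average bandwidth of 10 and observing 20 with `xi = 0.25` moves the average to 12.5. The closer `xi` is to 1, the more strongly the average tracks the latest observation.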

Considering that the configurations selection strategy should differ across contexts, we maintain four independent MAB models, denoted as $MAB^{HI}_{HB}$, $MAB^{LI}_{HB}$, $MAB^{HI}_{LB}$ and $MAB^{LI}_{LB}$, to decide the configurations of detection models and offloading resolutions. We use $\mathcal{E} = \{\mathcal{HI}, \mathcal{LI}\} \times \{\mathcal{HB}, \mathcal{LB}\}$ to denote the context set and $\mathcal{G} = \mathcal{M} \times \mathcal{R}$ to denote the configurations set. For each context $e \in \mathcal{E}$ and configuration $g \in \mathcal{G}$, we define the MAB reward $R^{e,g}$ as $R^{e,g} = R \cdot \mathbbm{1}(e_t = e \wedge g_t = g)$, where $R$ is the same reward as that in TAODM and $\mathbbm{1}(e_t = e \wedge g_t = g)$ is the contextual indicator function. Thus, each MAB model receives the reward for its own decisions, which allows independent training. Besides, the reward estimate $Q^{e,g}$ is updated as $Q^{e,g} = Q^{e,g} + \varphi\left(R^{e,g} - Q^{e,g}\right), \forall e \in \mathcal{E}, \forall g \in \mathcal{G}$, where $\varphi$ is the decay parameter. Thereby, each reward estimate $Q^{e,g}$ is updated by the corresponding reward $R^{e,g}$. For all MAB models, an $\epsilon$-greedy method is used to make decisions while training the MAB models, as follows

$$g_{t,i} = \begin{cases} \arg\max_{g \in \mathcal{G}} Q^{e,g}, & 1 - \epsilon^{cmab} \\ \text{random choice}, & \epsilon^{cmab} \end{cases}, \quad \forall e \in \mathcal{E}, \qquad (13)$$

where $\epsilon^{cmab}$ is the probability of making a random choice. Since precise average estimates of the information density and bandwidth are already available after training, we set $\epsilon^{cmab} = 0$ to use only exploitation in the testing process.
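A compact sketch of the contextual bandit may help tie together the reward estimate update and the $\epsilon$-greedy rule in (13). Several points are assumptions rather than details from the text: the class name `ContextualBandit`, the zero initialization of the estimates, the example model names and resolutions, and the choice to update only the entry for the observed pair $(e_t, g_t)$, which is one common reading of the indicator-based reward.

```python
import random
from itertools import product
from typing import Hashable, Iterable, Optional

# Context set E = {HI, LI} x {HB, LB}, as defined in the text.
CONTEXTS = list(product(("HI", "LI"), ("HB", "LB")))
# Hypothetical configuration set G = M x R (model names and resolutions are made up).
EXAMPLE_CONFIGS = list(product(("light", "advanced"), (360, 720)))


class ContextualBandit:
    """One table of reward estimates Q[e][g] per context e (assumed to start at 0)."""

    def __init__(self, contexts: Iterable[Hashable], configs: Iterable[Hashable],
                 phi: float, epsilon: float, rng: Optional[random.Random] = None):
        self.configs = list(configs)
        self.phi = phi            # step size, called the decay parameter in the text
        self.epsilon = epsilon    # probability of a random choice
        self.rng = rng or random.Random()
        self.q = {e: {g: 0.0 for g in self.configs} for e in contexts}

    def select(self, context):
        """Epsilon-greedy choice of a configuration within the given context."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.configs)
        estimates = self.q[context]
        return max(self.configs, key=lambda g: estimates[g])

    def update(self, context, config, reward: float) -> None:
        """Q <- Q + phi * (R - Q), applied to the observed (context, config) pair only."""
        q = self.q[context][config]
        self.q[context][config] = q + self.phi * (reward - q)
```

With `epsilon = 0`, as in the testing process, `select` always returns the configuration with the highest estimate for the current context, and the estimates of the other three contexts are unaffected by updates to one context.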

III-C Real-time Tracking-assisted Video Analysis Framework

Figure 2: The proposed DCRL framework.

TAODM and ROIM are highly coupled: ROIM is used to decide the configuration of each block only when TAODM decides to offload the frame to the ES. Based on this insight, we propose a double-layer DDQN-CMAB reinforcement learning (DCRL) framework to jointly train the DDQN-based offloading agent and the CMAB-based configurations selection agent.

As shown in Fig. 2, the proposed DCRL framework consists of two layers: i) in the upper layer, we assign TAODM as the first controller, which observes the hash similarity $h_t$, bandwidth $b_t$, tracking complexity $c_t$ and continuous tracking time $p_t$ from the environment, then makes the offloading decision. If TAODM decides to process the frame locally, the frame results will be obtained with a tracking mode $k$. Otherwise, the frame will be sent to the lower layer; ii) in the lower layer, ROIM observes the information density $n_{t,i}$ and bandwidth $b_t$, and determines the configuration $g_{t,i}$ of detection model $m$ and offloading resolution $r$ for each block.

In each training step of DCRL, we first offload the entire frame $f_0$ to initialize the last detection results for subsequent tracking. Then, at each time slot $t$, TAODM determines $a_t$ by DDQN. If the Skip mode is adopted, the last detection results are used directly as the detection results of $f_t$. If the KCF or CSRT mode is adopted, TAODM executes the tracking algorithm corresponding to $a_t$ to get the detection results. Otherwise, the frame is processed by ROIM, where CMAB is performed to obtain the configuration $g_{t,i}$ of each block, and all blocks are then offloaded to the ES. Once the results of all blocks are received, the MAB reward $R^{e,g}$ is calculated to update the estimates $Q^{e,g}$. Meanwhile, TAODM receives the detection results, and the reward of DDQN is calculated to update its DNN parameters matrix $\theta_t$ and target Q-network parameters matrix $\theta'_t$. The detailed steps of the DCRL training framework are summarized in Algorithm 2.
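The per-slot control flow of the two layers can be sketched as a small dispatcher. This is a structural illustration only: the tracker, the configuration chooser and the offloading call are injected as hypothetical callables (`track`, `choose_config`, `offload`), and representing the Offload-Full case as a single block covering the whole frame is an assumption.

```python
from typing import Callable, List, Sequence

TRACKING_ACTIONS = ("KCF", "CSRT")
OFFLOAD_ACTIONS = ("Offload-Full", "Offload-ROI")


def dcrl_step(action: str,
              last_results: List,
              track: Callable[[str], List],
              blocks: Sequence,
              choose_config: Callable[[object], object],
              offload: Callable[[object, object], List]) -> List:
    """Return the detection results for the current frame given TAODM's action."""
    if action == "Skip":
        # Upper layer: reuse the previous detection results unchanged.
        return list(last_results)
    if action in TRACKING_ACTIONS:
        # Upper layer: run the selected tracker locally.
        return track(action)
    if action in OFFLOAD_ACTIONS:
        # Lower layer: pick a configuration per block, offload, then merge.
        merged: List = []
        for block in blocks:
            config = choose_config(block)
            merged.extend(offload(block, config))
        return merged
    raise ValueError(f"unknown action: {action}")
```

In a training loop, the merged results would then be scored to produce the reward that updates both the bandit estimates and the DDQN parameters.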

IV Simulation Results

[Figure panels: (a) Performance comparison on cumulative rewards. (b) Performance comparison on mAP. (c) Performance comparison on processing rate. (d) Performance comparison on average latency.]

In this section, simulations are conducted to evaluate the performance of the proposed DCRL on an open dataset, i.e., Garden2 of the Multi-camera Pedestrian Video Dataset [5]. To ensure fairness, 80% of each video is used for training and the remaining 20% for testing. Furthermore, let $\tau \in \{5.6e^{-4}, 6.25e^{-4}, 6.5e^{-4}\}$ MB/Pixel, $\varepsilon^{ddqn} = 0.3$, $\varepsilon^{cmab} = 0.3$, $\rho = 10$, $\sigma = 5$, and $f^{edge} : f^{device} = 2 : 1$.
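The 80/20 split mentioned above could be realized as in the sketch below. It assumes a chronological split in which the leading 80% of each video's frames form the training set; the text does not state how the split is drawn, and the function name `split_video` is hypothetical.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

def split_video(frames: Sequence[T],
                train_fraction: float = 0.8) -> Tuple[List[T], List[T]]:
    """Split one video's frames into a leading training part and a trailing test part."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    cut = int(len(frames) * train_fraction)  # index of the first test frame
    return list(frames[:cut]), list(frames[cut:])
```

A chronological cut keeps temporally adjacent, near-duplicate frames from appearing on both sides of the split, which matters for a method that exploits temporal redundancy.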

For comparison purposes, in addition to the proposed DCRL, the following schemes are also simulated as benchmarks.
  • RAND-RAND (R-R): TAODM randomly determines offloading, and ROIM randomly determines the configurations of detection model and offloading resolution.
  • RAND-CMAB (R-C): TAODM randomly determines offloading, while ROIM determines the configurations of detection model and offloading resolution using the CMAB-based adaptive configurations selection algorithm.
  • DDQN-RAND (D-R): TAODM determines offloading using the DDQN-based offloading algorithm, while ROIM randomly determines the configurations of detection model and offloading resolution.
  • FULL-HIGH (F-B): TAODM only adopts the Offload-full decision, and ROIM adopts the highest offloading resolution with the most accurate model configuration.
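The compared schemes differ only in which policy drives each module, which can be summarized in a small lookup table. The sketch below is illustrative: the policy identifiers and the helper `learned_modules` are hypothetical names, while the scheme labels follow the text.

```python
from typing import Dict, List, Tuple

# Scheme label -> (offloading policy of TAODM, configuration policy of ROIM).
SCHEMES: Dict[str, Tuple[str, str]] = {
    "D-C": ("ddqn", "cmab"),      # proposed DCRL
    "R-R": ("random", "random"),
    "R-C": ("random", "cmab"),
    "D-R": ("ddqn", "random"),
    "F-B": ("always_offload_full", "highest_resolution_most_accurate_model"),
}

LEARNED = {"ddqn", "cmab"}

def learned_modules(label: str) -> List[str]:
    """Return the modules that use a learned policy under the given scheme."""
    offloading, configuration = SCHEMES[label]
    modules = []
    if offloading in LEARNED:
        modules.append("TAODM")
    if configuration in LEARNED:
        modules.append("ROIM")
    return modules
```

Viewed this way, R-C and D-R act as ablations that each replace one learned module with a random one, while F-B uses no learning at all.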

Fig. 2(a) shows the cumulative rewards over time. It can be observed that the proposed DDQN-CMAB (D-C) outperforms all benchmarks, as it filters the repeated spatial-temporal semantic information to ensure that the results from the ES return in time while obtaining high accuracy. In addition, F-B obtains the worst performance because offloading full frames incurs a large transmission delay under fluctuating network conditions.

Fig. 2(b) illustrates the mean average precision (mAP) of different methods. We can see that D-C obtains the highest accuracy except for F-B. Fig. 2(c) compares the processing rate of different methods, and D-C achieves the best processing rate among them. Specifically, DCRL improves the processing rate by up to 66.3% compared to F-B. Comparing Fig. 2(b) with Fig. 2(c), it can be observed that i) F-B realizes the best mAP performance but the lowest processing rate due to large transmission delays, and ii) R-H achieves higher mAP with a higher processing rate since the repeated spatial-temporal semantic information is filtered, which illustrates the effectiveness of ROI.

Fig. 2(d) compares the average latency of different methods, and D-C achieves the lowest latency among them. The reason is that TAODM adopts offload-ROI to minimize the adverse effects of fluctuating network conditions. When network conditions are poor, ROIM can effectively reduce the data volume while ensuring that critical information is offloaded to the ES for accurate detection. Besides, dramatic network fluctuations are often temporary, so a local tracking algorithm is used to bridge such periods and improve accuracy while network conditions are poor.
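The effect of reducing data volume on latency can be made concrete with a simple transmission-delay estimate. The model below is an assumption inferred from the units of $\tau$ (MB/Pixel): the delay is taken as pixels times $\tau$ divided by bandwidth, with bandwidth in MB/s. The function name and all numeric values in the usage lines are hypothetical and are not taken from the simulations.

```python
def transmission_delay(width: int, height: int,
                       tau_mb_per_pixel: float,
                       bandwidth_mb_per_s: float) -> float:
    """Estimated seconds needed to upload a width x height region."""
    if bandwidth_mb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    data_mb = width * height * tau_mb_per_pixel  # data volume in MB
    return data_mb / bandwidth_mb_per_s

# Hypothetical values, for illustration only.
tau = 1e-3        # MB per pixel
bandwidth = 50.0  # MB per second
full_frame = transmission_delay(1920, 1080, tau, bandwidth)
roi_only = transmission_delay(640, 360, tau, bandwidth)
speedup = full_frame / roi_only  # equals the pixel-count ratio of the two regions
```

Under this model the delay scales linearly with the number of offloaded pixels and inversely with bandwidth, so shrinking the offloaded region or its resolution directly offsets a drop in bandwidth.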

V Conclusion

In this paper, a DCRL framework integrating a DDQN-based optimal offloading decision approach and a CMAB-based adaptive configuration selection approach is proposed to balance the frame processing rate and the accuracy of edge-based real-time video analysis systems for intelligent visual devices. By decomposing the optimization problem of video analysis into two sub-problems and generating an optimal strategy for the offloading mode and for the selection of detection model configuration and resolution, the proposed framework can effectively solve the overall optimization problem. Simulation results show that our approach outperforms counterparts in ensuring a high processing rate with high detection accuracy.

References

  • [1] J. Chen, C. Yi et al., “Networking architecture and key supporting technologies for human digital twin in personalized healthcare: A comprehensive survey,” IEEE Commun. Surv. Tutor., pp. 1–1, 2023.
  • [2] H. Liu and G. Cao, “Deep learning video analytics through online learning based edge computing,” IEEE Trans. Wirel. Commun., vol. 21, no. 10, pp. 8193–8204, 2022.
  • [3] Y. Shi, C. Yi, R. Wang et al., “Service migration or task rerouting: A two-timescale online resource optimization for MEC,” IEEE Trans. Wirel. Commun., pp. 1–1, 2023.
  • [4] L. Dong, Z. Yang et al., “WAVE: Edge-device cooperated real-time object detection for open-air applications,” IEEE Trans. Mob. Comput., pp. 1–1, 2022.
  • [5] Y. Xu, X. Liu et al., “Cross-view people tracking by scene-centered spatio-temporal parsing,” in Proc. AAAI, vol. 31, no. 1, 2017.
  • [6] J. F. Henriques, R. Caseiro et al., “High-speed tracking with kernelized correlation filters,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 583–596, 2015.
  • [7] A. Lukežič, T. Vojíř et al., “Discriminative correlation filter with channel and spatial reliability,” in Proc. IEEE CVPR, 2017, pp. 4847–4856.
  • [8] K. Zhao, Z. Zhou et al., “EdgeAdaptor: Online configuration adaption, model selection and resource provisioning for edge DNN inference serving at scale,” IEEE Trans. Mob. Comput., pp. 1–16, 2022.
  • [9] J. Chen, C. Yi et al., “Learning aided joint sensor activation and mobile charging vehicle scheduling for energy-efficient WRSN-based industrial IoT,” IEEE Trans. Veh. Technol., vol. 72, no. 4, pp. 5064–5078, 2023.
  • [10] R. Chen, C. Yi, K. Zhu et al., “A three-party hierarchical game for physical layer security aware wireless communications with dynamic trilateral coalitions,” IEEE Trans. Wirel. Commun., pp. 1–1, 2023.