Online Resource Allocation for Edge Intelligence with Colocated Model Retraining and Inference

Huaiguang Cai, Zhi Zhou, Qianyi Huang School of Computer Science and Engineering, Sun Yat-Sen University, China Email: [email protected], [email protected], [email protected]
Abstract

With edge intelligence, AI models are increasingly pushed to the edge to serve ubiquitous users. However, due to the drift of model, data, and task, AI models deployed at the edge suffer from degraded accuracy in the inference serving phase. Model retraining handles such drifts by periodically retraining the model with newly arrived data. When colocating model retraining and model inference serving for the same model on resource-limited edge servers, a fundamental challenge arises in balancing the resource allocation for model retraining and inference, aiming to maximize long-term inference accuracy. This problem is particularly difficult because the underlying mathematical formulation is time-coupled, non-convex, and NP-hard. To address these challenges, we introduce a lightweight and explainable online approximation algorithm, named ORRIC, designed to optimize resource allocation for adaptively balancing the accuracy of model retraining and inference. The competitive ratio of ORRIC outperforms that of the traditional Inference-Only paradigm, especially when data drift persists for a sufficiently long time. This highlights the advantages and applicable scenarios of colocating model retraining and inference. Notably, ORRIC can be translated into several heuristic algorithms for different resource environments. Experiments conducted in real scenarios validate the effectiveness of ORRIC.

I Introduction

Edge intelligence, the marriage of AI and edge computing, promises to provide ubiquitous users with low-latency, energy-efficient, and privacy-protecting machine learning services by processing data in proximity[1]. However, various types of drift reduce the accuracy of machine learning models in practice, and the effect is even worse in edge scenarios where computing resources are limited. Specifically, we classify these drifts into three types: 1) model drift: the distribution of model parameters changes after deployment (e.g., model compression). 2) data drift: the distribution of features or labels shifts over time (e.g., domain adaptation, test-time adaptation [2]). 3) task drift: the model may be applied to perform unseen tasks (e.g., fine-tuning[3], embodied AI[4]).

Numerous methods have been proposed to alleviate these drifts, including retraining the deployed model [3, 2, 5, 6] or modifying inference results based on certain data distribution assumptions [7, 8]. However, these methods, mainly proposed by researchers from the machine learning community, tend to emphasize accuracy while overlooking resource consumption. Fortunately, in the field of edge computing, systems such as Ekya[9], RECL[10] and Shoggoth[11] have been proposed to handle drifts by navigating the trade-off between the tasks of model retraining and model inference under the constraints of limited edge resources. Here, we define the scheme of retraining the model and performing inference simultaneously on new data as the model retraining and inference co-location paradigm.

Nevertheless, the absence of formal modeling and the reliance on heuristic algorithms in these previous works limit our understanding of the model retraining and inference co-location paradigm. Modeling this paradigm not only provides insights into its advantages and application scenarios but also aids in designing more rational and explainable algorithms that may enjoy better performance and theoretical guarantees.

Intuitively, because edge resources are limited, the task of retraining the model on new data and the task of performing inference on new data compete with each other. If more resources are currently assigned to retraining, the current inference accuracy is lower and the future accuracy is higher; conversely, if fewer resources are currently assigned to retraining, the current inference accuracy is higher but the future accuracy is lower. A central question then arises:

How can resources be credibly allocated for model retraining and inference co-location to optimize long-term model performance under various drifts?

To answer this question, our work makes the following contributions:

  1. We provide a natural modeling of the model retraining and inference co-location paradigm and demonstrate a corresponding typical and practical system (Section III).

  2. We design a lightweight and explainable algorithm ORRIC (Section IV) for the paradigm. The proved competitive ratio of ORRIC is strictly better than that of the traditional Inference-Only paradigm when data drift occurs for a sufficiently long time, implying the advantages and application scenarios of the model retraining and inference co-location paradigm (Section V).

  3. Our experimental results of ORRIC on CIFAR-10-C validate the effectiveness of model retraining and inference co-location in drift scenarios (Section VI). Our code is available at https://github.com/caihuaiguang/ORRIC.

II Background and Related Works

We motivate our work with prior studies on 1) drifts in machine learning and 2) inference and retraining configuration adaptation in edge computing.

II-A Drift in Machine Learning

The basic process of machine learning is to collect a large amount of data for a task and then use the data to train a machine learning model. However, in practice, the model, data, and task may change after the deployment of the model. We categorize the inconsistency between the training phase and inference phase as model drift, data drift, and task drift.

II-A1 Model drift

DNN compression[12] is commonly adopted for lower latency and improved energy efficiency[1]. However, the distribution of model parameters changes [13] after compression, usually leading to a decrease in the accuracy of the model. We classify this inconsistency between the model training phase and the inference phase as model drift. Even when model performance on training data remains the same after compression, the compressed model has less generalization power on unseen data[14], necessitating model retraining[9].

II-A2 Data Drift

This type of drift represents a shift in the distribution of features or labels. Specifically, let $X$ denote the feature vector and $y$ denote the label. We use $P_t(X,y)$ to represent the joint probability density function of $X, y$ at time $t$. Concept drift [6] occurs when there exists a time $t$ such that $P_t(X,y) \neq P_{t+1}(X,y)$. Label shift[7] occurs when there exists a time $t$ such that $P_t(y) \neq P_{t+1}(y)$ but for all $t$, $P_t(X \mid y) = P_{t+1}(X \mid y)$. In a broader sense, we classify any shift in $P_t(X)$, $P_t(y)$, $P_t(X \mid y)$, $P_t(y \mid X)$, $P_t(X,y)$ (such as concept drift, label shift, domain adaptation, or test-time adaptation[2]) as data drift.
In real-world scenarios, the proportion of pedestrians, cars, and bicycles may vary throughout the day[9], corresponding to label shift ($P_t(y)$ shift). Another common phenomenon is that, due to variations in angles, weather conditions, lighting, and sensors[9], the appearance of the same class of objects ($P_t(X \mid y)$) may differ from that during training, while the true label of the object ($P_t(y)$) remains unchanged, corresponding to concept drift.
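To make the label-shift example concrete, the following minimal sketch estimates $P_t(y)$ and $P_{t+1}(y)$ from the labels observed in two consecutive time slots and measures how far apart they are. It is illustrative only and not part of ORRIC: the helper names, the class list, the toy label counts, and the choice of total variation distance as the measure are all assumptions introduced here, and each slot is assumed to contain at least one label.

```python
from collections import Counter

def label_distribution(labels, classes):
    """Empirical estimate of P_t(y) from the labels seen in one time slot.
    Assumes `labels` is non-empty."""
    counts = Counter(labels)
    total = len(labels)
    return {c: counts.get(c, 0) / total for c in classes}

def total_variation(p, q):
    """Total variation distance between two discrete distributions
    defined over the same set of classes."""
    return 0.5 * sum(abs(p[c] - q[c]) for c in p)

# Hypothetical label streams for two time slots (toy data, not measurements).
classes = ["pedestrian", "car", "bicycle"]
slot_t = ["car"] * 6 + ["pedestrian"] * 3 + ["bicycle"] * 1
slot_t1 = ["car"] * 3 + ["pedestrian"] * 5 + ["bicycle"] * 2

p_t = label_distribution(slot_t, classes)
p_t1 = label_distribution(slot_t1, classes)
tv = total_variation(p_t, p_t1)  # a nonzero value indicates P_t(y) != P_{t+1}(y)
```

A nonzero distance only signals that the label marginal changed; deciding whether the change is pure label shift would additionally require checking that $P_t(X \mid y)$ stayed the same.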

Several methods, such as the unbiased risk estimator [7] and the online ensemble algorithm [5], can alleviate the negative effect of data drift on model performance without retraining based on assumptions on data distribution, such as label shift [7] or gradual data evolution [8]. However, these methods are heavily dependent on the assumed type of drift and may not be universally applicable. As a result, retraining the model [2] remains a mainstream approach [5].

II-A3 Task drift

This drift encompasses changes in tasks during both the training and inference phases, including meta-learning[15], continual learning [2], transfer learning[5], and fine-tuning[3]. Additionally, embodied AI has gained considerable attention recently, necessitating models to learn through interactions with the real world[4]. All these studies expect the model to perform well on new tasks, and retraining the model is nearly the only viable approach.

Motivated by the extensive attention to drift problems in the machine learning research community and the prevalent use of model retraining, we seek to formulate the model performance under these drifts. Moreover, in contrast to prior works that concentrate solely on model accuracy, our approach considers both model accuracy and the computational cost of model retraining. This makes it more suitable for practical deployment in resource-constrained edge environments.

II-B Inference and Training Configuration Adaptation

Resources provisioned for edge computing are limited[9], motivating research on reducing resource consumption for model retraining or inference on edge while meeting basic accuracy requirements, known as model inference or retraining configuration adaptation.

Inference Configuration Adaptation refers to adapting the content of the inference request or the model used, such as the frame rate of the input video, the resolution of input images[16], the type of model[17], or the extent of early exiting in model inference[18]. This adaptation, influencing the corresponding output of the model, is typically determined by the available computing, storage, bandwidth resources, and the difficulty level of the input[16].

Training Configuration Adaptation refers to adapting the hyperparameters of the training, such as epochs, training data size[19], or the layers performing back-propagation[2]. This adaptation, influencing the model itself, is typically determined by available computing, storage, bandwidth resources, and the performance of the model being used[9].

Although some studies such as Ekya[9] and RECL[10] have explored the trade-off between the tasks of model retraining and model inference under the constraints of limited edge resources, they lack the formal modeling of the model retraining and inference co-location paradigm, and the employed algorithms are heuristic. Differing from these works, we aim to gain a deeper understanding of the paradigm and design a more rational and explainable algorithm by formally modeling the paradigm and proposing a theoretically guaranteed algorithm.

Remark: Existing research on model retraining and inference co-location typically deploys the model on the edge [9] or in the cloud[10]. However, we argue that with hardware upgrades [20] and technological advances [21], model retraining and inference co-location on devices holds promise for enhanced privacy protection, reduced bandwidth usage and personalized AI models. While some existing frameworks support on-device model training, such as MNN [22], nntrainer [23], TensorFlow Lite [24], PyTorch Mobile [25], their documentation is not comprehensive, and some lack regular maintenance. We call for further efforts in this direction.

III System Model and Problem Formulation for Model Retraining and Inference Co-location

In this section, we present the system model for model retraining and inference co-location on an edge server, and the problem formulation for dynamic resource allocation to maximize the long-term inference accuracy.

III-A System Overview


Figure 1: Model Retraining and Inference Co-location.

Following the pilot effort of Ekya[9], we adopt a general system architecture as illustrated in Fig. 1 for edge intelligence with colocated model training and inference. In this architecture, an edge server equipped with moderate CPU and GPU resources simultaneously performs model retraining and inference serving, with the new data stream (or inference requests) collected from a set of nearby device clients (e.g., surveillance cameras) running the same AI application (e.g., object detection). Since manual labeling for the online data stream is not feasible at the edge, the labels for the retraining are obtained from a “teacher model” — a highly accurate but expensive model (with deeper architecture and larger size). Since the inference latency of the teacher model cannot meet the stringent latency requirements of mission-critical edge AI applications such as safety surveillance, we only use it for labeling. Instead, for the actual inference serving and model retraining, a “student model” which is less accurate but more responsive and resource-efficient is adopted. Notably, this philosophy of supervising a small student model with a large teacher model has been widely applied in the community of computer vision.

To model the periodic behavior of model retraining, we assume that the system works in a time-slotted fashion. Each time slot $t \in \mathcal{T} \triangleq \{1, 2, \cdots, T\}$ denotes a “retraining window” that designates model retraining once on the newly collected data. Specifically, as shown in Fig. 1, at each time slot $t$, the group of device clients first (step 1) upload their inference requests to the edge server. Here we use $D_{(t)}$ to denote the amount of data uploaded at time slot $t$. Without loss of generality, we assume that $D_{(t)} \in [D_{\min}, D_{\max}]$ and $D_{\min} > 0$. After receiving the data, the scheduler (step 2) determines the configuration for model retraining and inference, based on the data amount ($D_{(t)}$) and the available computational resource at the edge server (denoted as $C_{(t)}$). Afterward, the student model will immediately (step 3) return predictions of all inference requests to the clients based on the inference configuration from the scheduler. Next, some uniformly random-chosen data according to the retraining configuration will be sent to the teacher model to (step 4) get the corresponding high-credit labels (or pseudo-labels). The student model (step 5) then updates its weights by retraining the model according to the pseudo-labels and the retraining configuration determined by the scheduler.
Then in the next time slot $t+1$, the student model can serve the inference requests with retrained model weights, thus improving accuracy.
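The five steps of one retraining window can be sketched as a single function. This is a schematic outline under stated assumptions, not the system's actual code: `ToyStudent`, `fixed_scheduler`, the `num_samples` key, and the parity-based teacher are hypothetical placeholders introduced only so the control flow can be run end to end.

```python
import random

class ToyStudent:
    """Hypothetical stand-in for the student model."""
    def __init__(self):
        self.samples_trained_on = 0

    def predict(self, x, infer_cfg):
        # Placeholder prediction; a real student would run a DNN here.
        return x % 2

    def retrain(self, labeled, retrain_cfg):
        # Placeholder update; a real student would run gradient steps here.
        self.samples_trained_on += len(labeled)

def run_time_slot(requests, capacity, scheduler, student, teacher, rng):
    """One retraining window, following steps 1-5 of the workflow."""
    # Step 1: `requests` holds the D_(t) samples uploaded by the clients.
    # Step 2: the scheduler picks both configurations from D_(t) and C_(t).
    retrain_cfg, infer_cfg = scheduler(len(requests), capacity)
    # Step 3: the student answers every request immediately.
    predictions = [student.predict(x, infer_cfg) for x in requests]
    # Step 4: a uniformly random subset is pseudo-labeled by the teacher.
    k = min(len(requests), retrain_cfg["num_samples"])
    sampled = rng.sample(requests, k)
    labeled = [(x, teacher(x)) for x in sampled]
    # Step 5: the student is retrained; the new weights serve slot t+1.
    student.retrain(labeled, retrain_cfg)
    return predictions

def fixed_scheduler(num_requests, capacity):
    # Hypothetical policy: always pseudo-label 3 samples, always use "fast".
    return {"num_samples": 3}, "fast"

student = ToyStudent()
preds = run_time_slot(list(range(10)), 100.0, fixed_scheduler, student,
                      teacher=lambda x: x % 2, rng=random.Random(0))
```

The point of the sketch is the ordering: inference on the current slot always uses the weights produced in earlier slots, so the benefit of step 5 is only visible from slot $t+1$ onward.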

III-B Resource Allocation Model

TABLE I: Notations.

| Notation | Description |
| --- | --- |
| $C_{(t)}$ | the available computational resource in time slot $t$. |
| $D_{(t)}$ | the amount of uploaded data at the beginning of $t$. |
| $A_i^T, C_i^T$ | the profit and resource consumption of the $i$-th retraining configuration. |
| $A_j^I, C_j^I$ | the profit and resource consumption of the $j$-th inference configuration. |
| $x_i(t)$ | binary variable indicating whether the $i$-th retraining configuration is chosen at time slot $t$. |
| $y_j(t)$ | binary variable indicating whether the $j$-th inference configuration is chosen at time slot $t$. |

When model retraining and inference serving are colocated at the edge server, they may compete for the limited computational resources such as CPU and GPU, especially when the data arrival $D_{(t)}$ bursts. Therefore, the resource allocation to model retraining and inference serving faces a fundamental tradeoff between the retrained model’s accuracy and the inference accuracy. Specifically, if we allocate more resources to model retraining to improve its accuracy, the accuracy of the current model inference would diminish due to reduced resource allocation. Conversely, if we move resources from model retraining to inference serving, the current inference accuracy would increase but the accuracy of subsequent inference may decrease due to the reduced accuracy of the retrained model.

The knob to navigate the tradeoff between the retrained model’s accuracy and the inference accuracy is the configuration of both model retraining and inference, which controls the resource-accuracy tradeoff of each. For model retraining, the configuration refers to the hyperparameters of the training, such as the number of epochs, training data size [19], or the layers performing back-propagation [2]. For these hyperparameters, a larger value results in higher accuracy, but at the cost of more resource demand. For model inference, the configuration includes hyperparameters such as the frame rate or resolution of the input video, the compressed variant of the model[17], or the early exit point of the branchy model[18].

III-B1 Retraining configuration adaptation

In each time slot $t$, the scheduler selects one retraining configuration from the set of feasible configurations $\mathcal{M} \triangleq \{1, 2, \cdots, M\}$. This selection is represented by binary variables $x_i(t) \in \{0,1\}$, where $x_i(t) = 1$ indicates the $i$-th retraining configuration is selected at time slot $t$. Formally, this can be expressed as:

$$x_i(t) \in \{0,1\}, \quad \forall i \in \mathcal{M},\ \forall t \in \mathcal{T}, \tag{1}$$
$$\sum_{i=1}^{M} x_i(t) = 1, \quad \forall t \in \mathcal{T}. \tag{2}$$

To characterize the resource-accuracy tradeoff of the $i$-th retraining configuration, we use $C_i^T$ to denote the resource demand (per data sample, measured by FLOPs or MACs) and $A_i^T$ to aid in modeling the tested accuracy. Given $D_{(t)}$ data samples at time slot $t$, the total resource demand of model retraining (including pseudo-labeling) is $D_{(t)} C_i^T$.

III-B2 Inference configuration adaptation

Similar to the retraining configuration, in each time slot $t$, the scheduler selects one inference configuration from the set $\mathcal{N} \triangleq \{1, 2, \cdots, N\}$. This selection is represented by binary variables $y_j(t) \in \{0,1\}$, where $y_j(t) = 1$ indicates that the $j$-th inference configuration is selected at time slot $t$. Formally, we have:

$$y_j(t) \in \{0,1\}, \quad \forall j \in \mathcal{N},\ \forall t \in \mathcal{T}, \tag{3}$$
$$\sum_{j=1}^{N} y_j(t) = 1, \quad \forall t \in \mathcal{T}. \tag{4}$$

The $j$-th inference configuration consumes $D_{(t)} C_j^I$ computational resources (measured by FLOPs or MACs). The profit, $D_{(t)} A_j^I$, is the corresponding $j$-th result of normalizing the model accuracy for all $N$ inference configurations using the maximum value as a reference. Both $D_{(t)} C_j^I$ and $D_{(t)} A_j^I$ are easy to calculate in practice. In our experiments, the MACs and accuracy of different inference configurations on the test dataset, whose distribution is the same as the training dataset, are used to represent $D_{(t)} C_j^I$ and $D_{(t)} A_j^I$.

Let $A_{\min}^T$ and $A_{\max}^T$ denote the minimum and maximum of the set $\{A_i^T \mid \forall i \in \mathcal{M}\}$. Similarly, let $A_{\min}^I$ and $A_{\max}^I$ represent the minimum and maximum of the set $\{A_j^I \mid \forall j \in \mathcal{N}\}$. Then $A_{\min}^T = 0$ and $A_{\min}^I > 0$, following the natural assumption that when computational resources are scarce, and for the sake of user satisfaction, model retraining is not necessary whereas model inference is.

III-B3 Computational resources constraint

We use $C_{(t)}$ to denote the available computational resource at time slot $t$. We suppose that there is at least one feasible solution to the following inequality, regardless of the value of $D_{(t)}$ and $C_{(t)}$:

$$D_{(t)} \sum_{i=1}^{M} C_i^T x_i(t) + D_{(t)} \sum_{j=1}^{N} C_j^I y_j(t) \leq C_{(t)},\ \forall t \in \mathcal{T}. \tag{5}$$
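Because constraints (1)-(4) select exactly one retraining and one inference configuration per slot, the feasible set of a single slot under constraint (5) can be enumerated directly. The sketch below is illustrative only: the function name and the per-sample costs are hypothetical values chosen for the example, and configurations are indexed from 0 rather than from 1 as in the text.

```python
from itertools import product

def feasible_pairs(D_t, C_t, C_T, C_I):
    """Return every (i, j) pair that satisfies constraint (5) in one slot.

    Picking a single i and a single j enforces constraints (1)-(4);
    the filter keeps pairs with D_t * (C_T[i] + C_I[j]) <= C_t.
    """
    return [(i, j)
            for i, j in product(range(len(C_T)), range(len(C_I)))
            if D_t * (C_T[i] + C_I[j]) <= C_t]

# Hypothetical per-sample costs; C_T[0] = 0 models the "no retraining" option.
C_T = [0.0, 2.0, 4.0]
C_I = [1.0, 3.0]
pairs = feasible_pairs(D_t=10, C_t=50.0, C_T=C_T, C_I=C_I)
```

With these toy numbers only the most expensive combination, index pair (2, 1), exceeds the budget, which illustrates how a burst in $D_{(t)}$ or a drop in $C_{(t)}$ shrinks the feasible set.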

III-C Long-term Accuracy Model

We model the basic model performance at time slot $t$ as $f\left(\frac{\sum_{\tau=1}^{t-1} D_{(\tau)} \sum_{i=1}^{M} x_i(\tau) A_i^T}{\sum_{\tau=1}^{t-1} D_{(\tau)}}\right)$. Our modeling of model performance under various drifts is based on two key observations: (1) Irrespective of the type of drift, model performance declines to a minimum gradually if the model is not retrained regularly. (2) The increase in model performance resulting from training exhibits a diminishing marginal effect[26].

To incorporate the first observation into our modeling, it is essential to explore the correlation between current testing data and the previous data used to retrain the model. However, precisely determining the relationship between previous data used for retraining and current testing data is usually challenging or compute-intensive. So, similar to the maximum entropy principle, we make the following assumption: every previously used retraining configuration has the same effect on current model performance. This is where the term $\frac{\sum_{\tau=1}^{t-1} D_{(\tau)} \sum_{i=1}^{M} x_i(\tau) A_i^T}{\sum_{\tau=1}^{t-1} D_{(\tau)}}$ comes from. Then, to align with the second observation, we introduce a function $f$ that maps the average learning extent of all past data to the current model performance, representing the average influence of the drifts on model performance over time. If there are no drifts, that is, the current test data follows the same distribution as the training dataset on which the model has been fully trained, then model retraining has a small influence on model performance and $f$ reduces to a constant function.
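As a small numerical illustration of the argument of $f$, the sketch below computes the data-weighted average learning extent for a short history of slots. The data amounts, profits, and chosen configurations are toy values, and `f_hypothetical` is an assumed stand-in chosen only because it is increasing and concave with a positive value at zero; the true $f$ is treated as unknown in this paper. The function assumes at least one past slot, i.e., $t \geq 2$.

```python
import math

def learning_extent(D, A_T, choices):
    """Data-weighted average of retraining profits over slots 1..t-1.

    D[k] is the data amount of slot k+1, and choices[k] is the 0-based
    index of the retraining configuration selected in that slot.
    """
    total_data = sum(D)
    weighted = sum(d * A_T[i] for d, i in zip(D, choices))
    return weighted / total_data

def f_hypothetical(x):
    # Assumed shape only: increasing, concave, and f(0) = 0.5 > 0.
    return 0.5 + 0.4 * (1.0 - math.exp(-3.0 * x))

D = [10, 20, 10]          # toy data amounts for three past slots
A_T = [0.0, 0.5, 1.0]     # toy retraining profits, A_T[0] = 0 is "no retraining"
choices = [2, 0, 1]       # configuration chosen in each past slot
extent = learning_extent(D, A_T, choices)
accuracy = f_hypothetical(extent)
```

Skipping retraining in the slot with the most data (the second slot here) pulls the average down the most, which is the time coupling that makes the allocation problem hard.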

In the study of learning curves[26], the expression of the function $f$ can take on various forms, such as power functions like $f(x) = c - a x^{-\alpha}$, exponential functions like $f(x) = \exp\left(a + \frac{b}{x} + c \log(x)\right)$, logarithmic functions like $f(x) = \log(a \log(x) + b)$, or even a weighted linear combination of these forms. Here, $x$ represents training time, number of iterations, or training dataset size, and $f(x)$ denotes accuracy on the validation set. For a more comprehensive understanding of the potential expressions of the function $f$, please refer to Figure 1 in [26]. In our study, rather than making assumptions about the exact expression of $f$, we identify that these expressions share a common property: they are concave and increasing. Consequently, we introduce the following assumption about $f$:
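For the power-law form, the shared property can be checked by differentiating twice. The parameter ranges $a > 0$ and $\alpha > 0$ are assumptions made here for the check; they are not stated explicitly in the text, and the computation holds for $x > 0$:

```latex
\begin{align*}
f(x)   &= c - a x^{-\alpha}, \\
f'(x)  &= a \alpha \, x^{-\alpha - 1} > 0, \\
f''(x) &= -a \alpha (\alpha + 1) \, x^{-\alpha - 2} < 0 .
\end{align*}
```

A positive first derivative gives monotonic growth, and a negative second derivative gives concavity, which matches the diminishing marginal effect of training noted in the second observation.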

Assumption 1.

The function $f(x)$ is increasing, concave, and continuously differentiable over the interval $[0,A^{T}_{\max}]$, and $f(0)>0$, but its analytical expression is unknown.

Other assumptions about the relationship between retraining configuration and model performance may be reasonable as well. For instance, if the current model performance is only related to past data within a time window (e.g., in-context learning), then model performance can be modeled as $f\left(\frac{\sum_{\tau=t-w}^{t-1}D_{(\tau)}\sum_{i=1}^{M}x_{i}(\tau)A_{i}^{T}}{\sum_{\tau=1}^{t-1}D_{(\tau)}}\right)$, where $w$ is the window size.
And if the current data is more related to recent data than to older data, model performance would be $f\left(\frac{\sum_{\tau=1}^{t-1}D_{(\tau)}\alpha^{t-1-\tau}\sum_{i=1}^{M}x_{i}(\tau)A_{i}^{T}}{\sum_{\tau=1}^{t-1}\alpha^{t-1-\tau}D_{(\tau)}}\right)$, where $\alpha$ is a positive decay factor less than 1.
A more complex model lets the form of $f$ change over time, i.e., $f_{t}\left(\frac{\sum_{\tau=1}^{t-1}D_{(\tau)}\sum_{i=1}^{M}x_{i}(\tau)A_{i}^{T}}{\sum_{\tau=1}^{t-1}D_{(\tau)}}\right)$. We leave these variants for future research.
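The windowed and decayed variants differ from the base model only in how past slots are weighted. The sketch below computes the two arguments of $f$ under the same reading of $D_{(\tau)}$ as per-slot data volume; the function names and sample values are illustrative assumptions, not part of the paper.

```python
from typing import Sequence


def windowed_extent(
    data_sizes: Sequence[float],
    retrain_profits: Sequence[float],
    window: int,
) -> float:
    """Numerator over the last `window` slots (tau = t-w .. t-1).

    The denominator still sums over all past slots, as in the windowed formula.
    """
    if not 1 <= window <= len(data_sizes):
        raise ValueError("window must be between 1 and the number of past slots")
    recent = zip(data_sizes[-window:], retrain_profits[-window:])
    return sum(d * a for d, a in recent) / sum(data_sizes)


def decayed_extent(
    data_sizes: Sequence[float],
    retrain_profits: Sequence[float],
    alpha: float,
) -> float:
    """Exponentially decayed average with weight alpha ** (t - 1 - tau)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    n = len(data_sizes)  # n = t - 1 past slots; index k corresponds to tau = k + 1
    weights = [alpha ** (n - 1 - k) for k in range(n)]
    num = sum(w * d * a for w, d, a in zip(weights, data_sizes, retrain_profits))
    den = sum(w * d for w, d in zip(weights, data_sizes))
    return num / den


# Hypothetical history of three slots.
sizes = [100, 300, 100]
profits = [0.2, 0.8, 0.5]
win = windowed_extent(sizes, profits, window=2)
dec = decayed_extent(sizes, profits, alpha=0.5)
```

In the decayed variant the most recent slot gets weight 1 and each older slot is discounted by a further factor of $\alpha$, so the average tracks the latest retraining choices more closely than the base model does.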

In the presence of an unknown analytical formula for the function $f$, we introduce the following assumptions to facilitate the algorithm design:

Assumption 2.

The value of $f(A^{T}_{\max})$ is known, and a positive lower bound of $f^{\prime}(A^{T}_{\max})$, denoted as $L$, is known.

In practice, the accuracy of the trained model on a test dataset with the same distribution as the training dataset serves as an estimate for $f(A^{T}_{\max})$. This is because this accuracy indicates the model's performance when drifts are absent, effectively representing the highest achievable accuracy ($f(A^{T}_{\max})$) under the best retraining and inference configurations in the presence of drifts. The value of $f^{\prime}(A^{T}_{\max})$, reflecting the rate of improvement in model accuracy when the model always uses the best retraining configuration under drifts, can only be determined with prior knowledge of the drifts. In our experiments, we set a small constant (e.g., 0.01) as the value of $L$ based on the accuracy improvement on the mentioned test dataset between the last two epochs of the training process. For a typical task, the optimal $L$ can be determined empirically by running the algorithm with various values of $L$ in the real world. Approximating an unknown value ($L$) is much simpler than approximating an unknown function ($f$).

Moreover, the inference configuration, viewed as the utilization of the model, also plays a significant role in determining its performance. We assume that model performance at time slot $t$ is directly proportional to the profit of the inference configuration used at that time, i.e., $\sum_{j=1}^{N}y_{j}(t)A_{j}^{I}D_{(t)}$. While output accuracy with different inference configurations may vary over time, we argue that $A_{j}^{I}$ represents the average utilization degree of the $j$-th inference configuration on the model and is assumed to be a constant known in advance.

Now, the problem of maximizing long-term average accuracy within the constraints of varying computing resources over time, with decision variables ($x_{i}(t),y_{j}(t)$) representing the chosen retraining and inference configurations, is formulated as:

$$
\begin{aligned}
\max_{x_{i}(t),y_{j}(t)}\quad & \sum_{t=1}^{T}f\left(\frac{\sum_{\tau=1}^{t-1}D_{(\tau)}\sum_{i=1}^{M}x_{i}(\tau)A_{i}^{T}}{\sum_{\tau=1}^{t-1}D_{(\tau)}}\right)\sum_{j=1}^{N}y_{j}(t)A_{j}^{I}D_{(t)} \\
\mathrm{s.t.}\quad & \text{Constraints (1)--(5)}.
\end{aligned}
\tag{P}
$$
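To make the time coupling in (P) concrete, the following sketch evaluates the objective for a fixed sequence of decisions. It is an illustration under stated assumptions: the concave curve `f` is a hypothetical stand-in (the paper treats $f$ as unknown), each slot is summarized by the profit of its single chosen retraining and inference configuration, and the first slot, whose argument is an empty ratio, is evaluated at $f(0)$.

```python
import math
from typing import Callable, Sequence


def objective_P(
    f: Callable[[float], float],
    data_sizes: Sequence[float],
    retrain_profits: Sequence[float],
    infer_profits: Sequence[float],
) -> float:
    """Sum over slots of f(average past learning extent) * A^I(t) * D_(t)."""
    total = 0.0
    cum_data = 0.0      # sum of D_(tau) over past slots
    cum_weighted = 0.0  # sum of D_(tau) * A^T(tau) over past slots
    for d, a_train, a_inf in zip(data_sizes, retrain_profits, infer_profits):
        # Assumption: with no past data (t = 1) the extent is taken as 0.
        extent = cum_weighted / cum_data if cum_data > 0 else 0.0
        total += f(extent) * a_inf * d
        # The retraining choice made now only affects later slots.
        cum_data += d
        cum_weighted += d * a_train
    return total


def example_curve(x: float) -> float:
    """Hypothetical increasing, concave curve with f(0) = 0.5 > 0."""
    return 1.0 - 0.5 * math.exp(-2.0 * x)


value = objective_P(example_curve, [100, 300, 100], [0.2, 0.8, 0.5], [0.9, 0.7, 0.8])
```

The loop shows why the problem is time-coupled: the retraining profit chosen in slot $t$ never appears in that slot's own term, only inside $f$ for all subsequent slots.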

III-D Existing Approaches

To the best of our knowledge, no online algorithm for a problem similar to (P) has been proposed in the literature so far. There are three main difficulties when dealing with it: (1) The objective function is nonconvex-nonconcave, as demonstrated in Theorem 1 with its proof in Appendix -A. (2) Decision variables are heavily coupled. (3) The analytical formula for $f$ is commonly unknown in practice.

Theorem 1.

If $f(x)$ is a concave function defined on $[0,A^{T}_{\max}]$, then $f(x)y$, defined on $[0,A^{T}_{\max}]\times[A^{I}_{\min},A^{I}_{\max}]$, is a nonconvex-nonconcave function.
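One way to build intuition for Theorem 1 is to inspect the Hessian of $h(x,y)=f(x)y$. The sketch below additionally assumes that $f$ is twice differentiable and that $f^{\prime}(x)\neq 0$ at some interior point; these assumptions are not part of the theorem, and the formal proof is the one given in Appendix -A, which may proceed differently.

```latex
\[
\nabla^{2}h(x,y)=
\begin{pmatrix}
f''(x)\,y & f'(x)\\
f'(x) & 0
\end{pmatrix},
\qquad
\det\nabla^{2}h(x,y)=-\bigl(f'(x)\bigr)^{2}.
\]
% Wherever f'(x) \neq 0 the determinant is strictly negative, so the two
% eigenvalues have opposite signs and the Hessian is indefinite.
```

An indefinite Hessian at an interior point rules out both convexity and concavity of $h$ on the domain, which matches the nonconvex-nonconcave claim.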

For the third difficulty, a promising technique is bandit convex optimization[27]. The method usually adds random noise to the decision variable and then estimates the gradient information [28] of the function, but these techniques are specifically designed for convex functions and may not be readily applicable to our situation. Moreover, the decision variables of our problem are discrete, and therefore, the technique of adding random noise is not suitable.

The well-known primal-dual[29] method in online algorithms is also not applicable to our problem. Even if we can approximate the analytical expression of $f$, it is difficult to find the dual function of the original function due to the heavy coupling between decision variables. Moreover, since the problem is nonconvex-nonconcave, strong duality may not hold as it does in linear programming. Then even if the dual function is found and solved, the solution may violate the original problem constraints. The review[30] gives more details on time-varying convex optimization algorithms.

IV Algorithm Design

We follow a three-step procedure to design a lightweight and theoretically guaranteed algorithm: (i) Leverage the concave property of $f$ to move the decision variables $\{x_{i}(t)\}$ out of $f$ (Lemma 1). Then the problem of interest changes from (P) to (Q). (ii) Decouple the interaction of $\{x_{i}(t),y_{j}(t)\}$ by introducing a particular regularization term (Lemma 3), with the concerned problem specialized to (D). Then ORRIC is proposed to solve (Dt), the subproblem of (D) in every time slot. (iii) Lemma 2 and Theorem 4 are used to facilitate the proof of the competitive ratio of ORRIC.

IV-A Deal with the Target Function of (P)

Lemma 1.

$f(x)\leq Lx+g(A^{T}_{\max})$, where $g(A^{T}_{\max})=f(A^{T}_{\max})-LA^{T}_{\max}$.

Proof.

By the concave property of $f(x)$, $f(x)\leq f(A^{T}_{\max})+f^{\prime}(A^{T}_{\max})(x-A^{T}_{\max})$. Because $x\leq A^{T}_{\max}$ and $f^{\prime}(A^{T}_{\max})\geq L>0$, we have $f^{\prime}(A^{T}_{\max})(x-A^{T}_{\max})\leq L(x-A^{T}_{\max})$, and thus $f(x)\leq Lx+f(A^{T}_{\max})-LA^{T}_{\max}\leq Lx+g(A^{T}_{\max})$. ∎
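The bound of Lemma 1 can be checked numerically on a concrete curve. The sketch below uses a hypothetical $f$ that satisfies Assumption 1, with $A^{T}_{\max}=1$ and $L=0.1$ picked only for illustration; none of these values come from the paper.

```python
import math

A_MAX = 1.0  # hypothetical A^T_max
L = 0.1      # hypothetical lower bound on f'(A^T_max)


def f(x: float) -> float:
    """Hypothetical increasing, concave, smooth curve with f(0) = 0.5."""
    return 1.0 - 0.5 * math.exp(-2.0 * x)


def f_prime(x: float) -> float:
    """Derivative of the curve above."""
    return math.exp(-2.0 * x)


# Assumption 2 requires 0 < L <= f'(A^T_max); here f'(1) = exp(-2), about 0.135.
assert f_prime(A_MAX) >= L > 0

g = f(A_MAX) - L * A_MAX  # g(A^T_max) as defined in Lemma 1


def linear_upper_bound(x: float) -> float:
    """Right-hand side of Lemma 1: L * x + g(A^T_max)."""
    return L * x + g


grid = [A_MAX * k / 100 for k in range(101)]
max_violation = max(f(x) - linear_upper_bound(x) for x in grid)
```

On this grid `max_violation` is zero up to floating-point rounding, attained at $x=A^{T}_{\max}$ where the line touches the curve. A smaller $L$ keeps the bound valid but makes it looser for small $x$.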

We use $z_{t-1}$ to denote $\sum_{\tau=1}^{t-1}D_{(\tau)}\sum_{i=1}^{M}x_{i}(\tau)A_{i}^{T}$, and then transform the objective function using Lemma 1 for $1\leq t\leq T$. Then we have $f\left(\frac{z_{t-1}}{\sum_{\tau=1}^{t-1}D_{(\tau)}}\right)\leq g(A^{T}_{\max})+\frac{Lz_{t-1}}{\sum_{\tau=1}^{t-1}D_{(\tau)}}$.
To further deal with $z_{t}$, we introduce the positive regularization term $\lambda_{t}\left(\sum_{\tau=1}^{t}D_{(\tau)}\sum_{i=1}^{M}x_{i}(\tau)A_{i}^{T}-z_{t}\right)$, where $\lambda_{t}>0,\forall t$. With this, the new problem (Q) is formulated as follows:

$$
\begin{aligned}
\max_{x_{i}(t),y_{j}(t),z_{t}}\quad & \sum_{t=1}^{T}g\left(A^{T}_{\max}\right)\sum_{j=1}^{N}y_{j}(t)A_{j}^{I}D_{(t)} \\
& +\sum_{t=2}^{T}L\frac{z_{t-1}}{\sum_{\tau=1}^{t-1}D_{(\tau)}}\sum_{j=1}^{N}y_{j}(t)A_{j}^{I}D_{(t)} \\
& +\sum_{t=1}^{T-1}\lambda_{t}\left(\sum_{\tau=1}^{t}D_{(\tau)}\sum_{i=1}^{M}x_{i}(\tau)A_{i}^{T}-z_{t}\right) \\
\mathrm{s.t.}\quad & \text{Constraints (1)--(5)}, \\
& z_{t}\leq\sum_{\tau=1}^{t}D_{(\tau)}A^{T}_{\max},\quad 1\leq t\leq T-1.
\end{aligned}
\tag{Q}
$$

Using $P^{*}$ and $Q^{*}$ to denote the optimal offline objective function values of problems (P) and (Q), respectively, the following lemma holds.

Lemma 2.

Suppose $\lambda_{t}>0$ for $1\leq t\leq T-1$, then $P^{*}\leq Q^{*}$.

Proof.

We can prove this by demonstrating that the optimal solution to (P) is also a feasible solution for (Q). Let $x_{i}^{*}(t)$ and $y_{j}^{*}(t)$ be the optimal solution to (P), and define $z_{t}=\sum_{\tau=1}^{t}D_{(\tau)}\sum_{i=1}^{M}x_{i}^{*}(\tau)A_{i}^{T}$. It follows that the optimal solution of (P) satisfies the constraints of (Q), and, by the preceding inequalities, the objective function value of (P) at this solution is no greater than that of (Q). ∎
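The argument of Lemma 2 can be illustrated numerically: with $z_{t}$ set to the true cumulative sum, the regularization term vanishes and each slot's term in (Q) dominates the corresponding term in (P) by Lemma 1. The sketch below assumes a hypothetical curve `f`, illustrative values for $A^{T}_{\max}$ and $L$, one chosen configuration per slot, and $f(0)$ for the first slot of (P); it checks a single feasible decision sequence rather than the optima themselves.

```python
import math
from typing import Sequence, Tuple

A_MAX, L = 1.0, 0.1  # hypothetical A^T_max and lower bound on f'(A^T_max)


def f(x: float) -> float:
    """Hypothetical curve satisfying Assumption 1, with f'(1) = exp(-2) >= L."""
    return 1.0 - 0.5 * math.exp(-2.0 * x)


G = f(A_MAX) - L * A_MAX  # g(A^T_max) from Lemma 1


def objectives(
    data_sizes: Sequence[float],
    retrain_profits: Sequence[float],
    infer_profits: Sequence[float],
) -> Tuple[float, float]:
    """Return (P value, Q value) with z_t equal to the cumulative D * A^T."""
    p_val = q_val = 0.0
    cum_data = z = 0.0
    for d, a_train, a_inf in zip(data_sizes, retrain_profits, infer_profits):
        gain = a_inf * d
        if cum_data > 0:
            extent = z / cum_data
            p_val += f(extent) * gain
            q_val += (G + L * extent) * gain
        else:
            # First slot: (Q) has only the g term; (P) is taken at f(0) by assumption.
            p_val += f(0.0) * gain
            q_val += G * gain
        cum_data += d
        z += d * a_train  # regularizer is zero because z_t matches this sum
    return p_val, q_val


p_val, q_val = objectives([100, 300, 100], [0.2, 0.8, 0.5], [0.9, 0.7, 0.8])
```

For any decision sequence whose retraining profits lie in $[0,A^{T}_{\max}]$, this construction yields `p_val <= q_val`, which is the inequality the proof applies at the optimum of (P).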

We choose a particular realization of $\lambda_{t}$ to simplify (Q).

Lemma 3.

When $\lambda_{t}=L\frac{D_{\min}A_{\min}^{I}}{tD_{\max}}$, the $z_{t}$ corresponding to the optimal solution of (Q) is $\sum_{\tau=1}^{t}D_{(\tau)}A^{T}_{\max}$.

Proof.

Extracting all terms containing $z_{t}$ in (Q), we have:
$$
\sum_{t=2}^{T}L\frac{z_{t-1}}{\sum_{\tau=1}^{t-1}D_{(\tau)}}\sum_{j=1}^{N}y_{j}(t)A_{j}^{I}D_{(t)}-\sum_{t=1}^{T-1}\lambda_{t}z_{t}=\sum_{t=1}^{T-1}\left[L\frac{1}{\sum_{\tau=1}^{t}D_{(\tau)}}\sum_{j=1}^{N}y_{j}(t+1)A_{j}^{I}D_{(t+1)}-\lambda_{t}\right]z_{t}.
$$
When $\lambda_{t}=L\frac{A_{\min}^{I}D_{\min}}{tD_{\max}}$, the coefficient of $z_{t}$ is no less than $0$, regardless of the values of $y_{j}(t+1)$ and $D_{(t+1)}$. To maximize (Q), $z_{t}$ will equal its maximum: $\sum_{\tau=1}^{t}D_{(\tau)}A^{T}_{\max}$. ∎
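The nonnegativity of the coefficient of $z_{t}$ follows from $\sum_{\tau=1}^{t}D_{(\tau)}\leq tD_{\max}$, $D_{(t+1)}\geq D_{\min}$, and $\sum_{j}y_{j}(t+1)A_{j}^{I}\geq A_{\min}^{I}$. The sketch below checks this on random instances; it assumes that exactly one inference configuration is selected per slot (presumably enforced by the constraints), and all bounds and names are hypothetical.

```python
import random
from typing import Sequence


def lambda_t(t: int, L: float, d_min: float, d_max: float, a_inf_min: float) -> float:
    """Regularization weight from Lemma 3: L * D_min * A^I_min / (t * D_max)."""
    return L * d_min * a_inf_min / (t * d_max)


def z_coefficient(
    past_data: Sequence[float],
    next_data: float,
    next_infer_profit: float,
    L: float,
    lam: float,
) -> float:
    """Coefficient of z_t in (Q).

    past_data         -- [D_(1), ..., D_(t)]
    next_data         -- D_(t+1)
    next_infer_profit -- sum_j y_j(t+1) * A_j^I
    """
    return L * next_infer_profit * next_data / sum(past_data) - lam


# Hypothetical bounds, chosen only for illustration.
L, D_MIN, D_MAX, A_INF_MIN, A_INF_MAX = 0.1, 50.0, 200.0, 0.5, 0.95

rng = random.Random(0)
min_coeff = float("inf")
for _ in range(1000):
    t = rng.randint(1, 20)
    past = [rng.uniform(D_MIN, D_MAX) for _ in range(t)]
    coeff = z_coefficient(
        past,
        rng.uniform(D_MIN, D_MAX),
        rng.uniform(A_INF_MIN, A_INF_MAX),
        L,
        lambda_t(t, L, D_MIN, D_MAX, A_INF_MIN),
    )
    min_coeff = min(min_coeff, coeff)
```

The coefficient reaches zero only in the extreme case where every past slot carries $D_{\max}$, the next slot carries $D_{\min}$, and the least profitable inference configuration is chosen, so `min_coeff` stays nonnegative across the sampled instances.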

After setting $\lambda_{t}=L\frac{D_{\min}A_{\min}^{I}}{tD_{\max}}$ and $z_{t}=\sum_{\tau=1}^{t}D_{(\tau)}A^{T}_{\max}$ based on Lemma 3, we obtain a specialized version of (Q), denoted (D), and we also have $P^{*}\leq D^{*}$ by Lemma 2, where $D^{*}$ denotes the optimal objective function value of (D):

$$
\begin{aligned}
\max_{x_{i}(t),y_{j}(t)}\quad & \sum_{t=1}^{T-1}\lambda_{t}\left(\sum_{\tau=1}^{t}D_{(\tau)}\sum_{i=1}^{M}x_{i}(\tau)A_{i}^{T}-\sum_{\tau=1}^{t}D_{(\tau)}A^{T}_{\max}\right) \\
& +\sum_{t=1}^{T}g(A^{T}_{\max})\sum_{j=1}^{N}y_{j}(t)A_{j}^{I}D_{(t)}+\sum_{t=2}^{T}LA^{T}_{\max}\sum_{j=1}^{N}y_{j}(t)A_{j}^{I}D_{(t)} \\
\mathrm{s.t.}\quad & \text{Constraints (1)--(5)}.
\end{aligned}
\tag{D}
$$

Then we apply some equivalent transformations to the objective function of (IV-A) and decouple the problem across time slots:

$$
\begin{aligned}
\max_{x_{i}(t),y_{j}(t)}\ & V_{t}\sum_{i=1}^{M}x_{i}(t)A_{i}^{T}+W_{t}\sum_{j=1}^{N}y_{j}(t)A_{j}^{I} \\
\mathbf{s.t.}\ & \text{Constraints (1)-(5) only at } t.
\end{aligned}
\tag{Dt}
$$

where $V_{t}=L\frac{D_{\min}A_{\min}^{I}}{D_{\max}}\left(\sum_{\tau=t}^{T-1}\frac{1}{\tau}\right)$, $W_{1}=f\left(A^{T}_{\max}\right)-LA^{T}_{\max}$, and $W_{t}=f\left(A^{T}_{\max}\right),\forall t>1$.
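As a rough illustration, the weights $V_t$ and $W_t$ defined above could be tabulated once before the time span starts. The sketch below is not the authors' code: the function name `orric_weights` and its parameter names are hypothetical, and the accuracy function $f$ is passed in as a callable because its concrete form is not fixed here.

```python
from typing import Callable, List, Tuple


def orric_weights(
    T: int,
    L: float,
    d_min: float,
    d_max: float,
    a_inf_min: float,    # A^I_min, smallest inference profit (assumed name)
    a_train_max: float,  # A^T_max, largest retraining profit (assumed name)
    f: Callable[[float], float],
) -> Tuple[List[float], List[float]]:
    """Return (V, W) with V[t-1] = V_t and W[t-1] = W_t for t = 1..T."""
    scale = L * d_min * a_inf_min / d_max
    V = [0.0] * T          # V_T stays 0 because its sum is empty
    tail = 0.0             # running value of sum_{tau=t}^{T-1} 1/tau
    for t in range(T - 1, 0, -1):
        tail += 1.0 / t
        V[t - 1] = scale * tail
    f_max = f(a_train_max)
    # W_1 = f(A^T_max) - L * A^T_max, and W_t = f(A^T_max) for t > 1.
    W = [f_max - L * a_train_max] + [f_max] * (T - 1)
    return V, W
```

Accumulating the harmonic tail backwards keeps the whole table at $O(T)$ work instead of recomputing each sum from scratch.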

IV-B Online Robust Retraining and Inference Co-location

The problem (Dt) can be solved by exhaustive search with $O(MN)$ complexity. To speed up the solution further, we introduce the following property, as in [16]:

Property 1.

$\forall a,b\in\mathcal{M},C_{a}^{T}>C_{b}^{T}\Rightarrow A_{a}^{T}>A_{b}^{T}$; and $\forall a,b\in\mathcal{N},C_{a}^{I}>C_{b}^{I}\Rightarrow A_{a}^{I}>A_{b}^{I}$.

It should be noted that there are some infrequent cases where Property 1 is not satisfied, i.e., better performance can be achieved with fewer resources. For instance, when the image is corrupted by Gaussian noise, downsampling may improve model performance, as illustrated in Table III. Similarly, in model training, more iterations may not necessarily lead to better performance[9]. A practical solution[9] to this issue is regularly measuring the resource-performance profiles of different configurations after model deployment.

However, since we have assumed in Section III-B that the resource requirements and profits of retraining and inference configurations remain constant throughout the whole time span $\mathcal{T}$, configurations that consume more resources yet yield lower profits can be eliminated before running the algorithm for (Dt), which ensures that Property 1 holds. This is because any reasonable algorithm for (Dt) would not choose these configurations when better alternatives are available, namely ones with equivalent or lower resource requirements but higher profits.
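The elimination step described above can be sketched as a single sorted pass. This is only an illustration under assumptions: the helper name `prune_dominated` is hypothetical, and treating top-1 accuracy as the profit and MACs as the cost of an inference configuration is our reading of Table III rather than something the text prescribes.

```python
from typing import List, Tuple


def prune_dominated(configs: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Keep (cost, profit) pairs so that strictly higher cost implies strictly higher profit."""
    # Sort by cost ascending; among equal costs, put the most profitable first.
    ordered = sorted(configs, key=lambda c: (c[0], -c[1]))
    kept: List[Tuple[float, float]] = []
    for cost, profit in ordered:
        # Keep a configuration only if it beats every cheaper (or equally cheap) one.
        if not kept or profit > kept[-1][1]:
            kept.append((cost, profit))
    return kept


# MobileNetV2 rows of Table III, "gaussian noise" column: (MACs in M, top-1 accuracy in %).
gaussian_noise = [(6.35, 42.97), (6.71, 55.71), (7.45, 64.53), (7.94, 56.28)]
pruned = prune_dominated(gaussian_noise)
```

On this column the 32*32 configuration costs more than the 28*28 one but is less accurate, so it is the one that gets dropped; the surviving list is ascending in both cost and profit, as Property 1 requires.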

Based on Property 1, we propose ORRIC (Online Robust Retraining and Inference Co-location), outlined in Algorithm 1. The underlying principle of ORRIC is that the optimal configuration is likely one that is about to violate the computational resource constraint. Thus, the optimal configuration can be identified by searching only through configurations that are on the verge of exceeding this constraint. The proof of the correctness of ORRIC can be found in Appendix -C. The complexity of ORRIC is $O(M+N)$: during each iteration of the loop, either $i=i+1$ or $j=j-1$. Since $i$ increases to at most $M+1$ and $j$ decreases to at least $0$, the total number of iterations of the loop is no more than $M+N$.

Input: $V_{t}$, $W_{t}$, $U_{t}=\frac{C_{(t)}}{D_{(t)}}$, and four ascending lists: $\{A_{i}^{T},i\in\mathcal{M}\}$, $\{A_{j}^{I},j\in\mathcal{N}\}$, $\{C_{i}^{T},i\in\mathcal{M}\}$, $\{C_{j}^{I},j\in\mathcal{N}\}$.
Output: A pair of retraining and inference configurations.
1:  Initialization: Set $i=1,j=N,i^{*}=j^{*}=K=0$.
2:  while $i\leq M$ and $j\geq 1$ do
3:     if $C_{i}^{T}+C_{j}^{I}\leq U_{t}$ then
4:        if $V_{t}A^{T}_{i}+W_{t}A^{I}_{j}>K$ then
5:           $i^{*}=i$; $j^{*}=j$; $K=V_{t}A^{T}_{i}+W_{t}A^{I}_{j}$;
6:        $i=i+1$;
7:     else
8:        $j=j-1$;
9:  return $i^{*},j^{*}$;
Algorithm 1 ORRIC
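A minimal Python transcription of Algorithm 1 is sketched below, next to an $O(MN)$ exhaustive search that can serve as a reference when testing. The function names are hypothetical, the lists are assumed to be sorted in ascending order (Property 1), and indices are returned 1-based as in the pseudocode, with $(0, 0)$ meaning that no pair was selected.

```python
from typing import Sequence, Tuple


def orric_step(
    V_t: float, W_t: float, U_t: float,
    A_T: Sequence[float], A_I: Sequence[float],
    C_T: Sequence[float], C_I: Sequence[float],
) -> Tuple[int, int]:
    """Two-pointer search of Algorithm 1; returns 1-based (i*, j*)."""
    M, N = len(A_T), len(A_I)
    i, j = 1, N
    best_i, best_j, K = 0, 0, 0.0
    while i <= M and j >= 1:
        if C_T[i - 1] + C_I[j - 1] <= U_t:
            # Feasible: j is the most expensive inference option that fits with i.
            value = V_t * A_T[i - 1] + W_t * A_I[j - 1]
            if value > K:
                best_i, best_j, K = i, j, value
            i += 1
        else:
            # Infeasible: fall back to a cheaper inference configuration.
            j -= 1
    return best_i, best_j


def exhaustive_step(
    V_t: float, W_t: float, U_t: float,
    A_T: Sequence[float], A_I: Sequence[float],
    C_T: Sequence[float], C_I: Sequence[float],
) -> Tuple[int, int]:
    """O(MN) reference search over all feasible pairs."""
    best_i, best_j, K = 0, 0, 0.0
    for i in range(1, len(A_T) + 1):
        for j in range(1, len(A_I) + 1):
            if C_T[i - 1] + C_I[j - 1] <= U_t:
                value = V_t * A_T[i - 1] + W_t * A_I[j - 1]
                if value > K:
                    best_i, best_j, K = i, j, value
    return best_i, best_j
```

The two functions may break ties differently, so a comparison between them is safer on the objective value than on the returned indices.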

In particular, ORRIC aligns with our intuition about the way to allocate limited computing resources to model retraining and inference to optimize long-term model performance. As depicted in Table II, ORRIC can be regarded as a combination of four heuristic algorithms, transitioning between them based on the duration of time and the availability of computing resources: 1) Knowledge-Distillation: The teacher model imparts knowledge to the student model without considering resource consumption. 2) Inference-Greedy: Prioritize using a higher configuration for inference and utilize the remaining resources for retraining. 3) Focus-Shift: Shift the focus from retraining to inference as time passes. 4) Inference-Only: This algorithm is actually the traditional computing paradigm that deploys a trained model and then performs inference.

When the computational resources are sufficient to use the best inference and retraining configurations, ORRIC reduces to Knowledge-Distillation because $V_{t}\geq 0,W_{t}>0,\forall t$. When resources are really scarce, e.g., $C_{(t)}=D_{(t)}C_{\min}^{I}$, ORRIC reduces to Inference-Only because $C_{\min}^{T}=0$ while $C_{\min}^{I}>0$. When resources are limited (but not scarce) and $T$ is large, ORRIC reduces to Focus-Shift because $\sum_{\tau=t}^{T-1}\frac{1}{\tau}>\ln(T)-\ln(t)$ and $V_{T}=0$. When resources are limited (but not scarce) and $T$ is small, ORRIC reduces to Inference-Greedy because $W_{t}$ is constant for $t>1$ while $V_{t}$ decreases to 0 as $t$ increases.
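The logarithmic bound used for the Focus-Shift case follows from a standard sum-versus-integral comparison. A short derivation, assuming $1\leq t\leq T-1$ and using $\frac{1}{\tau}>\frac{1}{x}$ for $x\in(\tau,\tau+1]$:

```latex
\sum_{\tau=t}^{T-1}\frac{1}{\tau}
  \;>\; \sum_{\tau=t}^{T-1}\int_{\tau}^{\tau+1}\frac{dx}{x}
  \;=\; \int_{t}^{T}\frac{dx}{x}
  \;=\; \ln(T)-\ln(t).
```

Consequently, when $L>0$, $V_{t}>L\frac{D_{\min}A_{\min}^{I}}{D_{\max}}\left(\ln(T)-\ln(t)\right)$ for $t<T$, so the retraining weight is large in early slots of a long horizon and falls to $V_{T}=0$ at the end.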

The translation relationship between ORRIC and the four heuristic algorithms not only illustrates the rationality of ORRIC but also provides insights into the properties of algorithms designed for the model retraining and inference co-location paradigm. We believe that all rational algorithms for this paradigm should similarly translate to these four heuristic algorithms given specific conditions regarding time length and available computing resources, as illustrated in Table II.

TABLE II: ORRIC and Several Heuristic Algorithms.

| Resources | $T$ Large | $T$ Small |
|---|---|---|
| Sufficient | Knowledge-Distillation | Knowledge-Distillation |
| Limited | Focus-Shift | Inference-Greedy |
| Scarce | Inference-Only | Inference-Only |

Remark: Our algorithm is an open-loop algorithm that does not leverage feedback from the system. We acknowledge that it is possible to calculate the current accuracy of the student model, $f\left(\frac{\sum_{\tau=1}^{t-1}D_{(\tau)}\sum_{i=1}^{M}x_{i}(\tau)A_{i}^{T}}{\sum_{\tau=1}^{t-1}D_{(\tau)}}\right)\sum_{j=1}^{N}y_{j}(t)A_{j}^{I}D_{(t)}$, at the end of every time slot $t$ by treating the pseudo-labels output by the teacher model as the ground-truth labels. However, the mathematical techniques needed to incorporate such feedback into the design of algorithms for formulations similar to problem (P) are lacking in the existing literature. We leave the study of closed-loop algorithms for problem (P) as future work.

V Performance Analysis

Definition 1.

For a maximization problem, a competitive ratio (or CR) of algorithm $ALG$ is a constant $c$ such that $c\leq\frac{ALG(I)}{OPT(I)}$ for every input $I$, where $OPT$ represents the optimal offline algorithm with complete knowledge of future information.

Definition 2.

For a maximization problem, the tight competitive ratio $c$ of algorithm $ALG$ is a competitive ratio of $ALG$ for which there is no $c^{\prime}>c$ such that $c^{\prime}\leq\frac{ALG(I)}{OPT(I)}$ for every input $I$.

Theorem 2.

The CR of Inference-Only is $\frac{f(0)}{f(A^{T}_{\max})}$.

Proof.

Denote $\{x_{i}^{*}(t),y_{j}^{*}(t)\}$ as the optimal offline solution to (P) and $\{x_{i}(t),y_{j}(t)\}$ as the solution given by Inference-Only. Then

$$
P^{*}\leq\sum_{t=1}^{T}f(A^{T}_{\max})\sum_{j=1}^{N}y_{j}^{*}(t)A_{j}^{I}D_{(t)}\leq\frac{f(A^{T}_{\max})}{f(0)}\sum_{t=1}^{T}f(0)\sum_{j=1}^{N}y_{j}(t)A_{j}^{I}D_{(t)}=\frac{f(A^{T}_{\max})}{f(0)}P.
$$

∎

Theorem 3.

An upper bound of the tight competitive ratio of Inference-Only is $\frac{Tf(0)}{f(0)+(T-1)f(A^{T}_{\max})}$.

Insight: The closer $f(0)$ and $f(A^{T}_{\max})$ are, the closer the competitive ratio ($\frac{f(0)}{f(A^{T}_{\max})}$) and the upper bound of the tight competitive ratio ($\frac{Tf(0)}{f(0)+(T-1)f(A^{T}_{\max})}$) of Inference-Only are to 1. This implies that when the drift is very slight, Inference-Only approaches the optimal offline algorithm. The proof of Theorem 3 is provided in Appendix -B.

TABLE III: Top-1 Accuracy (%) on CIFAR-10 and CIFAR-10-C.

| Model (Resolution) | MACs (M) | Latency ($\mu s$) | original | brightness | contrast | defocus blur | elastic transform | fog | frost | gaussian blur | gaussian noise | glass blur | impulse noise | jpeg compression | motion blur | pixelate | saturate | shot noise | snow | spatter | speckle noise | zoom blur | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MobileNetV2 (20*20) | 6.35 | 7.54 | 44.93 | 42.60 | 23.28 | 40.47 | 39.25 | 27.64 | 39.46 | 38.41 | 42.97 | 40.33 | 41.35 | 42.95 | 35.84 | 43.02 | 38.14 | 43.29 | 39.29 | 41.38 | 43.08 | 41.69 | 39.18 |
| MobileNetV2 (24*24) | 6.71 | 8.37 | 59.38 | 54.41 | 28.09 | 51.26 | 49.94 | 37.42 | 48.49 | 48.08 | 55.71 | 50.18 | 53.04 | 57.10 | 44.07 | 55.91 | 50.56 | 56.77 | 50.42 | 55.41 | 56.78 | 49.96 | 50.19 |
| MobileNetV2 (28*28) | 7.45 | 10.15 | 73.29 | 67.94 | 38.33 | 63.21 | 62.48 | 49.68 | 59.17 | 59.23 | 64.53 | 62.21 | 60.38 | 69.31 | 53.91 | 69.67 | 63.68 | 66.57 | 61.70 | 67.33 | 66.68 | 61.22 | 61.43 |
| MobileNetV2 (32*32) | 7.94 | 10.51 | 79.57 | 76.00 | 47.52 | 71.08 | 71.91 | 62.74 | 62.70 | 67.02 | 56.28 | 62.90 | 57.38 | 74.71 | 62.42 | 76.98 | 71.61 | 61.98 | 65.83 | 71.92 | 62.86 | 67.78 | 65.87 |
| ResNet50 (20*20) | 65.76 | 17.41 | 54.50 | 49.20 | 32.26 | 50.71 | 49.00 | 39.31 | 44.19 | 48.99 | 52.23 | 49.99 | 49.99 | 53.04 | 45.79 | 53.03 | 46.06 | 52.95 | 45.68 | 48.92 | 52.81 | 52.83 | 48.26 |
| ResNet50 (24*24) | 68.96 | 19.29 | 71.95 | 66.25 | 40.68 | 62.58 | 61.52 | 50.54 | 60.49 | 58.75 | 68.26 | 62.61 | 64.58 | 69.64 | 54.60 | 68.51 | 62.09 | 69.10 | 61.03 | 64.42 | 68.98 | 61.99 | 61.93 |
| ResNet50 (28*28) | 82.01 | 24.08 | 79.02 | 74.19 | 42.74 | 66.58 | 66.79 | 55.34 | 66.95 | 61.60 | 72.89 | 68.07 | 66.01 | 75.72 | 56.96 | 75.12 | 69.01 | 74.45 | 69.11 | 70.41 | 74.12 | 64.91 | 66.89 |
| ResNet50 (32*32) | 86.37 | 24.09 | 86.13 | 83.21 | 55.34 | 73.97 | 76.59 | 70.41 | 76.09 | 68.40 | 72.94 | 70.55 | 62.42 | 82.43 | 66.48 | 82.33 | 78.61 | 76.16 | 76.44 | 75.46 | 75.90 | 70.13 | 73.36 |
| ORRIC | - | - | 79.24 | 79.06 | 52.19 | 72.08 | 72.35 | 67.20 | 70.96 | 67.51 | 68.44 | 64.90 | 58.99 | 75.70 | 64.51 | 77.23 | 73.15 | 69.01 | 70.46 | 71.69 | 69.46 | 69.69 | 69.19 |

Theorem 4.

The CR of ORRIC is $\frac{(1+\alpha)f(0)}{f(A^{T}_{\max})}$ or $\frac{1}{\frac{f(A^{T}_{\max})}{f(0)}-\alpha}$, where $\alpha=\frac{LA^{T}_{\max}D^{2}_{\min}A_{\min}^{I}}{f(A^{T}_{\max})D^{2}_{\max}A^{I}_{\max}}$.

Proof.

The basic idea is that we have $P^{*}\leq D^{*}$ based on Lemma 2, and if we prove $D^{*}\leq\frac{1}{c}P$, then $c$ is a competitive ratio. See Appendix -D for details.

Insight: First, similarly to Theorems 2 and 3, if the drift is slight, $L$ is close to 0, and ORRIC then reduces to Inference-Only (which is an almost optimal algorithm in this case). Second, ORRIC relies on a precise estimate of the lower bound ($L$) of the degree of drift ($f^{\prime}(A^{T}_{\max})$). If the degree of drift is large and the estimate is precise, then $L$ and $V_{t}$ are large too, making the model pay more attention to retraining to achieve good future performance. When $f^{\prime}(A^{T}_{\max})$ is underestimated too much, the model pays more attention to inference, reducing its future performance due to a lack of retraining, which is consistent with the term $L$ in the CR. Third, the term $\frac{D^{2}_{\min}}{D^{2}_{\max}}$ suggests that ORRIC performs better with less variability in input data size.

Figure 2: Competitive Ratio Result.
Corollary 1.

When $T>\frac{f(A^{T}_{\max})-f(0)}{\alpha f(0)}$, the tight competitive ratio of ORRIC is strictly better (bigger) than the tight competitive ratio of Inference-Only.

Proof.

Denote by $c_{1}$ and $c_{2}$ the tight competitive ratios of ORRIC and Inference-Only, respectively. From Definition 2 and Theorem 4, we have $c_{1}\geq\frac{1}{\frac{f(A^{T}_{\max})}{f(0)}-\alpha}$. When $T>\frac{f(A^{T}_{\max})-f(0)}{\alpha f(0)}$, we have $\frac{1}{\frac{f(A^{T}_{\max})}{f(0)}-\alpha}>\frac{Tf(0)}{f(0)+(T-1)f(A^{T}_{\max})}$. Then, based on Theorem 3, we have $\frac{Tf(0)}{f(0)+(T-1)f(A^{T}_{\max})}\geq c_{2}$. Chaining these three inequalities gives $c_{1}>c_{2}$. Besides, from Definition 2 and Theorem 2, we have $c_{2}\geq\frac{f(0)}{f(A^{T}_{\max})}$. We summarize our theoretical results in Fig. 2. ∎
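The middle inequality of this proof can be checked by direct algebra. The steps below are a sketch that assumes $f(0)>0$, $\alpha>0$, and $\frac{f(A^{T}_{\max})}{f(0)}-\alpha>0$, so that both denominators are positive and cross-multiplication preserves the direction of the inequality:

```latex
\begin{aligned}
\frac{1}{\frac{f(A^{T}_{\max})}{f(0)}-\alpha} > \frac{Tf(0)}{f(0)+(T-1)f(A^{T}_{\max})}
&\iff f(0)+(T-1)f(A^{T}_{\max}) > Tf(0)\Big(\frac{f(A^{T}_{\max})}{f(0)}-\alpha\Big) \\
&\iff f(0)+(T-1)f(A^{T}_{\max}) > Tf(A^{T}_{\max})-T\alpha f(0) \\
&\iff T\alpha f(0) > f(A^{T}_{\max})-f(0) \\
&\iff T > \frac{f(A^{T}_{\max})-f(0)}{\alpha f(0)} .
\end{aligned}
```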

Insight: Due to the rough approximation of the objective function of (P), ORRIC may not fully realize the potential of the model retraining and inference co-location paradigm. However, the tight competitive ratio of ORRIC still surpasses that of Inference-Only when drift occurs ($L>0$) for a sufficiently long time ($T>\frac{f(A^{T}_{\max})-f(0)}{\alpha f(0)}$). This implies that, in such scenarios, the worst-case performance of the model retraining and inference co-location paradigm is strictly better than that of the traditional Inference-Only paradigm.
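To get a feel for the threshold on $T$, the bounds from Theorems 3 and 4 can be evaluated numerically. The sketch below uses hypothetical function names and purely illustrative input values (they are not measurements from the paper); it only restates the closed-form expressions given above.

```python
def orric_alpha(L: float, a_train_max: float, f_max: float,
                d_min: float, d_max: float,
                a_inf_min: float, a_inf_max: float) -> float:
    """alpha from Theorem 4."""
    return (L * a_train_max * d_min ** 2 * a_inf_min) / (f_max * d_max ** 2 * a_inf_max)


def orric_cr(f0: float, f_max: float, alpha: float) -> float:
    """Competitive ratio 1 / (f(A^T_max)/f(0) - alpha) from Theorem 4."""
    denom = f_max / f0 - alpha
    if denom <= 0:
        raise ValueError("f_max / f0 - alpha must be positive")
    return 1.0 / denom


def inference_only_upper_bound(T: int, f0: float, f_max: float) -> float:
    """Upper bound on the tight competitive ratio of Inference-Only (Theorem 3)."""
    return T * f0 / (f0 + (T - 1) * f_max)


def horizon_threshold(f0: float, f_max: float, alpha: float) -> float:
    """Value of T above which Corollary 1 applies."""
    return (f_max - f0) / (alpha * f0)


# Illustrative (made-up) values: f(0) = 0.6, f(A^T_max) = 0.8, L = 0.2.
alpha = orric_alpha(L=0.2, a_train_max=1.0, f_max=0.8,
                    d_min=1.0, d_max=1.0, a_inf_min=0.5, a_inf_max=1.0)
threshold = horizon_threshold(0.6, 0.8, alpha)
cr = orric_cr(0.6, 0.8, alpha)
```

With these illustrative numbers the threshold falls between 2 and 3, so the ORRIC ratio exceeds the Inference-Only upper bound at $T=3$ but not yet at $T=2$.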

VI Experiments

We conduct experiments to answer the following question: Can the model retraining and inference co-location paradigm alleviate the negative effect of data drift on model performance?

VI-A Setup

CIFAR-10-C[31], a dataset generated by adding 15 common corruptions and 4 extra corruptions to the test set of CIFAR-10, is typically used in experiments on out-of-distribution generalization or continual test-time adaptation[32]. We treat these corruptions as imitations of data drift. We first train MobileNetV2 and ResNet50 on the training set of CIFAR-10, then test them on CIFAR-10-C and the test set of CIFAR-10 (the “original” column in Table III) separately. Specifically, we use original images (whose resolution is $32\times32$) for training, while resized images (whose resolution may be $32\times32$, $28\times28$, $24\times24$, or $20\times20$) are used for testing. We do not use lower resolutions (e.g., $16\times16$) because the predictions given by the trained MobileNetV2 and ResNet50 are random in these cases. We also report the computing resource consumption (measured in MACs, which can be calculated with a third-party library such as PyTorch-OpCounter) and latency (measured on our NVIDIA A40 server) of MobileNetV2 and ResNet50 on a single image at different resolutions. These varying resolutions represent distinct inference configurations ($A^{I}_{j}$). The retraining configurations ($A^{T}_{i}$) are delineated by the sampling ratios (0, 0.1, 0.2, 0.3, 0.5, 1.0), denoting the portion of the data uploaded in the $t$-th time slot ($D_{(t)}$) that is used for one epoch of model retraining.
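As an illustration of how a retraining configuration acts on the uploaded data, the sketch below selects the portion of one time slot's data given by a sampling ratio. Uniform random selection without replacement is our assumption (the setup only specifies the portion, not how it is drawn), and the function name is hypothetical.

```python
import random

# Configuration sets listed in the setup.
SAMPLING_RATIOS = (0.0, 0.1, 0.2, 0.3, 0.5, 1.0)  # retraining configurations
RESOLUTIONS = (32, 28, 24, 20)                    # inference configurations (side length in pixels)


def select_retraining_subset(uploaded, ratio, seed=0):
    """Return the portion of one time slot's uploaded data used for one retraining epoch.

    Assumption: the subset is drawn uniformly at random without replacement.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("sampling ratio must lie in [0, 1]")
    k = round(ratio * len(uploaded))
    rng = random.Random(seed)
    return rng.sample(list(uploaded), k)


# Example: a time slot with 1000 uploaded samples, identified here by their indices.
slot = list(range(1000))
sizes = [len(select_retraining_subset(slot, r)) for r in SAMPLING_RATIOS]
print(sizes)
```

A ratio of 0 corresponds to skipping retraining in that time slot, and a ratio of 1.0 uses all uploaded data.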

VI-B Results

We compare the following three methods on CIFAR-10-C: Teacher-Only (using ResNet50 for inference and without retraining), Student-Only (using MobileNetV2 for inference and without retraining), and ORRIC (using MobileNetV2 for inference and using ResNet50 to retrain MobileNetV2).

Without loss of generality, we set $D_{(t)}=1000,\forall t$ for all three methods and $C_{(t)}\sim\mathbb{U}(C_{1},C_{2}),\forall t$ for ORRIC, where $\mathbb{U}(C_{1},C_{2})$ is a uniform distribution between $C_{1}$ (the MACs when MobileNetV2 performs inference on 1000 images whose resolution is $32\times32$) and $C_{2}$ (the MACs when ResNet50 performs inference on 1000 images whose resolution is $32\times32$). To satisfy Property 1, we delete the $32\times32$ inference configuration and set $f(A^{T}_{\max})$ to $0.7329$ for the “gaussian noise”, “impulse noise”, “shot noise”, and “speckle noise” datasets. For other datasets, we set $f(A^{T}_{\max})$ to $0.7957$ (see the “original” column). We set $L$ to $0.01$ and set $A_{i}^{T}=\beta C_{i}^{T}$ (where $\beta$ is a normalization coefficient, making $\max_{i}\{A_{i}^{T}\}=1$).
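The two settings above (a uniformly distributed resource budget and the normalization of the retraining parameters) can be sketched as follows. The cost values are placeholders, not the measured MACs from Table III, and the function names are ours.

```python
import random


def sample_capacity(c1, c2, rng):
    """Draw the available resource of one time slot, C_(t) ~ U(C1, C2)."""
    return rng.uniform(c1, c2)


def normalize_retraining_params(costs):
    """Return A_i^T = beta * C_i^T, with beta chosen so that max_i A_i^T = 1."""
    beta = 1.0 / max(costs)
    return [beta * c for c in costs]


# Placeholder retraining costs C_i^T, one per sampling ratio (illustrative units).
retraining_costs = [0.0, 1.0, 2.0, 3.0, 5.0, 10.0]
print(normalize_retraining_params(retraining_costs))

rng = random.Random(0)
c1, c2 = 1.0, 4.0  # placeholder values for C1 and C2
print(sample_capacity(c1, c2, rng))
```

Normalizing by the largest cost keeps the ordering of the configurations while fixing the scale of $A_{i}^{T}$ to $[0,1]$.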

The real-time accuracy of these three methods on the “fog” corruption dataset is shown in Fig. 3 (a). Because each type of corruption dataset in CIFAR-10-C has 5 severity levels, and the first 10,000 images are at severity 1, while the last 10,000 images are at severity 5[31], the accuracy of all three methods drops suddenly and periodically. However, the curve of ORRIC is almost always higher than the curve of Student-Only, showing the benefit of model retraining. We also give the average accuracy of ORRIC on other corruption datasets in the last row of Table III.
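The periodic drops can be related to time slots with a small sketch. It assumes that the 50,000 images of one corruption type are streamed in their stored order (10,000 images per severity level) and that each time slot serves $D_{(t)}=1000$ images; the function name is ours.

```python
IMAGES_PER_SEVERITY = 10_000  # images per severity level in one CIFAR-10-C corruption file
NUM_SEVERITIES = 5
SLOT_SIZE = 1000              # D_(t) = 1000 images per time slot


def severity_of_slot(t):
    """Severity level (1 to 5) of the images served in time slot t (1-indexed)."""
    first_image = (t - 1) * SLOT_SIZE
    severity = first_image // IMAGES_PER_SEVERITY + 1
    if not 1 <= severity <= NUM_SEVERITIES:
        raise ValueError("time slot lies outside the 50,000-image stream")
    return severity


# Time slots at which the severity level increases.
change_points = [t for t in range(2, 51) if severity_of_slot(t) != severity_of_slot(t - 1)]
print(change_points)
```

Under these assumptions the stream spans 50 time slots, and the severity level increases every 10 slots.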

The resource consumption and latency of these three methods on the “fog” corruption dataset can be calculated using the parameters given in Table III, and we report the Accuracy-Cost-Latency trade-off of these three methods while normalizing the maximum value of each axis to 1, see Fig. 3 (b). ORRIC surpasses or equals the Student-Only algorithm in terms of accuracy and latency while utilizing idle available computing resources. ORRIC surpasses the Teacher-Only algorithm in terms of cost and latency while maintaining good accuracy. In general, the model retraining and inference co-location paradigm can utilize idle available resources to improve model accuracy while maintaining low latency, thereby alleviating the negative impact of drift on accuracy.

Figure 3: Results on the “fog” Corruption Dataset of CIFAR-10-C. (a) Real-time Accuracy Results Comparison. (b) Accuracy-Cost-Latency Trade-off Comparison.

VII Conclusion

In this paper, we study online resource allocation in the model retraining and inference co-location paradigm. We model the current model performance as a function of past retraining configurations and the current inference configuration, and then propose a linear-complexity online algorithm (named ORRIC). Our competitive analysis reveals the advantages and applicable scenarios of the model retraining and inference co-location paradigm over the traditional Inference-Only paradigm. Experiments on CIFAR-10-C validate the effectiveness of model retraining and inference co-location in drift scenarios.

VIII Acknowledgment

The authors appreciate the reviewers for their insightful and valuable comments. Discussions with Kongyange Zhao, Tao Ouyang and Qing Ling are gratefully acknowledged.

-A Proof of Theorem 1

Proof.

Suppose $(x_{1},y_{1})$ and $(x_{2},y_{2})$ are two points in the domain of $f(x)y$, and denote $\bar{x}=\alpha x_{1}+(1-\alpha)x_{2}$ and $\bar{y}=\alpha y_{1}+(1-\alpha)y_{2}$, where $0<\alpha<1$. Then $E=f(\bar{x})\bar{y}-\left[\alpha f(x_{1})y_{1}+(1-\alpha)f(x_{2})y_{2}\right]=\alpha\left[f(\bar{x})-f(x_{1})\right]y_{1}+(1-\alpha)\left[f(\bar{x})-f(x_{2})\right]y_{2}$. If there exist values for $\alpha,x_{1},x_{2},y_{1},y_{2}$ that make $E$ less than $0$, then $f(x)y$ is nonconcave. We can also prove that $f(x)y$ is nonconvex in the same way, by exhibiting values that make $E$ greater than $0$. Therefore, we can conclude that $f(x)y$ is nonconvex-nonconcave.
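A numerical instance makes both signs of $E$ explicit. The sketch below uses $f(x)=\sqrt{x}$ purely as a stand-in for an increasing, concave accuracy function (our assumption; the paper does not fix $f$), and picks points where one of $y_{1},y_{2}$ is zero so that a single term of $E$ survives.

```python
import math


def convexity_gap(f, x1, y1, x2, y2, alpha):
    """E = f(x_bar) * y_bar - [alpha * f(x1) * y1 + (1 - alpha) * f(x2) * y2]."""
    x_bar = alpha * x1 + (1 - alpha) * x2
    y_bar = alpha * y1 + (1 - alpha) * y2
    return f(x_bar) * y_bar - (alpha * f(x1) * y1 + (1 - alpha) * f(x2) * y2)


f = math.sqrt  # stand-in for an increasing, concave f (assumption)

# y1 = 0 keeps only the term (1 - alpha) * [f(x_bar) - f(x2)] * y2, which is negative.
e_neg = convexity_gap(f, x1=0.0, y1=0.0, x2=1.0, y2=1.0, alpha=0.5)
# y2 = 0 keeps only the term alpha * [f(x_bar) - f(x1)] * y1, which is positive.
e_pos = convexity_gap(f, x1=0.0, y1=1.0, x2=1.0, y2=0.0, alpha=0.5)

print(e_neg, e_pos)
```

A negative value of $E$ rules out concavity and a positive value rules out convexity, so this single choice of $f$ already exhibits both.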

If $f(x)$ is twice differentiable, the indefiniteness of the Hessian matrix of $f(x)y$ also proves the theorem. ∎
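For the twice-differentiable case, the Hessian argument can be written out explicitly. The following is a sketch with $g(x,y)=f(x)y$ (the name $g$ is introduced here for convenience), evaluated at any point where $f^{\prime}(x)\neq 0$:

```latex
\nabla^{2} g(x,y) =
\begin{pmatrix}
f''(x)\,y & f'(x) \\
f'(x) & 0
\end{pmatrix},
\qquad
\det \nabla^{2} g(x,y) = -\left(f'(x)\right)^{2} < 0 .
```

A symmetric $2\times 2$ matrix with a negative determinant has one positive and one negative eigenvalue, so the Hessian is indefinite and $g$ is neither convex nor concave around that point.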

-B Proof of Theorem 3

Proof.

We show this fact using proof by contradiction. Consider a situation where computing resources are sufficient and $D_{(t)}=D$ in every time slot. Then $P^{*}=f(0)A_{\max}^{I}D_{(1)}+\sum_{t=2}^{T}f(A^{T}_{\max})A_{\max}^{I}D_{(t)}=\left(f(0)+\sum_{t=2}^{T}f(A^{T}_{\max})\right)A_{\max}^{I}D$, while $P=\sum_{t=1}^{T}f(0)A_{\max}^{I}D_{(t)}=\left(\sum_{t=1}^{T}f(0)\right)A_{\max}^{I}D$, so we have: $\frac{P}{P^{*}}=\frac{\sum_{t=1}^{T}f(0)}{f(0)+\sum_{t=2}^{T}f(A^{T}_{\max})}=\frac{Tf(0)}{f(0)+(T-1)f(A^{T}_{\max})}$. Suppose the tight competitive ratio of the Inference-Only algorithm (denoted as $\bar{c}$) is strictly bigger than $\frac{Tf(0)}{f(0)+(T-1)f(A^{T}_{\max})}$, which means that for any input, $P\geq\bar{c}P^{*}>\frac{Tf(0)}{f(0)+(T-1)f(A^{T}_{\max})}P^{*}$. However, we have found an input for which $P=\frac{Tf(0)}{f(0)+(T-1)f(A^{T}_{\max})}P^{*}$, which is a contradiction. Hence, $\frac{Tf(0)}{f(0)+(T-1)f(A^{T}_{\max})}$ is an upper bound on the tight competitive ratio $\bar{c}$ of the Inference-Only algorithm. ∎
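The instance used in this proof is easy to evaluate numerically. In the sketch below the common factor $A_{\max}^{I}D$ is dropped because it cancels in the ratio; the values of $f(0)$ and $f(A^{T}_{\max})$ are hypothetical, and the function name is ours.

```python
def inference_only_ratio(T, f0, f_max):
    """P / P* on the instance with sufficient resources and D_(t) = D for all t."""
    p = sum(f0 for _ in range(T))                         # Inference-Only stays at f(0)
    p_star = f0 + sum(f_max for _ in range(2, T + 1))     # f(0) in slot 1, f(A^T_max) afterwards
    return p / p_star


# Hypothetical accuracies, for illustration only.
f0, f_max = 0.5, 0.8
for T in (1, 10, 100, 1000):
    print(T, inference_only_ratio(T, f0, f_max))
print("f(0) / f(A^T_max) =", f0 / f_max)
```

As $T$ grows, the ratio decreases toward $f(0)/f(A^{T}_{\max})$, which matches the lower bound on $c_{2}$ quoted from Theorem 2.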

-C Proof of the Correctness of ORRIC

Proof.

Assume that the optimal configuration indices for retraining and inference in $Dt$ are $a$ and $b$, so $C_{a}^{T}+C_{b}^{I}\leq U$. We only need to demonstrate that ORRIC must have explored the tuple $(a,b,VA^{T}_{a}+WA^{I}_{b})$. ORRIC terminates when either $i>M$ or $j<1$. Consider the scenario where it terminates with $j<1$ (the case of terminating with $i>M$ is similar). In this case, $j$ decreases from $N$ to $0$. When $j$ reaches $b$, assume that $i=a_{1}$ at this moment.

First case: If $a_{1}\leq a$, then $C_{a_{1}}^{T}+C_{b}^{I}\leq U$. According to the algorithm, $i$ will start increasing from $a_{1}$ until $C_{i}^{T}+C_{b}^{I}>U$ or until $i>M$, whichever happens first. At this point, $i>a$, so $(a,b,VA^{T}_{a}+WA^{I}_{b})$ must have been explored by ORRIC.

Second case: If $a_{1}>a$, then the previous iteration is $(a_{1},b+1)$ (where $C_{a_{1}}^{T}+C_{b+1}^{I}>C_{a}^{T}+C_{b+1}^{I}>U$). The iteration before that cannot be $(a_{1}-1,b+1)$ (since $C_{a_{1}-1}^{T}+C_{b+1}^{I}\geq C_{a}^{T}+C_{b+1}^{I}>U$, so $(a_{1}-1,b)$ would follow $(a_{1}-1,b+1)$). Therefore, tracing backward, the preceding pairs are $(a_{1},b+2)$, $(a_{1},b+3)$, and so on until $(a_{1},N)$ is reached. At this point, the pairs before it must be $(a_{1}-1,N)$, $(a_{1}-2,N)$, and so on until $a_{1}-k$ is found such that $C_{a_{1}-k}^{T}+C_{N}^{I}<U$. In this case, $(a,N)$ must be present in these iterations. However, according to the algorithm, the next iteration from $(a,N)$ is $(a,N-1)$, not $(a+1,N)$. Therefore, this case is not possible. ∎
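The traversal that this proof reasons about can be sketched as a two-pointer search. This is only an illustration consistent with the steps described above, not the authors' listing: it uses 0-based indices, and it assumes that the cost lists and the parameter lists are sorted in nondecreasing order and that $V,W\geq 0$. The function name is hypothetical.

```python
def staircase_search(cost_t, cost_i, acc_t, acc_i, V, W, U):
    """Search feasible (retraining, inference) pairs in O(M + N) steps.

    cost_t[i], acc_t[i]: cost C_i^T and parameter A_i^T of retraining configuration i.
    cost_i[j], acc_i[j]: cost C_j^I and parameter A_j^I of inference configuration j.
    Returns (i, j, V * A_i^T + W * A_j^I) for the best explored pair, or None.
    Assumes all four lists are nondecreasing and V, W >= 0.
    """
    i, j = 0, len(cost_i) - 1
    best = None
    while i < len(cost_t) and j >= 0:
        if cost_t[i] + cost_i[j] <= U:
            value = V * acc_t[i] + W * acc_i[j]
            if best is None or value > best[2]:
                best = (i, j, value)
            i += 1   # feasible: try a more expensive retraining configuration
        else:
            j -= 1   # infeasible: fall back to a cheaper inference configuration
    return best


# Small illustrative instance (all numbers are placeholders).
print(staircase_search([0, 2, 5], [1, 3, 6], [0, 4, 10], [2, 5, 9], V=1, W=2, U=8))
```

Each iteration either advances $i$ or decreases $j$, so at most $M+N$ pairs are examined, which is where the linear complexity comes from.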

-D Proof of Theorem 4

Proof.
$$
\begin{aligned}
D^{*} ={}& \underbracket{f\left(A^{T}_{\max}\right)\sum_{j=1}^{N}y_{j}(1)A_{j}^{I}D_{(1)}}_{A_{1}}\;\underbracket{-LA^{T}_{\max}\sum_{j=1}^{N}y_{j}(1)A_{j}^{I}D_{(1)}}_{B_{1}} \\
&+\underbracket{\sum_{t=2}^{T}f\left(A^{T}_{\max}\right)\sum_{j=1}^{N}y_{j}(t)A_{j}^{I}D_{(t)}}_{A_{2}} \\
&+\underbracket{\sum_{t=1}^{T-1}L\frac{D_{\min}A_{\min}^{I}}{D_{\max}}\frac{1}{t}\left(\sum_{\tau=1}^{t}D_{(\tau)}\sum_{i=1}^{M}x_{i}(\tau)A_{i}^{T}\right)}_{C} \\
&\underbracket{-\sum_{t=1}^{T-1}L\frac{D_{\min}A_{\min}^{I}}{D_{\max}}\frac{1}{t}\sum_{\tau=1}^{t}D_{(\tau)}A^{T}_{\max}}_{B_{2}}
\end{aligned}
$$

For term $B_{1}$, we have $B_{1}\leq-\frac{LA^{T}_{\max}D^{2}_{\min}A_{\min}^{I}}{D_{\max}}$ by $\sum_{j=1}^{N}y_{j}(1)A_{j}^{I}D_{(1)}\geq D_{\min}A_{\min}^{I}\geq D_{\min}A_{\min}^{I}\frac{D_{\min}}{D_{\max}}$. For term $B_{2}$, we have $B_{2}\leq-(T-1)\frac{LA^{T}_{\max}D^{2}_{\min}A_{\min}^{I}}{D_{\max}}$ due to $\sum_{\tau=1}^{t}D_{(\tau)}\geq tD_{\min}$. Finally, $B_{1}+B_{2}\leq-T\frac{LA^{T}_{\max}D^{2}_{\min}A_{\min}^{I}}{D_{\max}}$.
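The bound on $B_{2}$ can be spelled out step by step. The sketch below substitutes $\sum_{\tau=1}^{t}D_{(\tau)}\geq tD_{\min}$ into the definition of $B_{2}$ and assumes $L\geq 0$ and nonnegative $A^{T}_{\max}$, $A_{\min}^{I}$, $D_{\min}$, so that the substitution can only decrease the magnitude of the subtracted term:

```latex
\begin{aligned}
B_{2} &= -\sum_{t=1}^{T-1} L\,\frac{D_{\min}A_{\min}^{I}}{D_{\max}}\,\frac{1}{t}\sum_{\tau=1}^{t}D_{(\tau)}A^{T}_{\max} \\
&\leq -\sum_{t=1}^{T-1} L\,\frac{D_{\min}A_{\min}^{I}}{D_{\max}}\,\frac{1}{t}\,\bigl(tD_{\min}\bigr)A^{T}_{\max} \\
&= -(T-1)\,\frac{LA^{T}_{\max}D_{\min}^{2}A_{\min}^{I}}{D_{\max}} .
\end{aligned}
```

The factor $\frac{1}{t}$ cancels the $t$ from the lower bound on the cumulative data, which leaves $T-1$ identical summands.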

For term $C$, by the increasing and concave property of $f$ (Assumption 1), and since $\frac{\sum_{\tau=1}^{t}D_{(\tau)}\sum_{i=1}^{M}x_{i}(\tau)A_{i}^{T}}{\sum_{\tau=1}^{t}D_{(\tau)}}\leq A^{T}_{\max}$ by the definition of $A^{T}_{\max}$ and Assumption 2, we have the following fact: $L\leq f^{\prime}\left(A^{T}_{\max}\right)\leq f^{\prime}\left(\frac{\sum_{\tau=1}^{t}D_{(\tau)}\sum_{i=1}^{M}x_{i}(\tau)A_{i}^{T}}{\sum_{\tau=1}^{t}D_{(\tau)}}\right)$. Combining this fact and $\frac{D_{\min}A_{\min}^{I}}{D_{\max}}\frac{1}{t}<\frac{\sum_{j=1}^{N}y_{j}(t+1)A_{j}^{I}D_{(t+1)}}{\sum_{\tau=1}^{t}D_{(\tau)}}$, we get:
Ct=1T1(f(0)+f(τ=1tD(τ)i=1Mxi(τ)AiTτ=1tD(τ))τ=1tD(τ)i=1Mxi(τ)AiTτ=1tD(τ))j=1Nyj(t+1)AjID(t+1)t=1T1f(0)j=1Nyj(t+1)AjID(t+1)𝐶superscriptsubscript𝑡1𝑇1𝑓0superscript𝑓superscriptsubscript𝜏1𝑡subscript𝐷𝜏superscriptsubscript𝑖1𝑀subscript𝑥𝑖𝜏superscriptsubscript𝐴𝑖𝑇superscriptsubscript𝜏1𝑡subscript𝐷𝜏superscriptsubscript𝜏1𝑡subscript𝐷𝜏superscriptsubscript𝑖1𝑀subscript𝑥𝑖𝜏superscriptsubscript𝐴𝑖𝑇superscriptsubscript𝜏1𝑡subscript𝐷𝜏superscriptsubscript𝑗1𝑁subscript𝑦𝑗𝑡1superscriptsubscript𝐴𝑗𝐼subscript𝐷𝑡1superscriptsubscript𝑡1𝑇1𝑓0superscriptsubscript𝑗1𝑁subscript𝑦𝑗𝑡1superscriptsubscript𝐴𝑗𝐼subscript𝐷𝑡1C\leq\sum_{t=1}^{T-1}\left(f(0)+f^{\prime}\left(\frac{\sum_{\tau=1}^{t}D_{(% \tau)}\sum_{i=1}^{M}x_{i}(\tau)A_{i}^{T}}{\sum_{\tau=1}^{t}D_{(\tau)}}\right)% \right.\\ \left.\frac{\sum_{\tau=1}^{t}D_{(\tau)}\sum_{i=1}^{M}x_{i}(\tau)A_{i}^{T}}{% \sum_{\tau=1}^{t}D_{(\tau)}}\vphantom{\frac{\sum_{\tau=1}^{t}D_{(\tau)}\sum_{i% =1}^{M}x_{i}(\tau)A_{i}^{T}}{\sum_{\tau=1}^{t}D_{(\tau)}}}\right)\sum_{j=1}^{N% }y_{j}(t+1)A_{j}^{I}D_{(t+1)}-\sum_{t=1}^{T-1}f(0)\sum_{j=1}^{N}y_{j}(t+1)A_{j% }^{I}D_{(t+1)}italic_C ≤ ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T - 1 end_POSTSUPERSCRIPT ( italic_f ( 0 ) + italic_f start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( divide start_ARG ∑ start_POSTSUBSCRIPT italic_τ = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT ( italic_τ ) end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_τ ) italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_τ = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT ( italic_τ ) end_POSTSUBSCRIPT end_ARG ) divide start_ARG ∑ start_POSTSUBSCRIPT italic_τ = 1 end_POSTSUBSCRIPT 
start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT ( italic_τ ) end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_τ ) italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_τ = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT ( italic_τ ) end_POSTSUBSCRIPT end_ARG ) ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_t + 1 ) italic_A start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT ( italic_t + 1 ) end_POSTSUBSCRIPT - ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T - 1 end_POSTSUPERSCRIPT italic_f ( 0 ) ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_t + 1 ) italic_A start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT ( italic_t + 1 ) end_POSTSUBSCRIPT.

Since $f$ is concave (Assumption 1), we have $f(0) \leq f(x) + (0 - x) f'(x)$, i.e., $f(0) + x f'(x) \leq f(x)$. Then,

$$\begin{aligned}
C &\leq \sum_{t=1}^{T-1} f\left(\frac{\sum_{\tau=1}^{t} D_{(\tau)} \sum_{i=1}^{M} x_i(\tau) A_i^T}{\sum_{\tau=1}^{t} D_{(\tau)}}\right) \sum_{j=1}^{N} y_j(t+1) A_j^I D_{(t+1)} - \sum_{t=1}^{T-1} f(0) \sum_{j=1}^{N} y_j(t+1) A_j^I D_{(t+1)} \\
&= P - \sum_{t=0}^{T-1} f(0) \sum_{j=1}^{N} y_j(t+1) A_j^I D_{(t+1)}.
\end{aligned}$$
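
The tangent-line inequality $f(0) + x f'(x) \leq f(x)$ used in this step can be checked numerically. The sketch below is only an illustration: the curve `f` is a hypothetical increasing, concave function chosen for the example, not the accuracy function of the paper.

```python
import math


def f(x: float) -> float:
    """Hypothetical accuracy curve (an assumption): increasing and concave for x >= 0."""
    return 1.0 - 0.5 * math.exp(-x)


def f_prime(x: float) -> float:
    """Derivative of the hypothetical curve above."""
    return 0.5 * math.exp(-x)


def tangent_gap(x: float) -> float:
    """Return f(x) - (f(0) + x * f'(x)), which is non-negative when f is concave."""
    return f(x) - (f(0.0) + x * f_prime(x))


# Evaluate the gap on a grid of x values in [0, 5].
gaps = [tangent_gap(0.1 * k) for k in range(51)]
```

For this choice of `f` the gap equals $0.5 - 0.5 e^{-x}(1 + x)$, which is non-negative because $e^{x} \geq 1 + x$.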

Based on all of the above analysis, we have:

$$\begin{aligned}
P^* \leq D^* &= B_1 + B_2 + C + A_1 + A_2 \\
&\leq -T\frac{L A^T_{\max} D^2_{\min} A^I_{\min}}{D_{\max}} + P - \sum_{t=0}^{T-1} f(0) \sum_{j=1}^{N} y_j(t+1) A_j^I D_{(t+1)} + \sum_{t=1}^{T} f(A^T_{\max}) \sum_{j=1}^{N} y_j(t) A_j^I D_{(t)} \\
&= -T\frac{L A^T_{\max} D^2_{\min} A^I_{\min}}{D_{\max}} + P + \frac{f(A^T_{\max}) - f(0)}{f(0)} \sum_{t=1}^{T} f(0) \sum_{j=1}^{N} y_j(t) A_j^I D_{(t)} \\
&\leq -T\frac{L A^T_{\max} D^2_{\min} A^I_{\min}}{D_{\max}} + P + \frac{f(A^T_{\max}) - f(0)}{f(0)} P \\
&= \frac{f(A^T_{\max})}{f(0)} P - \alpha T f(A^T_{\max}) A^I_{\max} D_{\max},
\end{aligned}$$

where $\alpha = \frac{L A^T_{\max} D^2_{\min} A^I_{\min}}{f(A^T_{\max}) D^2_{\max} A^I_{\max}}$. Further, from the fact that $P \leq P^* \leq T f(A^T_{\max}) A^I_{\max} D_{\max}$, we have: 1) $P^* \leq \frac{f(A^T_{\max})}{f(0)} P - \alpha P^*$, and 2) $P^* \leq \frac{f(A^T_{\max})}{f(0)} P - \alpha P$. This proves that the competitive ratio of ORRIC is $\frac{(1+\alpha) f(0)}{f(A^T_{\max})}$ or $\frac{1}{\frac{f(A^T_{\max})}{f(0)} - \alpha}$, where $\alpha = \frac{L A^T_{\max} D^2_{\min} A^I_{\min}}{f(A^T_{\max}) D^2_{\max} A^I_{\max}}$. ∎
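
To make the two closed-form ratios concrete, the following sketch evaluates $\alpha$ and both bounds. All numeric values, as well as the function and parameter names, are hypothetical and chosen only for illustration; they are not taken from the paper's experiments.

```python
def alpha(L: float, a_t_max: float, d_min: float, d_max: float,
          a_i_min: float, a_i_max: float, f_at_a_t_max: float) -> float:
    """alpha = L * A^T_max * D_min^2 * A^I_min / (f(A^T_max) * D_max^2 * A^I_max)."""
    return (L * a_t_max * d_min ** 2 * a_i_min) / (f_at_a_t_max * d_max ** 2 * a_i_max)


def competitive_ratios(f0: float, f_at_a_t_max: float, a: float) -> tuple[float, float]:
    """Return the two ratios P / P* implied by inequalities 1) and 2)."""
    if f_at_a_t_max / f0 <= a:
        raise ValueError("second ratio requires f(A^T_max) / f(0) > alpha")
    # 1) P* <= (f(A^T_max)/f(0)) P - alpha P*  =>  P >= (1 + alpha) f(0) / f(A^T_max) * P*
    ratio_1 = (1.0 + a) * f0 / f_at_a_t_max
    # 2) P* <= (f(A^T_max)/f(0)) P - alpha P   =>  P >= P* / (f(A^T_max)/f(0) - alpha)
    ratio_2 = 1.0 / (f_at_a_t_max / f0 - a)
    return ratio_1, ratio_2


# Hypothetical parameter values (assumptions for illustration only).
f0, f_max = 0.5, 0.8
a = alpha(L=0.05, a_t_max=1.0, d_min=50.0, d_max=100.0,
          a_i_min=0.6, a_i_max=0.9, f_at_a_t_max=f_max)
ratio_1, ratio_2 = competitive_ratios(f0, f_max, a)
```

Whenever $0 < \alpha < f(A^T_{\max})/f(0)$, both expressions are strictly larger than $f(0)/f(A^T_{\max})$, the value they reduce to at $\alpha = 0$.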

References

  • [1] S. Lin, Z. Zhou, Z. Zhang, X. Chen, and J. Zhang, Edge Intelligence in the Making: Optimization, Deep Learning, and Applications, ser. Synthesis Lectures on Learning, Networks, and Algorithms.   Morgan & Claypool Publishers, 2020.
  • [2] S. Niu, J. Wu, Y. Zhang, Y. Chen, S. Zheng, P. Zhao, and M. Tan, “Efficient test-time model adaptation without forgetting,” in International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, vol. 162.   PMLR, 2022, pp. 16 888–16 905.
  • [3] N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall, M. B. Gotway, and J. Liang, “Convolutional neural networks for medical image analysis: Full training or fine tuning?” IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1299–1312, 2016.
  • [4] J. Duan, S. Yu, H. L. Tan, H. Zhu, and C. Tan, “A survey of embodied AI: from simulators to research tasks,” IEEE Trans. Emerg. Top. Comput. Intell., vol. 6, no. 2, pp. 230–244, 2022.
  • [5] J. Liang, R. He, and T. Tan, “A comprehensive survey on test-time adaptation under distribution shifts,” CoRR, vol. abs/2303.15361, 2023.
  • [6] J. Lu, A. Liu, F. Dong, F. Gu, J. Gama, and G. Zhang, “Learning under concept drift: A review,” IEEE Trans. Knowl. Data Eng., vol. 31, no. 12, pp. 2346–2363, 2019.
  • [7] R. Wu, C. Guo, Y. Su, and K. Q. Weinberger, “Online adaptation to label distribution shift,” in Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, 2021, pp. 11 340–11 351.
  • [8] R. Fakoor, J. Mueller, Z. C. Lipton, P. Chaudhari, and A. J. Smola, “Data drift correction via time-varying importance weight estimator,” CoRR, vol. abs/2210.01422, 2022.
  • [9] R. Bhardwaj, Z. Xia, G. Ananthanarayanan, J. Jiang, Y. Shu, N. Karianakis, K. Hsieh, P. Bahl, and I. Stoica, “Ekya: Continuous learning of video analytics models on edge compute servers,” in 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022, Renton, WA, USA, April 4-6, 2022.   USENIX Association, 2022, pp. 119–135.
  • [10] M. Khani, G. Ananthanarayanan, K. Hsieh, J. Jiang, R. Netravali, Y. Shu, M. Alizadeh, and V. Bahl, “RECL: responsive resource-efficient continuous learning for video analytics,” in 20th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2023, Boston, MA, April 17-19, 2023.   USENIX Association, 2023, pp. 917–932.
  • [11] L. Wang, K. Lu, N. Zhang, X. Qu, J. Wang, J. Wan, G. Li, and J. Xiao, “Shoggoth: Towards efficient edge-cloud collaborative real-time video inference via adaptive online learning,” in 60th ACM/IEEE Design Automation Conference, DAC 2023, San Francisco, CA, USA, July 9-13, 2023.   IEEE, 2023, pp. 1–6.
  • [12] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding,” in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2016.
  • [13] R. Dong, Z. Tan, M. Wu, L. Zhang, and K. Ma, “Finding the task-optimal low-bit sub-distribution in deep neural networks,” in International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, vol. 162.   PMLR, 2022, pp. 5343–5359.
  • [14] L. Jia, Z. Zhou, F. Xu, and H. Jin, “Cost-efficient continuous edge learning for artificial intelligence of things,” IEEE Internet of Things Journal, vol. 9, no. 10, pp. 7325–7337, 2022.
  • [15] C. Zhao, F. Mi, X. Wu, K. Jiang, L. Khan, and F. Chen, “Adaptive fairness-aware online meta-learning for changing environments,” in KDD ’22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14 - 18, 2022, 2022, pp. 2565–2575.
  • [16] P. Yang, F. Lyu, W. Wu, N. Zhang, L. Yu, and X. S. Shen, “Edge coordinated query configuration for low-latency and accurate video analytics,” IEEE Trans. Ind. Informatics, vol. 16, no. 7, pp. 4855–4864, 2020.
  • [17] K. Zhao, Z. Zhou, X. Chen, R. Zhou, X. Zhang, S. Yu, and D. Wu, “Edgeadaptor: Online configuration adaption, model selection and resource provisioning for edge dnn inference serving at scale,” IEEE Transactions on Mobile Computing, pp. 1–16, 2022.
  • [18] E. Li, L. Zeng, Z. Zhou, and X. Chen, “Edge AI: on-demand accelerating deep neural network inference via edge computing,” IEEE Trans. Wirel. Commun., vol. 19, no. 1, pp. 447–457, 2020.
  • [19] M. K. Shirkoohi, P. Hamadanian, A. Nasr-Esfahany, and M. Alizadeh, “Real-time video inference on edge devices via adaptive model streaming,” in 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021.   IEEE, 2021, pp. 4552–4562.
  • [20] S. G. Patil, P. Jain, P. Dutta, I. Stoica, and J. Gonzalez, “POET: training neural networks on tiny devices with integrated rematerialization and paging,” in International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, vol. 162.   PMLR, 2022, pp. 17 573–17 583.
  • [21] D. Xu, M. Xu, Q. Wang, S. Wang, Y. Ma, K. Huang, G. Huang, X. Jin, and X. Liu, “Mandheling: mixed-precision on-device DNN training with DSP offloading,” in ACM MobiCom ’22: The 28th Annual International Conference on Mobile Computing and Networking, Sydney, NSW, Australia, October 17 - 21, 2022.   ACM, 2022, pp. 214–227.
  • [22] C. Lv, C. Niu, R. Gu, X. Jiang, Z. Wang, B. Liu, Z. Wu, Q. Yao, C. Huang, P. Huang, T. Huang, H. Shu, J. Song, B. Zou, P. Lan, G. Xu, F. Wu, S. Tang, F. Wu, and G. Chen, “Walle: An End-to-End, General-Purpose, and Large-Scale production system for Device-Cloud collaborative machine learning,” in 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22).   Carlsbad, CA: USENIX Association, Jul. 2022, pp. 249–265.
  • [23] J. J. Moon, P. Kapoor, J. Lee, M. Ham, and H. S. Lee, “Nntrainer: Light-weight on-device training framework,” CoRR, vol. abs/2206.04688, 2022.
  • [24] M. Abadi, “Tensorflow lite,” 2023, https://www.tensorflow.org/lite [Accessed: (Jul. 28, 2023)].
  • [25] A. Paszke, “Pytorch mobile,” 2023, https://pytorch.org/mobile/ [Accessed: (Jul. 28, 2023)].
  • [26] T. Domhan, J. T. Springenberg, and F. Hutter, “Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves,” in Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015.   AAAI Press, 2015, pp. 3460–3468.
  • [27] T. Lattimore and A. György, “Improved regret for zeroth-order stochastic convex bandits,” in Conference on Learning Theory, COLT 2021, 15-19 August 2021, Boulder, Colorado, USA, ser. Proceedings of Machine Learning Research, vol. 134.   PMLR, 2021, pp. 2938–2964.
  • [28] P. Zhao, G. Wang, L. Zhang, and Z. Zhou, “Bandit convex optimization in non-stationary environments,” J. Mach. Learn. Res., vol. 22, pp. 125:1–125:45, 2021.
  • [29] A. Gupta, R. Krishnaswamy, and K. Pruhs, “Online primal-dual for non-linear optimization with applications to speed scaling,” in Approximation and Online Algorithms - 10th International Workshop, WAOA 2012, Ljubljana, Slovenia, September 13-14, 2012, Revised Selected Papers, vol. 7846.   Springer, 2012, pp. 173–186.
  • [30] A. Simonetto, E. Dall’Anese, S. Paternain, G. Leus, and G. B. Giannakis, “Time-varying convex optimization: Time-structured algorithms and applications,” Proc. IEEE, vol. 108, no. 11, pp. 2032–2048, 2020.
  • [31] D. Hendrycks and T. G. Dietterich, “Benchmarking neural network robustness to common corruptions and perturbations,” in 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.   OpenReview.net, 2019.
  • [32] Q. Wang, O. Fink, L. Van Gool, and D. Dai, “Continual test-time domain adaptation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 7201–7211.