License: CC BY 4.0
arXiv:2312.07858v1 [eess.SY] 13 Dec 2023

Non-myopic Beam Scheduling for Multiple Smart Target Tracking in Phased Array Radar Network

Yuhang Hao, Zengfu Wang, José Niño-Mora, **g Fu, Min Yang, and Quan Pan Yuhang Hao is with the School of Automation, Northwestern Polytechnical University, and the Key Laboratory of Information Fusion Technology, Ministry of Education, Xi’an, Shaanxi, 710072, China. Part of this work is performed during his research stay in the Department of Statistics at Carlos III University of Madrid, Spain. Zengfu Wang, Min Yang, Quan Pan are with the School of Automation, Northwestern Polytechnical University, and the Key Laboratory of Information Fusion Technology, Ministry of Education, Xi’an, Shaanxi, 710072, China. José Niño-Mora is with the Department of Statistics, Carlos III University of Madrid, Getafe, Madrid, 28903, Spain. **g Fu is with School of Engineering, RMIT University, Melbourne, VIC, 3000, Australia. E-mail: ([email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]). (Corresponding author: Zengfu Wang.) This work was in part supported by the National Natural Science Foundation of China (grant no. U21B2008, 62233014). The work of J. Niño-Mora was funded in part by Spain’s State Research Agency (AEI) project PID2019-109196GB-I00/AEI/10.13039/501100011033.
Abstract

A smart target, also referred to as a reactive target, can take maneuvering motions to hinder radar tracking. We address beam scheduling for tracking multiple smart targets in phased array radar networks. We aim to mitigate the performance degradation in previous myopic tracking methods and enhance the system performance, which is measured by a discounted cost objective related to the tracking error covariance (TEC) of the targets. The scheduling problem is formulated as a restless multi-armed bandit problem (RMABP) with state variables, following the Markov decision process. In particular, the problem consists of parallel bandit processes. Each bandit process is associated with a target and evolves with different transition rules for different actions, i.e., either the target is tracked or not. We propose a non-myopic, scalable policy based on Whittle indices for selecting the targets to be tracked at each time. The proposed policy has a linear computational complexity in the number of targets and the truncated time horizon in the index computation, and is hence applicable to large networks with a realistic number of targets. We present numerical evidence that the model satisfies sufficient conditions for indexability (existence of the Whittle index) based upon partial conservation laws, and, through extensive simulations, we validate the effectiveness of the proposed policy in different scenarios.

Index Terms:
Target tracking, beam scheduling, restless bandits, Whittle index.

I Introduction

State-of-the-art sensor scheduling approaches have enabled higher flexibility for tracking smart targets through phased array radar networks, where real-time beam direction control is handled electronically [1, 2, 3]. Dynamic tracking of smart targets, which have the ability to be aware that they are being tracked and adapt their dynamics accordingly to hinder tracking accuracy, has drawn recent research attention [4, 5, 6]. Hence, beam scheduling in radar networks with a trade-off between frequency of tracking and probability of maneuvering is complicated and of importance in smart-target tracking. In this paper, we address a model to minimize the expected total discounted error for tracking smart targets through dynamic beam scheduling [7, 8].

Resource scheduling in radar systems has been mostly addressed through myopic policies. In [9], the challenge of antenna selection within a distributed multi-radar system was formulated as a knapsack problem, with the Cramér–Rao lower bound (CRLB) being regarded as the objective function. By exploring the multi-start knapsack tree, a local search algorithm with a multi-start strategy was adapted to solve the problem. In [10, 11, 12, 13, 14, 15, 16, 17, 18], at each time, once obtaining the estimation of the target dynamic states, the predicted one-step posterior CRLB (PCRLB) is calculated and used as the optimization metric to allocate resources for the next time step. However, in general, myopic scheduling policies exhibit inevitable performance degradation in the long run [19].

Non-myopic scheduling policies based on predicted multi-step objective functions or on partially observed Markov decision processes (POMDPs) [20] have been investigated in a wide range of settings. In [19], a bee colony algorithm with particle swarm optimization (PSO) was introduced to optimize a multi-step objective function representing operational risk. In [21, 22], the joint multi-target probability density (JMPD) was recursively estimated by particle filtering methods, and the maximum expected Rényi divergence between the JMPD and the JMPD recalculated with a new derived measurement was considered for long-term performance optimization. In [21], two information-directed approaches were presented to approximate the long-term effects of each action. One is a path searching approach, aiming to reduce the computational complexity of a full Monte Carlo search. The other one approximates an action’s long-term benefit through a distinct function, expressed in relation to “opportunity cost” or “regret”. In [22], a non-myopic sensor scheduling method leveraging the POMDP framework was utilized to improve long-term performance. In [23], a smart target tracking problem was modeled as a POMDP using the PCRLB as system state and multi-step cost prediction over unscented sampling. An improved decision tree search algorithm leveraging branch and bound was used to achieve non-myopic scheduling optimization for maneuvering target tracking. Similarly as in [23], a branch-and-bound algorithm was considered in [24] for sensor scheduling to reduce the interception probability. In [25, 26], a POMDP-based branch-and-bound algorithm with worst-case exponential complexity was applied to a sensor allocation problem for reducing the risks of the sensor radiation interception and target threat level, which provided a non-myopic scheme with improved long-term performance.

Since computing optimal dynamic beam scheduling policies for POMDP multi-target tracking models is generally intractable and has worst-case exponential complexity, researchers have instead aimed to design suboptimal heuristic index policies with low computational complexity. An index policy is based on defining an index for each target, a scalar function of its state, and giving higher tracking priority at each time to targets with larger index values. A simple greedy index for tracking a target moving in one dimension is its tracking error variance (TEV), which is updated via scalar Kalman filter dynamics [27].

A versatile model for addressing optimal resource management problems and index policies are provided by the restless multi-armed bandit problem (RMABP) [28], which is an extension to the multi-armed bandit problem (MABP) [29, 30, 31]. The RMABP focuses on selecting a maximum of M𝑀Mitalic_M from a pool of NM𝑁𝑀N\geq Mitalic_N ≥ italic_M projects (or bandits), which are versatile entities capable of being either active (i.e., chosen) or passive at each time. Different from MABP, in an RMABP, a bandit is able to change state while passive.

While solving optimally the RMABP is generally intractable [32], Whittle proposed in [28] a heuristic index policy, which has since been widely applied [33], where the index of a project depends only on its parameters. The Whittle index is implicitly defined as the state-dependent critical subsidy for passivity that makes both actions optimal in a single-project subproblem. Yet, neither existence nor uniqueness of such a critical subsidy is guaranteed, so the Whittle index is well defined only for the class of indexable projects. Whittle conjectured that, when all projects are indexable, this index policy should approach optimality asymptotically under the average criterion as M𝑀Mitalic_M and N𝑁Nitalic_N grow to infinity in a fixed ratio. [34] showed this not to be generally true, but gave sufficient conditions under which the conjecture holds.

Establishing indexability of a restless bandit model is widely considered a challenging task, which hinders the application of Whittle’s index policy. The currently dominant approach for proving indexability, mostly applied to one-dimensional state models, entails a two-step process: first, optimality of threshold policies is proven for single-project subproblems; and, second, the Whittle index is obtained from the optimal threshold under an appropriate monotonicity condition on the latter. Yet, some models have not yielded to such an approach, in particular restless bandit models for target tracking with Kalman filter dynamics, as considered here.

The RMABP formulation of multi-target tracking was first considered in [35], extending work in [27]. Given the lack of effective tools to prove indexability in such a setting, [35] addressed the greedy index policy and focused on identifying sufficient conditions for its optimality in scenarios with two symmetric scalar-state targets. [36] further investigated that model using the prevailing approach to indexability, encountering that it posed seemingly unsurmountable difficulties, so they assumed the optimality of threshold policies without proving it. The Whittle index policy has been studied via simulation to achieve near-optimality in related past studies [37, 38, 39].

Niño-Mora [40, 41, 42, 43] developed an alternative method to prove indexability and assess the Whittle index, circumventing the need to establish the optimality of threshold policies. It sufficed to show that certain project performance metrics under such policies satisfy PCL-indexability conditions —named after satisfaction of partial conservation laws (PCLs) [40, 44]. A verification theorem guarantees the optimality of threshold policies as well as the indexability of the model and its Whittle index was given by an explicitly defined marginal productivity (MP) index. This was proven in [40, 41, 42, 43] in increasingly general settings.

The PCL-indexability conditions for real-state restless bandits under the discounted criterion, which we shall apply here, were first outlined without proof in [45], and their application to multi-target tracking was numerically explored in [46, 47]. The verification theorem was proven in [48, 43]. [49] used such conditions to first establish the indexability of scalar-state Kalman-filter restless bandits.

In the more practically relevant case of multi-target tracking RMABP models with multi-dimensional state Kalman filter dynamics, indexability is currently an open problem. A heuristic approach was developed in [50], where scalar-state project approximations were considered by taking as system state the trace of the channel estimation mean square error (MSE).

For smart targets, sensor scheduling problems become substantially more complex, as such targets react to radar sensing by switching their dynamics between different modes, e.g., a constant velocity (CV) mode, a constant acceleration (CA) mode, a constant turn (CT) mode, etc. [4, 5, 6, 51]. In such scenarios, it might be beneficial to refrain from sensing too often those targets that are likely to hide or escape when tracked, as this would make them harder to track in the future. Therefore, proactive measures are necessary to effectively track smart targets. In [4], when a smart target is viewed by a sensor, it reacts by “hiding” itself. The tracking problem was formulated as a Markov decision process (MDP) with an infinite time horizon, which was addressed through a two-stage reinforcement-learning approach with separated optimizations for detection and tracking. In [5], a modified algebraic Riccati equation (MARE) was applied in a Kalman-based target tracking system with smart targets. This work considered game-theoretic methods that aimed to optimize waveform parameters and radar modes under imperfect measurement information, which was caused by the variation of the detection probability under target maneuvering. In [6], a POMDP model with multiple dynamics models was considered to minimize the interception risk during the tracking of multiple reactive targets. A roll-out method based on unscented transformation sampling (UTS) was introduced to approximate the long-term reward and to select sensors based on the closest distance policy. In [38], the problem of hunting hiding targets was formulated as an RMABP, where the state of a target is its posterior probability of being exposed. Experimental results were presented showing that the Whittle index policy outperforms greedy policies.

Hence, substantial research challenges remain in sensor scheduling for tracking smart targets, including: 1) Solving optimally Markov decision process (MDP) models with multiple smart targets entails a prohibitively high computational cost; 2) In models with multi-dimensional tracking error covariance (TEC) state, the application of the Whittle index policy is at present elusive; and 3) Increasing radar radiation for enhancing tracking performance needs to be optimally traded off with the potential of targets to react with evasive maneuvers in response to radiation detection.

In this paper, we consider a beam scheduling problem for tracking multiple smart targets that have a high probability of switching their dynamics when observed by phased array radars so as to hinder their tracking. Otherwise, targets have a high probability of selecting a CV dynamics model under non-observation. Similar to the interacting multiple models (IMM) method, the state of each target is defined as its sum-weighted TEC based on dynamics model probability vectors, which are assumed to be constant. To formulate the beam scheduling problem, we consider an infinite-horizon discounted MDP model of RMABP type, where each target corresponds to a restless project. The RMABP formulation is exploited to propose a non-myopic low-complexity scheduling policy. In particular, we aim to apply the Whittle index policy to the dynamic beam scheduling for tracking smart targets. Yet, such a goal is hindered by the fact that no proof of indexability (existence of the Whittle index) is at present available for this model. We partially circumvent this difficulty by presenting some numerical evidence supporting the conjecture that the model satisfies the aforementioned PCL-indexability conditions, and use the MP index policy resulting from them. Recall that, if it were proven that the model satisfied such conditions, this would imply that the model is indexable and the MP index would be its Whittle index. In addition, a different MP-based index policy is proposed in the multi-dimensional TEC state case. Through extensive simulation results, this index policy is shown to outperform baseline policies. The contributions of this paper are summarized as follows.

  • We formulate the beam scheduling problem for tracking multiple smart targets as an infinite-horizon discounted RMABP. The smart targets are characterized by tracking-action-dependent dynamics models with high maneuvering, which aim to hide themselves when being tracked. Each target is associated with a restless project, whose state is its sum-weighted TEC state based on the multiple-dynamic-model probability vectors.

  • We aim to minimize the multi-target tracking error measured by the sum of the TECs. To mitigate the maneuvering of the smart targets, the system should avoid overuse of tracking action. In the real-state case, we apply the Whittle index policy (using the MP index in the belief that it coincides with the Whittle index ) to this beam scheduling problem.

  • The Whittle index policy is analyzed in the real-valued state model, while in the multi-dimensional TEC state case, a different MP index policy is considered. The TEC states evolve through the Kalman filter. For the real-valued case, we present numerical evidence that the sufficient indexability conditions based on PCLs are satisfied, which is also used to evaluate the index. Furthermore, we provide the results of a simulation study assessing the sub-optimality gap of the Whittle index policy. The effectiveness of the proposed MP index policy for the multi-dimensional target state case is also assessed by a simulation study, where it is shown to outperform baseline policies.

The remainder of this paper is organized as follows. In Section II, the target dynamics models and measurement models are defined. Then, the state update under different actions and the discounted long-term objective of the scheduling problem are formulated. In Section III, the application of the Whittle index policy is discussed. The beam selection scheme is developed in the RMABP model based on MP indices for real-valued and multi-dimensional state cases. The computational method of the lower bound of optimization functions and the computational complexity of policies are presented. In Section IV, real-valued and multi-dimensional state cases are considered, and reckless and cautious targets with different dynamics model probabilities are also considered. The simulation results demonstrate the superiority of the proposed index policies. Section V concludes the paper.

II Model description and problem formulation

We update the target states via the Kalman filter and exploit the multiple dynamics model and corresponding model probabilities to represent the motion characteristics of smart targets. Subsequently, we formulate the beam scheduling problem as an RMABP, of which each bandit process is associated with a target.

II-A Target dynamics models

We consider N𝑁Nitalic_N smart targets labeled by n=1,2,,N𝑛12𝑁n=1,2,\ldots,Nitalic_n = 1 , 2 , … , italic_N and a radar network consisting of K<N𝐾𝑁K<Nitalic_K < italic_N phased array radars, as illustrated in Fig. 1. In Fig. 1, targets 1 and N𝑁Nitalic_N will likely change their dynamics models in response to radar tracking, while target 2 will be likely to maintain its dynamics model. All radars are synchronized to operate over time slots t=0,1,𝑡01t=0,1,\ldotsitalic_t = 0 , 1 , …. The radar network centrally steers beams to track at most K𝐾Kitalic_K targets at each time, as one radar beam can be only assigned to one target in a time slot.

Refer to caption
Figure 1: System model of a phased array radar network.

We assume that radars detect targets with unit probability. Denote the kinematic state of target n𝑛nitalic_n at time t𝑡titalic_t by 𝒙n,tLsubscript𝒙𝑛𝑡superscript𝐿\bm{x}_{n,t}\in\mathbb{R}^{L}bold_italic_x start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT. Targets follow independent linear dynamics. Meanwhile, they are smart, that is, they react to radar radiation by switching between M𝑀Mitalic_M dynamics models, labeled by m=1,2,,M𝑚12𝑀m=1,2,\ldots,Mitalic_m = 1 , 2 , … , italic_M, e.g., CV, CA, CT, etc. Target n𝑛nitalic_n’s dynamics under model m𝑚mitalic_m is

𝒙n,t=𝑭nm𝒙n,t1+𝒘n,tm.subscript𝒙𝑛𝑡superscriptsubscript𝑭𝑛𝑚subscript𝒙𝑛𝑡1superscriptsubscript𝒘𝑛𝑡𝑚\bm{x}_{n,t}=\bm{F}_{n}^{m}\bm{x}_{n,t-1}+\bm{w}_{n,t}^{m}.bold_italic_x start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT = bold_italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT bold_italic_x start_POSTSUBSCRIPT italic_n , italic_t - 1 end_POSTSUBSCRIPT + bold_italic_w start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT . (1)

In (1), 𝑭nmsuperscriptsubscript𝑭𝑛𝑚\bm{F}_{n}^{m}bold_italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT is the state transition matrix, and 𝒘n,tmsuperscriptsubscript𝒘𝑛𝑡𝑚\bm{w}_{n,t}^{m}bold_italic_w start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT is process noise with mean being zero and covariance matrix being 𝐐nmsuperscriptsubscript𝐐𝑛𝑚\bm{\mathrm{Q}}_{n}^{m}bold_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT. We assume that 𝐐nm=qnm𝐐nsuperscriptsubscript𝐐𝑛𝑚superscriptsubscript𝑞𝑛𝑚subscript𝐐𝑛\bm{\mathrm{Q}}_{n}^{m}=q_{n}^{m}\bm{\mathrm{Q}}_{n}bold_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT = italic_q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT bold_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT for a covariance matrix 𝐐nsubscript𝐐𝑛\bm{\mathrm{Q}}_{n}bold_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT, where 0<qn1<qn2<<qnM0superscriptsubscript𝑞𝑛1superscriptsubscript𝑞𝑛2superscriptsubscript𝑞𝑛𝑀0<q_{n}^{1}<q_{n}^{2}<\cdots<q_{n}^{M}0 < italic_q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT < italic_q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT < ⋯ < italic_q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT and m=1𝑚1m=1italic_m = 1 corresponds to the CV model, so the uncertainty of the dynamics model increases with m𝑚mitalic_m.

At each time t𝑡titalic_t, we use a binary variable an,t{0,1}subscript𝑎𝑛𝑡01a_{n,t}\in\left\{0,1\right\}italic_a start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT ∈ { 0 , 1 } to denote the tracking action imposed on target n𝑛nitalic_n. Let an,t=0subscript𝑎𝑛𝑡0a_{n,t}=0italic_a start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT = 0 if target n𝑛nitalic_n is not tracked. Conversely, when an,t=1subscript𝑎𝑛𝑡1a_{n,t}=1italic_a start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT = 1, the target is actively tracked, and a measurement 𝒚n,tsubscript𝒚𝑛𝑡\bm{y}_{n,t}bold_italic_y start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT is acquired following the subsequent measurement equation,

𝒚n,t=𝐇n𝒙n,t+𝒗n,t,subscript𝒚𝑛𝑡subscript𝐇𝑛subscript𝒙𝑛𝑡subscript𝒗𝑛𝑡\bm{y}_{n,t}=\bm{\mathrm{H}}_{n}\bm{{x}}_{n,t}+\bm{v}_{n,t},bold_italic_y start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT = bold_H start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT bold_italic_x start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT + bold_italic_v start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT , (2)

where 𝐇nsubscript𝐇𝑛\bm{\mathrm{H}}_{n}bold_H start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT is the measurement matrix, and 𝒗n,tsubscript𝒗𝑛𝑡\bm{v}_{n,t}bold_italic_v start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT is a zero-mean Gaussian white noise with covariance matrix 𝐑nsubscript𝐑𝑛\bm{\mathrm{R}}_{n}bold_R start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT.

II-B Probabilities of dynamics models

Targets are assumed to change their dynamics models depending only on whether they are being tracked. After action ansubscript𝑎𝑛a_{n}italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT is applied on target n𝑛nitalic_n at each time, its dynamics model is randomly drawn to be m𝑚mitalic_m with probability unan,msuperscriptsubscript𝑢𝑛subscript𝑎𝑛𝑚u_{n}^{a_{n},m}italic_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_m end_POSTSUPERSCRIPT, for m=1,,M𝑚1𝑀m=1,\ldots,Mitalic_m = 1 , … , italic_M. We write 𝐔n[𝒖n0,𝒖n1]subscript𝐔𝑛superscriptsubscript𝒖𝑛0superscriptsubscript𝒖𝑛1\bm{\mathrm{U}}_{n}\triangleq\left[\bm{u}_{n}^{0},\bm{u}_{n}^{1}\right]bold_U start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ≜ [ bold_italic_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , bold_italic_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT ], where 𝒖nan=[unan,1,,unan,M]superscriptsubscript𝒖𝑛subscript𝑎𝑛superscriptsuperscriptsubscript𝑢𝑛subscript𝑎𝑛1superscriptsubscript𝑢𝑛subscript𝑎𝑛𝑀\bm{u}_{n}^{a_{n}}=\left[u_{n}^{a_{n},1},\ldots,u_{n}^{a_{n},M}\right]^{\prime}bold_italic_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUPERSCRIPT = [ italic_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , 1 end_POSTSUPERSCRIPT , … , italic_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_M end_POSTSUPERSCRIPT ] start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT and ()superscript(\cdot)^{\prime}( ⋅ ) start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT is the transpose operator. We assume that un0,1>un0,msuperscriptsubscript𝑢𝑛01superscriptsubscript𝑢𝑛0𝑚u_{n}^{0,1}>u_{n}^{0,m}italic_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 , 1 end_POSTSUPERSCRIPT > italic_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 , italic_m end_POSTSUPERSCRIPT, m=2,3,,M𝑚23𝑀m=2,3,\ldots,Mitalic_m = 2 , 3 , … , italic_M, so targets that are not being tracked are more likely to move into model m=1𝑚1m=1italic_m = 1 (CV). We further assume that un1,1<un1,msuperscriptsubscript𝑢𝑛11superscriptsubscript𝑢𝑛1𝑚u_{n}^{1,1}<u_{n}^{1,m}italic_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 , 1 end_POSTSUPERSCRIPT < italic_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 , italic_m end_POSTSUPERSCRIPT for m=2,3,,M𝑚23𝑀m=2,3,\ldots,Mitalic_m = 2 , 3 , … , italic_M, so targets that are being tracked are less likely to move into model m=1𝑚1m=1italic_m = 1.

Consequently, the smart targets are characterized by the dynamics model probability matrix, and the states 𝒙𝒙\bm{x}bold_italic_x of the targets are affected not only by the CV model but also by the other dynamics models.

II-C Kalman filter-based TEC update

Target TECs evolve based on the Kalman filter [52]. Denote by 𝐏n,tsubscript𝐏𝑛𝑡{\mathbf{P}}_{n,t}bold_P start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT the TEC of target n𝑛nitalic_n at the beginning of time slot t𝑡titalic_t. Starting from the initial TEC 𝐏n,0subscript𝐏𝑛0{\mathbf{P}}_{n,0}bold_P start_POSTSUBSCRIPT italic_n , 0 end_POSTSUBSCRIPT, 𝐏n,tsubscript𝐏𝑛𝑡{\mathbf{P}}_{n,t}bold_P start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT is recursively updated over time slots t1𝑡1t\geq 1italic_t ≥ 1 being conditioned by the chosen actions, taking into account the dynamics models, as follows.

If the target is tracked at time t1𝑡1t-1italic_t - 1 (an,t1=1subscript𝑎𝑛𝑡11a_{n,t-1}=1italic_a start_POSTSUBSCRIPT italic_n , italic_t - 1 end_POSTSUBSCRIPT = 1), then

𝐏n,t=ϕn1(𝐏n,t1)=m=1Mun1,m𝐏^n,tm,subscript𝐏𝑛𝑡superscriptsubscriptitalic-ϕ𝑛1subscript𝐏𝑛𝑡1superscriptsubscript𝑚1𝑀superscriptsubscript𝑢𝑛1𝑚superscriptsubscript^𝐏𝑛𝑡𝑚\displaystyle{\mathbf{P}}_{n,t}=\mathrm{\phi}_{n}^{1}\left({\mathbf{P}}_{n,t-1% }\right)=\sum_{m=1}^{M}u_{n}^{1,m}\hat{\mathbf{P}}_{n,t}^{m},bold_P start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT = italic_ϕ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT ( bold_P start_POSTSUBSCRIPT italic_n , italic_t - 1 end_POSTSUBSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT italic_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 , italic_m end_POSTSUPERSCRIPT over^ start_ARG bold_P end_ARG start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT , (3)

where

𝐏^n,tm=ϕn1,m(𝐏n,t1)=(𝐈𝐊n,tm𝐇n)𝐏¯n,t|t1m,superscriptsubscript^𝐏𝑛𝑡𝑚superscriptsubscriptitalic-ϕ𝑛1𝑚subscript𝐏𝑛𝑡1𝐈superscriptsubscript𝐊𝑛𝑡𝑚subscript𝐇𝑛superscriptsubscript¯𝐏𝑛conditional𝑡𝑡1𝑚\hat{\mathbf{P}}_{n,t}^{m}=\mathrm{\phi}_{n}^{1,m}\left({\mathbf{P}}_{n,t-1}% \right)=\left(\bm{\mathrm{I}}-\bm{\mathrm{K}}_{n,t}^{m}\bm{\mathrm{H}}_{n}% \right)\bar{\mathbf{P}}_{n,t|t-1}^{m},over^ start_ARG bold_P end_ARG start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT = italic_ϕ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 , italic_m end_POSTSUPERSCRIPT ( bold_P start_POSTSUBSCRIPT italic_n , italic_t - 1 end_POSTSUBSCRIPT ) = ( bold_I - bold_K start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT bold_H start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) over¯ start_ARG bold_P end_ARG start_POSTSUBSCRIPT italic_n , italic_t | italic_t - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT , (4)
𝐏¯n,t|t1m=𝑭nm𝐏n,t1(𝑭nm)+𝐐nm,superscriptsubscript¯𝐏𝑛conditional𝑡𝑡1𝑚superscriptsubscript𝑭𝑛𝑚subscript𝐏𝑛𝑡1superscriptsuperscriptsubscript𝑭𝑛𝑚superscriptsubscript𝐐𝑛𝑚\bar{\mathbf{P}}_{n,t|t-1}^{m}=\bm{F}_{n}^{m}{\mathbf{P}}_{n,t-1}\left(\bm{F}_% {n}^{m}\right)^{\prime}+\bm{\mathrm{Q}}_{n}^{m},over¯ start_ARG bold_P end_ARG start_POSTSUBSCRIPT italic_n , italic_t | italic_t - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT = bold_italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT bold_P start_POSTSUBSCRIPT italic_n , italic_t - 1 end_POSTSUBSCRIPT ( bold_italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT + bold_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT , (5)
𝐊n,tm=𝐏¯n,t|t1m𝐇n(𝐇n𝐏¯n,t|t1m𝐇n+𝐑n)1,superscriptsubscript𝐊𝑛𝑡𝑚subscriptsuperscript¯𝐏𝑚𝑛conditional𝑡𝑡1subscriptsuperscript𝐇𝑛superscriptsubscript𝐇𝑛superscriptsubscript¯𝐏𝑛conditional𝑡𝑡1𝑚subscriptsuperscript𝐇𝑛subscript𝐑𝑛1\bm{\mathrm{K}}_{n,t}^{m}=\bar{\mathbf{P}}^{m}_{n,t|t-1}\bm{\mathrm{H}}^{% \prime}_{n}\left(\bm{\mathrm{H}}_{n}\bar{\mathbf{P}}_{n,t|t-1}^{m}\bm{\mathrm{% H}}^{\prime}_{n}+\bm{\mathrm{R}}_{n}\right)^{-1},bold_K start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT = over¯ start_ARG bold_P end_ARG start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t | italic_t - 1 end_POSTSUBSCRIPT bold_H start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_H start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT over¯ start_ARG bold_P end_ARG start_POSTSUBSCRIPT italic_n , italic_t | italic_t - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT bold_H start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT + bold_R start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT , (6)

and 𝐈𝐈\bm{\mathrm{I}}bold_I is the L×L𝐿𝐿L\times Litalic_L × italic_L identity matrix.

If the target is not tracked at time t1𝑡1t-1italic_t - 1 (an,t1=0subscript𝑎𝑛𝑡10a_{n,t-1}=0italic_a start_POSTSUBSCRIPT italic_n , italic_t - 1 end_POSTSUBSCRIPT = 0), then

𝐏n,t=ϕn0(𝐏n,t1)=m=1Mun0,m𝐏¯n,t|t1m.subscript𝐏𝑛𝑡superscriptsubscriptitalic-ϕ𝑛0subscript𝐏𝑛𝑡1superscriptsubscript𝑚1𝑀superscriptsubscript𝑢𝑛0𝑚superscriptsubscript¯𝐏𝑛conditional𝑡𝑡1𝑚\displaystyle{\mathbf{P}}_{n,t}=\mathrm{\phi}_{n}^{0}\left({\mathbf{P}}_{n,t-1% }\right)=\sum_{m=1}^{M}u_{n}^{0,m}\bar{\mathbf{P}}_{n,t|t-1}^{m}.bold_P start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT = italic_ϕ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ( bold_P start_POSTSUBSCRIPT italic_n , italic_t - 1 end_POSTSUBSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT italic_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 , italic_m end_POSTSUPERSCRIPT over¯ start_ARG bold_P end_ARG start_POSTSUBSCRIPT italic_n , italic_t | italic_t - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT . (7)

Note that the TEC update recursion given by (3)–(7) is deterministic. It follows that action an,t1subscript𝑎𝑛𝑡1a_{n,t-1}italic_a start_POSTSUBSCRIPT italic_n , italic_t - 1 end_POSTSUBSCRIPT updates the TEC 𝐏n,tsubscript𝐏𝑛𝑡{\bm{\mathrm{P}}}_{n,t}bold_P start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT through sum-weighted method based on the model probability vector 𝒖nan,t1superscriptsubscript𝒖𝑛subscript𝑎𝑛𝑡1\bm{u}_{n}^{a_{n,t-1}}bold_italic_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_n , italic_t - 1 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT. When an,t1=1subscript𝑎𝑛𝑡11a_{n,t-1}=1italic_a start_POSTSUBSCRIPT italic_n , italic_t - 1 end_POSTSUBSCRIPT = 1, the target n𝑛nitalic_n maneuvers with higher probability un1,msuperscriptsubscript𝑢𝑛1𝑚u_{n}^{1,m}italic_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 , italic_m end_POSTSUPERSCRIPT, m=2,3,,M𝑚23𝑀m=2,3,\ldots,Mitalic_m = 2 , 3 , … , italic_M than CV with lower probability un1,1superscriptsubscript𝑢𝑛11u_{n}^{1,1}italic_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 , 1 end_POSTSUPERSCRIPT. It leads to a larger determinant of TEC state 𝐏n,tsubscript𝐏𝑛𝑡{\bm{\mathrm{P}}}_{n,t}bold_P start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT than that of 𝐏^n,t1,1superscriptsubscript^𝐏𝑛𝑡11\hat{\bm{\mathrm{P}}}_{n,t}^{1,1}over^ start_ARG bold_P end_ARG start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 , 1 end_POSTSUPERSCRIPT solely calculated by CV model. If an,t1=0subscript𝑎𝑛𝑡10a_{n,t-1}=0italic_a start_POSTSUBSCRIPT italic_n , italic_t - 1 end_POSTSUBSCRIPT = 0, then, since qn1<qnmsuperscriptsubscript𝑞𝑛1superscriptsubscript𝑞𝑛𝑚q_{n}^{1}<q_{n}^{m}italic_q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT < italic_q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT, m=2,3,,M𝑚23𝑀m=2,3,\ldots,Mitalic_m = 2 , 3 , … , italic_M, with larger un0,1superscriptsubscript𝑢𝑛01u_{n}^{0,1}italic_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 , 1 end_POSTSUPERSCRIPT in (7), the determinant of TEC state 𝐏n,tsubscript𝐏𝑛𝑡{\bm{\mathrm{P}}}_{n,t}bold_P start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT will be smaller than that with probability 𝒖n1superscriptsubscript𝒖𝑛1\bm{u}_{n}^{1}bold_italic_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT.

II-D RMABP model formulation

We next formulate the optimal beam scheduling problem for K𝐾Kitalic_K radars and N𝑁Nitalic_N smart targets as an infinite-horizon discrete-time RMABP, aiming to achieve the optimal dynamic selection of K𝐾Kitalic_K out of N𝑁Nitalic_N binary-action projects. For such a purpose, we identify a target with a project, which can be operated under the active action (track the target) and the passive action (do not track it). We take as the state of project n=1,,N𝑛1𝑁n=1,\ldots,Nitalic_n = 1 , … , italic_N the corresponding target’s TEC, 𝐏n,tsubscript𝐏𝑛𝑡\mathbf{P}_{n,t}bold_P start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT, which moves over the state space 𝕊++Lsuperscriptsubscript𝕊absent𝐿\mathbb{S}_{++}^{L}blackboard_S start_POSTSUBSCRIPT + + end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT of symmetric positive definite L×L𝐿𝐿L\times Litalic_L × italic_L matrices, according to Kalman filter dynamics as shown in Section II-C.

For target n𝑛nitalic_n at time t𝑡titalic_t, the immediate cost caused by imposing action an,tsubscript𝑎𝑛𝑡a_{n,t}italic_a start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT is defined as

Cn(𝐏n,t,an,t)dntr(𝐏n,t)/L+hnan,t,subscript𝐶𝑛subscript𝐏𝑛𝑡subscript𝑎𝑛𝑡subscript𝑑𝑛trsubscript𝐏𝑛𝑡𝐿subscript𝑛subscript𝑎𝑛𝑡C_{n}\left(\mathbf{P}_{n,t},a_{n,t}\right)\triangleq d_{n}\mathrm{tr}\left(% \mathbf{P}_{n,t}\right)/L+h_{n}a_{n,t},italic_C start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_P start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT ) ≜ italic_d start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT roman_tr ( bold_P start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT ) / italic_L + italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT , (8)

where tr()tr\mathrm{tr}(\cdot)roman_tr ( ⋅ ) denotes the trace of a matrix, dn0subscript𝑑𝑛0d_{n}\geq 0italic_d start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ≥ 0 is the target’s weight, which is used to model the importance of targets, and hn0subscript𝑛0h_{n}\geq 0italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ≥ 0 is a measurement cost.

The radar network selects at most K𝐾Kitalic_K out of N𝑁Nitalic_N smart targets to track in each time slot t𝑡titalic_t. We consider that each radar can only steer one beam to track one target at each time, which is formulated as the constraint

n=1Nan,tK,t=0,1,2,.formulae-sequencesuperscriptsubscript𝑛1𝑁subscript𝑎𝑛𝑡𝐾𝑡012\sum_{n=1}^{N}a_{n,t}\leq K,~{}t=0,1,2,\ldots.∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT ≤ italic_K , italic_t = 0 , 1 , 2 , … . (9)

Given the joint model probabilities 𝕌=(𝐔n)n=1N𝕌superscriptsubscriptsubscript𝐔𝑛𝑛1𝑁\bm{\mathbb{U}}=(\bm{\mathrm{U}}_{n})_{n=1}^{N}blackboard_bold_U = ( bold_U start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT and a TEC state at the beginning 0=(𝐏n,0)n=1Nsubscript0superscriptsubscriptsubscript𝐏𝑛0𝑛1𝑁\mathbf{\mathbb{P}}_{0}=(\mathbf{P}_{n,0})_{n=1}^{N}blackboard_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = ( bold_P start_POSTSUBSCRIPT italic_n , 0 end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT, we consider the dynamic optimization problem with discounted cost over an infinite time horizon.

min𝝅Π(K)subscript𝝅Π𝐾\displaystyle\min_{\bm{\pi}\in\Pi(K)}roman_min start_POSTSUBSCRIPT bold_italic_π ∈ roman_Π ( italic_K ) end_POSTSUBSCRIPT E0𝝅[t=0n=1NβtCn(𝐏n,t,an,t)],superscriptsubscriptEsubscript0𝝅delimited-[]superscriptsubscript𝑡0superscriptsubscript𝑛1𝑁superscript𝛽𝑡subscript𝐶𝑛subscript𝐏𝑛𝑡subscript𝑎𝑛𝑡\displaystyle\mathrm{E}_{\mathbf{\mathbb{P}}_{0}}^{\bm{\pi}}\left[\sum_{t=0}^{% \infty}\sum_{n=1}^{N}\beta^{t}C_{n}\left(\mathbf{P}_{n,t},a_{n,t}\right)\right],roman_E start_POSTSUBSCRIPT blackboard_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT bold_italic_π end_POSTSUPERSCRIPT [ ∑ start_POSTSUBSCRIPT italic_t = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_β start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_C start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_P start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT ) ] , (10)

where 0<β<10𝛽10<\beta<10 < italic_β < 1 is the discount factor, E0𝝅[]superscriptsubscriptEsubscript0𝝅delimited-[]\mathrm{E}_{\mathbf{\mathbb{P}}_{0}}^{\bm{\pi}}[\cdot]roman_E start_POSTSUBSCRIPT blackboard_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT bold_italic_π end_POSTSUPERSCRIPT [ ⋅ ] is expectation under policy 𝝅𝝅\bm{\mathrm{\pi}}bold_italic_π conditioned on the initial state 0subscript0\mathbf{\mathbb{P}}_{0}blackboard_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, and denote by Π(K)Π𝐾\Pi(K)roman_Π ( italic_K ) the set of stationary scheduling policies that satisfy (9). We denote by V*(0)superscript𝑉subscript0V^{*}\left(\mathbf{\mathbb{P}}_{0}\right)italic_V start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( blackboard_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) the optimal cost of problem (10).

III RMABP-based beam scheduling policy

Generally, determining an optimal policy for problem (10) is a computationally intractable task, due both to the curse of dimensionality and to the continuous state space. We next discuss how Whittle’s [28] approach for obtaining a heuristic index policy would apply to this smart target tracking problem and discuss the challenges it poses. See also [46].

III-A Problem relaxation and decomposition

To obtain a relaxed version of problem (10), we replace the sample-path constraint (9) by the following constraint,

E0𝝅[t=0n=1Nβtan,t]K1β.superscriptsubscriptEsubscript0𝝅delimited-[]superscriptsubscript𝑡0superscriptsubscript𝑛1𝑁superscript𝛽𝑡subscript𝑎𝑛𝑡𝐾1𝛽\mathrm{E}_{\mathbf{\mathbb{P}}_{0}}^{\bm{\pi}}\left[\sum_{t=0}^{\infty}\sum_{% n=1}^{N}\beta^{t}a_{n,t}\right]\leq\frac{K}{1-\beta}.roman_E start_POSTSUBSCRIPT blackboard_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT bold_italic_π end_POSTSUPERSCRIPT [ ∑ start_POSTSUBSCRIPT italic_t = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_β start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT ] ≤ divide start_ARG italic_K end_ARG start_ARG 1 - italic_β end_ARG . (11)

Now, consistently with the above notation Π(K)Π𝐾{\Pi}(K)roman_Π ( italic_K ), let Π(N)Π𝑁{\Pi}(N)roman_Π ( italic_N ) represent the set of stationary scheduling policies capable of activating any quantity of projects (tracking any number of targets) at each time. Then, a relaxation of problem (10) is as follows.

min𝝅Π(N)subscript𝝅Π𝑁\displaystyle\min_{\bm{\pi}\in{\Pi}(N)}~{}roman_min start_POSTSUBSCRIPT bold_italic_π ∈ roman_Π ( italic_N ) end_POSTSUBSCRIPT E0𝝅[t=0n=1NβtCn(𝐏n,t,an,t)],superscriptsubscriptEsubscript0𝝅delimited-[]superscriptsubscript𝑡0superscriptsubscript𝑛1𝑁superscript𝛽𝑡subscript𝐶𝑛subscript𝐏𝑛𝑡subscript𝑎𝑛𝑡\displaystyle\mathrm{E}_{\mathbf{\mathbb{P}}_{0}}^{\bm{\pi}}\left[\sum_{t=0}^{% \infty}\sum_{n=1}^{N}\beta^{t}C_{n}\left(\mathbf{P}_{n,t},a_{n,t}\right)\right],roman_E start_POSTSUBSCRIPT blackboard_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT bold_italic_π end_POSTSUPERSCRIPT [ ∑ start_POSTSUBSCRIPT italic_t = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_β start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_C start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_P start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT ) ] , (12)
s.t.:(11).s.t.:italic-(11italic-)\displaystyle\text{s.t.:}~{}\eqref{eq15}.s.t.: italic_( italic_) .

Therefore, the minimum cost VR(0)superscript𝑉Rsubscript0V^{\textup{R}}(\mathbf{\mathbb{P}}_{0})italic_V start_POSTSUPERSCRIPT R end_POSTSUPERSCRIPT ( blackboard_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) derived from problem (12) establishes a lower bound on V*(0)superscript𝑉subscript0V^{*}(\mathbf{\mathbb{P}}_{0})italic_V start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( blackboard_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ).

Attach now a Lagrange multiplier λ0𝜆0\lambda\geq 0italic_λ ≥ 0 to the aggregate constraint (11). The Lagrangian relaxation applied to (12) is

min𝝅Π(N)E0𝝅[t=0n=1Nβt{Cn(𝐏n,t,an,t)+λan,t}]Kλ1βsubscript𝝅Π𝑁superscriptsubscriptEsubscript0𝝅delimited-[]superscriptsubscript𝑡0superscriptsubscript𝑛1𝑁superscript𝛽𝑡subscript𝐶𝑛subscript𝐏𝑛𝑡subscript𝑎𝑛𝑡𝜆subscript𝑎𝑛𝑡𝐾𝜆1𝛽\min_{\bm{\pi}\in{\Pi}(N)}\mathrm{E}_{\mathbf{\mathbb{P}}_{0}}^{\bm{\pi}}\left% [\sum_{t=0}^{\infty}\sum_{n=1}^{N}\beta^{t}\{C_{n}\left(\mathbf{P}_{n,t},a_{n,% t}\right)+\lambda a_{n,t}\}\right]-\frac{K\lambda}{1-\beta}roman_min start_POSTSUBSCRIPT bold_italic_π ∈ roman_Π ( italic_N ) end_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT blackboard_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT bold_italic_π end_POSTSUPERSCRIPT [ ∑ start_POSTSUBSCRIPT italic_t = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_β start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT { italic_C start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_P start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT ) + italic_λ italic_a start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT } ] - divide start_ARG italic_K italic_λ end_ARG start_ARG 1 - italic_β end_ARG (13)

Given a λ0𝜆0\lambda\geq 0italic_λ ≥ 0, the optimal value of the Lagrangian relaxation, denoted as VL(0;λ)superscript𝑉Lsubscript0𝜆V^{\textup{L}}(\mathbf{\mathbb{P}}_{0};\lambda)italic_V start_POSTSUPERSCRIPT L end_POSTSUPERSCRIPT ( blackboard_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ; italic_λ ), serves as a lower bound for VR(0)superscript𝑉Rsubscript0V^{\textup{R}}(\mathbf{\mathbb{P}}_{0})italic_V start_POSTSUPERSCRIPT R end_POSTSUPERSCRIPT ( blackboard_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ).

If we search for an optimal Lagrange multiplier λ*(0)superscript𝜆subscript0\lambda^{*}(\mathbf{\mathbb{P}}_{0})italic_λ start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( blackboard_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) based on the Lagrangian relaxation, the best lower bound for VR(0)superscript𝑉Rsubscript0V^{\textup{R}}(\mathbf{\mathbb{P}}_{0})italic_V start_POSTSUPERSCRIPT R end_POSTSUPERSCRIPT ( blackboard_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) is obtained. This defines the dual problem of (12):

VD(0)maxλ0VL(0;λ).superscript𝑉Dsubscript0subscript𝜆0superscript𝑉Lsubscript0𝜆V^{\textup{D}}(\mathbf{\mathbb{P}}_{0})\triangleq\max_{\lambda\geq 0}\,V^{% \textup{L}}\left(\mathbf{\mathbb{P}}_{0};\lambda\right).italic_V start_POSTSUPERSCRIPT D end_POSTSUPERSCRIPT ( blackboard_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ≜ roman_max start_POSTSUBSCRIPT italic_λ ≥ 0 end_POSTSUBSCRIPT italic_V start_POSTSUPERSCRIPT L end_POSTSUPERSCRIPT ( blackboard_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ; italic_λ ) . (14)

Problem (13) can be decomposed into N𝑁Nitalic_N single-target independent subproblems given by

minπnΠnE𝐏n,0πn[t=0βt{Cn(𝐏n,t,an,t)+λan,t}].subscriptsuperscript𝜋𝑛superscriptΠ𝑛superscriptsubscriptEsubscript𝐏𝑛0superscript𝜋𝑛delimited-[]superscriptsubscript𝑡0superscript𝛽𝑡subscript𝐶𝑛subscript𝐏𝑛𝑡subscript𝑎𝑛𝑡𝜆subscript𝑎𝑛𝑡\min_{\pi^{n}\in\Pi^{n}}\mathrm{E}_{\mathbf{P}_{n,0}}^{{\pi}^{n}}\left[\sum_{t% =0}^{\infty}\beta^{t}\{C_{n}\left(\mathbf{P}_{n,t},a_{n,t}\right)+\lambda a_{n% ,t}\}\right].roman_min start_POSTSUBSCRIPT italic_π start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ∈ roman_Π start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT bold_P start_POSTSUBSCRIPT italic_n , 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT [ ∑ start_POSTSUBSCRIPT italic_t = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT italic_β start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT { italic_C start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_P start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT ) + italic_λ italic_a start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT } ] . (15)

In (15), ΠnsuperscriptΠ𝑛\Pi^{n}roman_Π start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT is the set of stationary scheduling policies corresponding to target n𝑛nitalic_n. Here, multiplier λ𝜆\lambdaitalic_λ has the economic interpretation of an additional tracking cost.

III-B Computing the Lower bound VD(0)superscript𝑉Dsubscript0V^{\textup{D}}(\mathbf{\mathbb{P}}_{0})italic_V start_POSTSUPERSCRIPT D end_POSTSUPERSCRIPT ( blackboard_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT )

In the one-dimensional case with real-valued state space for a single bandit process, VD((Pn,0)n=1N)superscript𝑉Dsuperscriptsubscriptsubscript𝑃𝑛0𝑛1𝑁V^{\textup{D}}\left((P_{n,0})_{n=1}^{N}\right)italic_V start_POSTSUPERSCRIPT D end_POSTSUPERSCRIPT ( ( italic_P start_POSTSUBSCRIPT italic_n , 0 end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ) can be calculated through the value iteration approach [53]. In general, consider the lower bound VD(0)superscript𝑉Dsubscript0V^{\textup{D}}(\mathbf{\mathbb{P}}_{0})italic_V start_POSTSUPERSCRIPT D end_POSTSUPERSCRIPT ( blackboard_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) in (14) which is the optimal value of the Lagrangian dual problem. Recall that we aim to minimize the expected total discounted (ETD) cost. Based on the Bellman equation, we firstly derive the optimal value function for the process {𝐏n,t,t=0,1,}formulae-sequencesubscript𝐏𝑛𝑡𝑡01\{\mathbf{P}_{n,t},t=0,1,\ldots\}{ bold_P start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT , italic_t = 0 , 1 , … } of target n𝑛nitalic_n in (15). Given the Lagrange multiplier λ𝜆\lambdaitalic_λ, we obtain

vnλ,*(𝐏n,t)=minan,t{0,1}{\displaystyle v^{\lambda,*}_{n}\left(\mathbf{P}_{n,t}\right)=\min\limits_{a_{n% ,t}\in\{0,1\}}\Biggl{\{}italic_v start_POSTSUPERSCRIPT italic_λ , * end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_P start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT ) = roman_min start_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT ∈ { 0 , 1 } end_POSTSUBSCRIPT { Cn(𝐏n,t,an,t)+λan,tsubscript𝐶𝑛subscript𝐏𝑛𝑡subscript𝑎𝑛𝑡𝜆subscript𝑎𝑛𝑡\displaystyle C_{n}\left(\mathbf{P}_{n,t},a_{n,t}\right)+\lambda a_{n,t}italic_C start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_P start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT ) + italic_λ italic_a start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT (16)
+\displaystyle++ βE[vnλ,*(ϕ1an,t(𝐏n,t))]}.\displaystyle\beta\mathrm{E}\left[v^{\lambda,*}_{n}\left(\phi_{1}^{a_{n,t}}% \left(\mathbf{P}_{n,t}\right)\right)\right]\Biggl{\}}.italic_β roman_E [ italic_v start_POSTSUPERSCRIPT italic_λ , * end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_ϕ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( bold_P start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT ) ) ] } .

Subsequently given the Lagrange multiplier λ𝜆\lambdaitalic_λ in (13), we can derive the multitarget optimal value function Lλ,*(t)superscript𝐿𝜆subscript𝑡L^{\lambda,*}\left(\mathbf{\mathbb{P}}_{t}\right)italic_L start_POSTSUPERSCRIPT italic_λ , * end_POSTSUPERSCRIPT ( blackboard_P start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) and substitute (16) into it, as given by

Lλ,*(t)superscript𝐿𝜆subscript𝑡\displaystyle L^{\lambda,*}\left(\mathbf{\mathbb{P}}_{t}\right)italic_L start_POSTSUPERSCRIPT italic_λ , * end_POSTSUPERSCRIPT ( blackboard_P start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) (17)
=\displaystyle== min𝒂t{0,1}N{n=1NCn(𝐏n,t,an,t)+βE[Lλ,*(ϕ~𝒂t(t))]\displaystyle\min\limits_{\bm{a}_{t}\in\{0,1\}^{N}}\left\{\sum\limits_{n=1}^{N% }C_{n}\left(\mathbf{P}_{n,t},a_{n,t}\right)+\beta\mathrm{E}\left[L^{\lambda,*}% \left(\tilde{\phi}^{\bm{a}_{t}}\left(\mathbf{\mathbb{P}}_{t}\right)\right)% \right]\right.roman_min start_POSTSUBSCRIPT bold_italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ { 0 , 1 } start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT end_POSTSUBSCRIPT { ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_C start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_P start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT ) + italic_β roman_E [ italic_L start_POSTSUPERSCRIPT italic_λ , * end_POSTSUPERSCRIPT ( over~ start_ARG italic_ϕ end_ARG start_POSTSUPERSCRIPT bold_italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( blackboard_P start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) ]
+λβ(n=1N(an,tK))}\displaystyle\left.+\lambda\beta\left(\sum_{n=1}^{N}\left(a_{n,t}-K\right)% \right)\right\}+ italic_λ italic_β ( ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ( italic_a start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT - italic_K ) ) }
=\displaystyle== n=1Nvnλ,*(𝐏n,t)Kλ1β,superscriptsubscript𝑛1𝑁subscriptsuperscript𝑣𝜆𝑛subscript𝐏𝑛𝑡𝐾𝜆1𝛽\displaystyle\sum_{n=1}^{N}v^{\lambda,*}_{n}\left(\mathbf{P}_{n,t}\right)-% \frac{K\lambda}{1-\beta},∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_v start_POSTSUPERSCRIPT italic_λ , * end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_P start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT ) - divide start_ARG italic_K italic_λ end_ARG start_ARG 1 - italic_β end_ARG ,

where

ϕ~𝒂t(t)=(ϕ1a1,t(𝐏1,t),,ϕNaN,t(𝐏N,t)),superscript~italic-ϕsubscript𝒂𝑡subscript𝑡superscriptsubscriptitalic-ϕ1subscript𝑎1𝑡subscript𝐏1𝑡superscriptsubscriptitalic-ϕ𝑁subscript𝑎𝑁𝑡subscript𝐏𝑁𝑡\tilde{\phi}^{\bm{a}_{t}}\left(\mathbf{\mathbb{P}}_{t}\right)=\Bigl{(}\phi_{1}% ^{a_{1,t}}\left(\mathbf{P}_{1,t}\right),\ldots,\phi_{N}^{a_{N,t}}\left(\mathbf% {P}_{N,t}\right)\Bigr{)},over~ start_ARG italic_ϕ end_ARG start_POSTSUPERSCRIPT bold_italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( blackboard_P start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = ( italic_ϕ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT 1 , italic_t end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( bold_P start_POSTSUBSCRIPT 1 , italic_t end_POSTSUBSCRIPT ) , … , italic_ϕ start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_N , italic_t end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( bold_P start_POSTSUBSCRIPT italic_N , italic_t end_POSTSUBSCRIPT ) ) , (18)

representing the next-slot joint target states.

Based on the established value function and a given λ𝜆\lambdaitalic_λ, we establish the k𝑘kitalic_k-th iteration of value function vn,kλ(𝐏n,t)subscriptsuperscript𝑣𝜆𝑛𝑘subscript𝐏𝑛𝑡v^{\lambda}_{n,k}\left(\mathbf{P}_{n,t}\right)italic_v start_POSTSUPERSCRIPT italic_λ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_k end_POSTSUBSCRIPT ( bold_P start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT ) as

vn,kλ(𝐏n,t)=minan,t{0,1}subscriptsuperscript𝑣𝜆𝑛𝑘subscript𝐏𝑛𝑡subscriptsubscript𝑎𝑛𝑡01\displaystyle v^{\lambda}_{n,k}\left(\mathbf{P}_{n,t}\right)=\min\limits_{a_{n% ,t}\in\{0,1\}}italic_v start_POSTSUPERSCRIPT italic_λ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_k end_POSTSUBSCRIPT ( bold_P start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT ) = roman_min start_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT ∈ { 0 , 1 } end_POSTSUBSCRIPT {Cn(𝐏n,t,an,t)+λan,t\displaystyle\Biggl{\{}C_{n}\left(\mathbf{P}_{n,t},a_{n,t}\right)+\lambda a_{n% ,t}{ italic_C start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_P start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT ) + italic_λ italic_a start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT (19)
+βE[𝕀an,t=1vn,k1λ(ϕn1(𝐏n,t))]𝛽Edelimited-[]subscript𝕀subscript𝑎𝑛𝑡1subscriptsuperscript𝑣𝜆𝑛𝑘1subscriptsuperscriptitalic-ϕ1𝑛subscript𝐏𝑛𝑡\displaystyle+\beta\mathrm{E}\left[\mathbb{I}_{a_{n,t}=1}v^{\lambda}_{n,k-1}% \left(\phi^{1}_{n}(\mathbf{P}_{n,t})\right)\right]+ italic_β roman_E [ blackboard_I start_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT = 1 end_POSTSUBSCRIPT italic_v start_POSTSUPERSCRIPT italic_λ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_k - 1 end_POSTSUBSCRIPT ( italic_ϕ start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_P start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT ) ) ]
+βE[𝕀an,t=0vn,k1λ(ϕn0(𝐏n,t))]}.\displaystyle+\beta\mathrm{E}\left[\mathbb{I}_{a_{n,t}=0}v^{\lambda}_{n,k-1}% \left(\phi^{0}_{n}(\mathbf{P}_{n,t})\right)\right]\Biggl{\}}.+ italic_β roman_E [ blackboard_I start_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT = 0 end_POSTSUBSCRIPT italic_v start_POSTSUPERSCRIPT italic_λ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_k - 1 end_POSTSUBSCRIPT ( italic_ϕ start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_P start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT ) ) ] } .

Then, we use the value iteration approach to continue the iterative process over the state space until convergence of values on diverse joint state tsubscript𝑡\mathbf{\mathbb{P}}_{t}blackboard_P start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. Finally, we obtain the optimal value vnλ,*(𝐏n,t)subscriptsuperscript𝑣𝜆𝑛subscript𝐏𝑛𝑡v^{\lambda,*}_{n}\left(\mathbf{P}_{n,t}\right)italic_v start_POSTSUPERSCRIPT italic_λ , * end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_P start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT ) in (16) and derive Lλ,*(t)superscript𝐿𝜆subscript𝑡L^{\lambda,*}\left(\mathbf{\mathbb{P}}_{t}\right)italic_L start_POSTSUPERSCRIPT italic_λ , * end_POSTSUPERSCRIPT ( blackboard_P start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) in (17). Given the initial joint state 0subscript0\mathbf{\mathbb{P}}_{0}blackboard_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and a given λ𝜆\lambdaitalic_λ, the minimum cost VL(0;λ)superscript𝑉Lsubscript0𝜆V^{\textup{L}}(\mathbf{\mathbb{P}}_{0};\lambda)italic_V start_POSTSUPERSCRIPT L end_POSTSUPERSCRIPT ( blackboard_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ; italic_λ ) can be computed by Lλ,*(0)superscript𝐿𝜆subscript0L^{\lambda,*}\left(\mathbf{\mathbb{P}}_{0}\right)italic_L start_POSTSUPERSCRIPT italic_λ , * end_POSTSUPERSCRIPT ( blackboard_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ). The lower bound VD(0)superscript𝑉Dsubscript0V^{\textup{D}}\left(\mathbf{\mathbb{P}}_{0}\right)italic_V start_POSTSUPERSCRIPT D end_POSTSUPERSCRIPT ( blackboard_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) of the dual problem in (14) can be computed by searching a λ(0,)𝜆0\lambda\in(0,\infty)italic_λ ∈ ( 0 , ∞ ) to satisfy

VD(0)maxλ0Lλ,*(0).superscript𝑉Dsubscript0subscript𝜆0superscript𝐿𝜆subscript0V^{\textup{D}}(\mathbf{\mathbb{P}}_{0})\triangleq\max_{\lambda\geq 0}\,L^{% \lambda,*}\left(\mathbf{\mathbb{P}}_{0}\right).italic_V start_POSTSUPERSCRIPT D end_POSTSUPERSCRIPT ( blackboard_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ≜ roman_max start_POSTSUBSCRIPT italic_λ ≥ 0 end_POSTSUBSCRIPT italic_L start_POSTSUPERSCRIPT italic_λ , * end_POSTSUPERSCRIPT ( blackboard_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) . (20)

Above all, we regard the lower bound of problem (14) as the optimal cost of the original problem (10) to further calculate the suboptimality gap of policies and numerically analyze the near optimality of the Whittle index policy, given a constant ratio ξ=K/N𝜉𝐾𝑁\xi=K/Nitalic_ξ = italic_K / italic_N.

III-C Indexability and Whittle index policy

Consider now the following structural property of subproblems (15), referred to as indexability in [28], which we formulate next as in [47, Section 3.2]. Note that target n𝑛nitalic_n’s subproblem (15) is indexable if there exists an index λn*:𝕊++L:superscriptsubscript𝜆𝑛superscriptsubscript𝕊absent𝐿\lambda_{n}^{*}\colon\mathbb{S}_{++}^{L}\to\mathbb{R}italic_λ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT : blackboard_S start_POSTSUBSCRIPT + + end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT → blackboard_R that characterizes its optimal policies, as follows: for any λ𝜆\lambda\in\mathbb{R}italic_λ ∈ blackboard_R, when the target is in state 𝐏nsubscript𝐏𝑛\mathbf{P}_{n}bold_P start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT, λn*(𝐏n)>λsuperscriptsubscript𝜆𝑛subscript𝐏𝑛𝜆\lambda_{n}^{*}(\mathbf{P}_{n})>\lambdaitalic_λ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( bold_P start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) > italic_λ (λn*(𝐏n)λsuperscriptsubscript𝜆𝑛subscript𝐏𝑛𝜆\lambda_{n}^{*}(\mathbf{P}_{n})\leq\lambdaitalic_λ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( bold_P start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) ≤ italic_λ) is necessary and sufficient conditions for that taking action an=1subscript𝑎𝑛1a_{n}=1italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = 1 (an=0subscript𝑎𝑛0a_{n}=0italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = 0) is optimal.

If each subproblem were indexable, the Whittle index policy would be to track up to K𝐾Kitalic_K out of the N𝑁Nitalic_N targets with larger index values, among those with nonnegative indices. If the current index of a target is negative, it is not tracked.

Yet, at present it is unknown whether restless projects with multi-dimensional Kalman filter dynamics such as those above are indexable, even for a single dynamics model (M=1𝑀1M=1italic_M = 1). Indexability has only been established (in [49]) for the special case of target tracking with scalar Kalman filter dynamics and a single dynamics model, by applying the PCL-indexability approach for real-state projects in [43].

III-D PCL-indexability approach for real-state targets

We next outline the PCL-indexability approach for real-state restless bandit projects as it applies to the present model (so L=1𝐿1L=1italic_L = 1). In the sequel we pay attention to the MP index of a single target with multiple dynamics models as described above, and drop the subscript n𝑛nitalic_n from notations.

Starting from an initial state P0=P𝕊++(0,)subscript𝑃0𝑃subscript𝕊absent0P_{0}=P\in\mathbb{S}_{++}\triangleq(0,\infty)italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = italic_P ∈ blackboard_S start_POSTSUBSCRIPT + + end_POSTSUBSCRIPT ≜ ( 0 , ∞ ), for a policy π𝜋\piitalic_π, define the cost metric

F(P,π)EPπ[t=0βtC(Pt,at)],𝐹𝑃𝜋subscriptsuperscriptE𝜋𝑃delimited-[]superscriptsubscript𝑡0superscript𝛽𝑡𝐶subscript𝑃𝑡subscript𝑎𝑡F(P,\pi)\triangleq\mathrm{E}^{\pi}_{P}\left[\sum_{t=0}^{\infty}\beta^{t}C\left% (P_{t},a_{t}\right)\right],italic_F ( italic_P , italic_π ) ≜ roman_E start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT [ ∑ start_POSTSUBSCRIPT italic_t = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT italic_β start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_C ( italic_P start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ] , (21)

which gives the ETD cost, where P𝑃Pitalic_P is the target’s tracking error variance (TEV) state.

The work metric is defined as

G(P,π)EPπ[t=0βtat].𝐺𝑃𝜋subscriptsuperscriptE𝜋𝑃delimited-[]superscriptsubscript𝑡0superscript𝛽𝑡subscript𝑎𝑡G(P,\pi)\triangleq\mathrm{E}^{\pi}_{P}\left[\sum_{t=0}^{\infty}\beta^{t}a_{t}% \right].italic_G ( italic_P , italic_π ) ≜ roman_E start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT [ ∑ start_POSTSUBSCRIPT italic_t = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT italic_β start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] . (22)

The subproblem (15) can be reformulated as

minπΠF(P,π)+λG(P,π).subscript𝜋Π𝐹𝑃𝜋𝜆𝐺𝑃𝜋\min_{\pi\in\Pi}\,F(P,\pi)+\lambda G(P,\pi).roman_min start_POSTSUBSCRIPT italic_π ∈ roman_Π end_POSTSUBSCRIPT italic_F ( italic_P , italic_π ) + italic_λ italic_G ( italic_P , italic_π ) . (23)

The indexability of subproblem (23) is studied under deterministic stationary policies and threshold policies. Deterministic stationary policies are represented as (Borel measurable) subsets of TEV states, where the corresponding target is tracked. We will employ an active set denoted as B𝕊++𝐵subscript𝕊absentB\subseteq\mathbb{S}_{++}italic_B ⊆ blackboard_S start_POSTSUBSCRIPT + + end_POSTSUBSCRIPT to characterize the B𝐵Bitalic_B-active policy. In particular, for any given threshold level z𝕊¯𝕊++{,}𝑧¯𝕊subscript𝕊absentz\in\bar{\mathbb{S}}\triangleq\mathbb{S}_{{++}}\cup\{-\infty,\infty\}italic_z ∈ over¯ start_ARG blackboard_S end_ARG ≜ blackboard_S start_POSTSUBSCRIPT + + end_POSTSUBSCRIPT ∪ { - ∞ , ∞ }, we will designate this as the z𝑧zitalic_z-threshold policy. This means that, for a target in TEV state P𝑃Pitalic_P, if P>z𝑃𝑧P>zitalic_P > italic_z, the target is tracked; otherwise, it is not tracked. Therefore, a z𝑧zitalic_z-threshold policy has active set B(z){P𝕊++:P>z}𝐵𝑧conditional-set𝑃subscript𝕊absent𝑃𝑧B(z)\triangleq\{P\in\mathbb{S}_{++}:P>z\}italic_B ( italic_z ) ≜ { italic_P ∈ blackboard_S start_POSTSUBSCRIPT + + end_POSTSUBSCRIPT : italic_P > italic_z }. It is clear that, if 0<z<0𝑧0<z<\infty0 < italic_z < ∞, then B(z)=(z,)𝐵𝑧𝑧B(z)=(z,\infty)italic_B ( italic_z ) = ( italic_z , ∞ ); if z0𝑧0z\leq 0italic_z ≤ 0, then B(z)=𝕊++𝐵𝑧subscript𝕊absentB(z)=\mathbb{S}_{++}italic_B ( italic_z ) = blackboard_S start_POSTSUBSCRIPT + + end_POSTSUBSCRIPT; and if z=𝑧z=\inftyitalic_z = ∞, then B(z)=𝐵𝑧B(z)=\emptysetitalic_B ( italic_z ) = ∅. We shall denote the corresponding cost and work metrics in (21) and (22) by F(P,z)𝐹𝑃𝑧F(P,z)italic_F ( italic_P , italic_z ) and G(P,z)𝐺𝑃𝑧G(P,z)italic_G ( italic_P , italic_z ), respectively.

Given a threshold z𝑧zitalic_z, the cost and work metrics are characterized by the functional equations.

F(P,z)={C(P,1)+βF(ϕ1(P),z),P>zC(P,0)+βF(ϕ0(P),z),Pz.𝐹𝑃𝑧cases𝐶𝑃1𝛽𝐹superscriptitalic-ϕ1𝑃𝑧𝑃𝑧𝐶𝑃0𝛽𝐹superscriptitalic-ϕ0𝑃𝑧𝑃𝑧F(P,z)=\begin{cases}C\left(P,1\right)+\beta F\left(\phi^{1}(P),z\right),&P>z\\ C\left(P,0\right)+\beta F\left(\phi^{0}(P),z\right),&P\leq z.\\ \end{cases}italic_F ( italic_P , italic_z ) = { start_ROW start_CELL italic_C ( italic_P , 1 ) + italic_β italic_F ( italic_ϕ start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT ( italic_P ) , italic_z ) , end_CELL start_CELL italic_P > italic_z end_CELL end_ROW start_ROW start_CELL italic_C ( italic_P , 0 ) + italic_β italic_F ( italic_ϕ start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ( italic_P ) , italic_z ) , end_CELL start_CELL italic_P ≤ italic_z . end_CELL end_ROW (24)

and

G(P,z)={1+βG(ϕ1(P),z),P>zβG(ϕ0(P),z),Pz.𝐺𝑃𝑧cases1𝛽𝐺superscriptitalic-ϕ1𝑃𝑧𝑃𝑧𝛽𝐺superscriptitalic-ϕ0𝑃𝑧𝑃𝑧G(P,z)=\begin{cases}1+\beta G\left(\phi^{1}(P),z\right),&P>z\\ \beta G\left(\phi^{0}(P),z\right),&P\leq z.\\ \end{cases}italic_G ( italic_P , italic_z ) = { start_ROW start_CELL 1 + italic_β italic_G ( italic_ϕ start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT ( italic_P ) , italic_z ) , end_CELL start_CELL italic_P > italic_z end_CELL end_ROW start_ROW start_CELL italic_β italic_G ( italic_ϕ start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ( italic_P ) , italic_z ) , end_CELL start_CELL italic_P ≤ italic_z . end_CELL end_ROW (25)

In practice, such metrics can be approximately computed by a value-iteration scheme with a finite truncated time horizon τ𝜏\tauitalic_τ, as discussed in [43, Section 11].

Next, we represent as a,z𝑎𝑧\langle a,z\rangle⟨ italic_a , italic_z ⟩ the policy that executes action a𝑎aitalic_a at time t=0𝑡0t=0italic_t = 0 and subsequently follows the z𝑧zitalic_z-threshold policy from t=1𝑡1t=1italic_t = 1 onwards. Accordingly, we introduce the corresponding marginal metrics for the previously discussed measures. Specifically, we define the marginal cost metric f(P,z)F(P,0,z)F(P,1,z)𝑓𝑃𝑧𝐹𝑃0𝑧𝐹𝑃1𝑧f(P,z)\triangleq F(P,\langle 0,z\rangle)-F(P,\langle 1,z\rangle)italic_f ( italic_P , italic_z ) ≜ italic_F ( italic_P , ⟨ 0 , italic_z ⟩ ) - italic_F ( italic_P , ⟨ 1 , italic_z ⟩ ) and the marginal work metric g(P,z)G(P,1,z)G(P,0,z)𝑔𝑃𝑧𝐺𝑃1𝑧𝐺𝑃0𝑧g(P,z)\triangleq G(P,\langle 1,z\rangle)-G(P,\langle 0,z\rangle)italic_g ( italic_P , italic_z ) ≜ italic_G ( italic_P , ⟨ 1 , italic_z ⟩ ) - italic_G ( italic_P , ⟨ 0 , italic_z ⟩ ). If g(P,z)>0𝑔𝑃𝑧0g(P,z)>0italic_g ( italic_P , italic_z ) > 0, we further define the MP metric function by

mp(P,z)=f(P,z)g(P,z).mp𝑃𝑧𝑓𝑃𝑧𝑔𝑃𝑧\mathrm{mp}(P,z)=\frac{f(P,z)}{g(P,z)}.roman_mp ( italic_P , italic_z ) = divide start_ARG italic_f ( italic_P , italic_z ) end_ARG start_ARG italic_g ( italic_P , italic_z ) end_ARG . (26)

If g(P,P)>0𝑔𝑃𝑃0g(P,P)>0italic_g ( italic_P , italic_P ) > 0 for all P𝑃Pitalic_P, we define the MP index by

mp*(P)mp(P,P).superscriptmp𝑃mp𝑃𝑃\mathrm{mp}^{*}(P)\triangleq\mathrm{mp}\left(P,P\right).roman_mp start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_P ) ≜ roman_mp ( italic_P , italic_P ) . (27)

As in [43, Definition 7], we say that the subproblem (23) is PCL-indexable (with respect to threshold policies) if the following PCL-indexability conditions hold:

  • (PCLI1) g(P,z)>0𝑔𝑃𝑧0g(P,z)>0italic_g ( italic_P , italic_z ) > 0 for every state P𝑃Pitalic_P and threshold z𝑧zitalic_z;

  • (PCLI2) mp*(P)superscriptmp𝑃\mathrm{mp}^{*}(P)roman_mp start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_P ) is monotone non-decreasing, continuous, and bounded below;

  • (PCLI3) for each P𝑃Pitalic_P, the metrics F(P,z)𝐹𝑃𝑧F(P,z)italic_F ( italic_P , italic_z ), G(P,z)𝐺𝑃𝑧G(P,z)italic_G ( italic_P , italic_z ) and the index mp*(P)superscriptmp𝑃\mathrm{mp}^{*}(P)roman_mp start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_P ) are related by: for <z1<z2<subscript𝑧1subscript𝑧2-\infty<z_{1}<z_{2}<\infty- ∞ < italic_z start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT < italic_z start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT < ∞,

    F(P,z2)F(P,z1)=(z1,z2]mp*(z)G(P,dz),𝐹𝑃subscript𝑧2𝐹𝑃subscript𝑧1subscriptsubscript𝑧1subscript𝑧2superscriptmp𝑧𝐺𝑃𝑑𝑧F(P,z_{2})-F(P,z_{1})=\int_{(z_{1},z_{2}]}\mathrm{mp}^{*}(z)\,G(P,dz),italic_F ( italic_P , italic_z start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) - italic_F ( italic_P , italic_z start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) = ∫ start_POSTSUBSCRIPT ( italic_z start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_z start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ] end_POSTSUBSCRIPT roman_mp start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_z ) italic_G ( italic_P , italic_d italic_z ) ,

    where the right-hand side is a Lebesgue–Stieltjes integral.

The interest of the above PCL-indexability conditions lies in their applicability through the verification theorem in [43, Theorem 1], which ensures that, for a real-state project, conditions (PCLI1)–(PCLI3) above imply that the project is indexable, and the MP index mp*(P)superscriptmp𝑃\mathrm{mp}^{*}(P)roman_mp start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_P ) is its Whittle index.

Corollary 1.

If subproblem (23) is PCL-indexable, then it is indexable with Whittle index mp*(P)superscriptnormal-mp𝑃\mathrm{mp}^{*}(P)roman_mp start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_P ).

This was the approach deployed in [49] to prove indexability for the special case of a single dynamics model (M=1𝑀1M=1italic_M = 1). Yet, proving that conditions (PCLI1)–(PCLI3) hold for the present model with M2𝑀2M\geq 2italic_M ≥ 2 is beyond the scope of this work. Still, we conjecture that the model satisfies such conditions, and will present partial numerical evidence supporting satisfaction of such conditions. We shall further use the MP index above as a surrogate of the Whittle index, and refer to it as the Whittle index. Note that only if it were proven that conditions (PCLI1)–(PCLI3) hold could we ensure that the MP index is indeed the Whittle index.

III-E A heuristic MP index for multi-dimensional state targets

In the case of target dynamic states moving over in a L𝐿Litalic_L-dimensional state space Lsuperscript𝐿\mathbb{R}^{L}blackboard_R start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT, with L2𝐿2L\geq 2italic_L ≥ 2, the TEC matrix indicates the tracking accuracy of the target state’s components, e.g., location and velocity. Since the TEC state is an L×L𝐿𝐿L\times Litalic_L × italic_L matrix, a convenient scalar measure of tracking performance is given by its trace. The application of the PCL-indexability approach for proving indexability and evaluating the Whittle index in such a case remains an open problem, even in the case of single target dynamics (M=1𝑀1M=1italic_M = 1). This section leverages the past success of the PCL-indexability and extends a new heuristic approach to define an MP index for the multi-dimensional case.

In particular, we define the cost metric F(𝐏,π)𝐹𝐏𝜋F(\mathbf{P},\pi)italic_F ( bold_P , italic_π ) and the work metric G(𝐏,π)𝐺𝐏𝜋G(\mathbf{P},\pi)italic_G ( bold_P , italic_π ) by replacing the one-dimensional P𝑃Pitalic_P in (21) and (22) with the multi-dimensional 𝐏𝐏\mathbf{P}bold_P. The target’s optimal tracking subproblem (15) becomes

minπΠF(𝐏,π)+λG(𝐏,π).subscript𝜋Π𝐹𝐏𝜋𝜆𝐺𝐏𝜋\min_{\pi\in\Pi}\,F\left(\mathbf{P},\pi\right)+\lambda G\left(\mathbf{P},\pi% \right).roman_min start_POSTSUBSCRIPT italic_π ∈ roman_Π end_POSTSUBSCRIPT italic_F ( bold_P , italic_π ) + italic_λ italic_G ( bold_P , italic_π ) . (28)

We need to define the meaning of threshold policies in the present setting. For such a purpose, we shall consider that the z𝑧zitalic_z-threshold policy tracks the target in state 𝐏𝐏\mathbf{P}bold_P if and only if tr(𝐏)/L>ztr𝐏𝐿𝑧\mathrm{tr}(\mathbf{P})/{L}>zroman_tr ( bold_P ) / italic_L > italic_z. Then, the corresponding cost metric F(𝐏,z)𝐹𝐏𝑧F(\mathbf{P},z)italic_F ( bold_P , italic_z ) and work metric G(𝐏,z)𝐺𝐏𝑧G(\mathbf{P},z)italic_G ( bold_P , italic_z ) are the unique solutions to the following functional equations:

F(𝐏,z)={C(𝐏,1)+βF(ϕ1(𝐏),z),tr(𝐏)/L>zC(𝐏,0)+βF(ϕ0(𝐏),z),tr(𝐏)/Lz.𝐹𝐏𝑧cases𝐶𝐏1𝛽𝐹superscriptitalic-ϕ1𝐏𝑧tr𝐏𝐿𝑧𝐶𝐏0𝛽𝐹superscriptitalic-ϕ0𝐏𝑧tr𝐏𝐿𝑧F\left(\mathbf{P},z\right)=\begin{cases}C(\mathbf{P},1)+\beta F(\phi^{1}(% \mathbf{P}),z),&\mathrm{tr}(\mathbf{P})/{L}>z\\ C(\mathbf{P},0)+\beta F(\phi^{0}(\mathbf{P}),z),&\mathrm{tr}(\mathbf{P})/{L}% \leq z.\\ \end{cases}italic_F ( bold_P , italic_z ) = { start_ROW start_CELL italic_C ( bold_P , 1 ) + italic_β italic_F ( italic_ϕ start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT ( bold_P ) , italic_z ) , end_CELL start_CELL roman_tr ( bold_P ) / italic_L > italic_z end_CELL end_ROW start_ROW start_CELL italic_C ( bold_P , 0 ) + italic_β italic_F ( italic_ϕ start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ( bold_P ) , italic_z ) , end_CELL start_CELL roman_tr ( bold_P ) / italic_L ≤ italic_z . end_CELL end_ROW (29)
G(𝐏,z)={1+βG(ϕ1(𝐏),z),tr(𝐏)/L>zβG(ϕ0(𝐏,z)),tr(𝐏)/Lz.𝐺𝐏𝑧cases1𝛽𝐺superscriptitalic-ϕ1𝐏𝑧tr𝐏𝐿𝑧𝛽𝐺superscriptitalic-ϕ0𝐏𝑧tr𝐏𝐿𝑧G(\mathbf{P},z)=\begin{cases}1+\beta G(\phi^{1}(\mathbf{P}),z),&\mathrm{tr}(% \mathbf{P})/L>z\\ \beta G(\phi^{0}(\mathbf{P},z)),&\mathrm{tr}(\mathbf{P})/L\leq z.\\ \end{cases}italic_G ( bold_P , italic_z ) = { start_ROW start_CELL 1 + italic_β italic_G ( italic_ϕ start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT ( bold_P ) , italic_z ) , end_CELL start_CELL roman_tr ( bold_P ) / italic_L > italic_z end_CELL end_ROW start_ROW start_CELL italic_β italic_G ( italic_ϕ start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ( bold_P , italic_z ) ) , end_CELL start_CELL roman_tr ( bold_P ) / italic_L ≤ italic_z . end_CELL end_ROW (30)

Again, we can approximate such solutions through a value-iteration approach using a truncated time horizon τ𝜏\tauitalic_τ.

Now, similar to the scalar case above, we define the marginal cost metric f(𝐏,z)𝑓𝐏𝑧f(\mathbf{P},z)italic_f ( bold_P , italic_z ), the marginal work metric g(𝐏,z)𝑔𝐏𝑧g(\mathbf{P},z)italic_g ( bold_P , italic_z ), and the MP metric

mp(𝐏,z)=f(𝐏,z)g(𝐏,z)=F(𝐏,0,z)F(𝐏,1,z)G(𝐏,1,z)G(𝐏,0,z),mp𝐏𝑧𝑓𝐏𝑧𝑔𝐏𝑧𝐹𝐏0𝑧𝐹𝐏1𝑧𝐺𝐏1𝑧𝐺𝐏0𝑧\mathrm{mp}(\mathbf{P},z)=\frac{f(\mathbf{P},z)}{g(\mathbf{P},z)}=\frac{F(% \mathbf{P},\langle 0,z\rangle)-F(\mathbf{P},\langle 1,z\rangle)}{G(\mathbf{P},% \langle 1,z\rangle)-G(\mathbf{P},\langle 0,z\rangle)},roman_mp ( bold_P , italic_z ) = divide start_ARG italic_f ( bold_P , italic_z ) end_ARG start_ARG italic_g ( bold_P , italic_z ) end_ARG = divide start_ARG italic_F ( bold_P , ⟨ 0 , italic_z ⟩ ) - italic_F ( bold_P , ⟨ 1 , italic_z ⟩ ) end_ARG start_ARG italic_G ( bold_P , ⟨ 1 , italic_z ⟩ ) - italic_G ( bold_P , ⟨ 0 , italic_z ⟩ ) end_ARG , (31)

provided that g(𝐏,z)>0𝑔𝐏𝑧0g(\mathbf{P},z)>0italic_g ( bold_P , italic_z ) > 0.

The corresponding MP index is defined by

mp*(𝐏)mp(𝐏,tr(𝐏)/L),superscriptmp𝐏mp𝐏tr𝐏𝐿\mathrm{mp}^{*}(\mathbf{P})\triangleq\mathrm{mp}(\mathbf{P},\mathrm{tr}(% \mathbf{P})/L),roman_mp start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( bold_P ) ≜ roman_mp ( bold_P , roman_tr ( bold_P ) / italic_L ) , (32)

provided that g(𝐏,tr(𝐏)/L)>0𝑔𝐏tr𝐏𝐿0g(\mathbf{P},\mathrm{tr}(\mathbf{P})/L)>0italic_g ( bold_P , roman_tr ( bold_P ) / italic_L ) > 0 for all 𝐏𝐏\mathbf{P}bold_P.

However, there is currently no theory supporting any relation of the index mp*(𝐏)superscriptmp𝐏\mathrm{mp}^{*}(\mathbf{P})roman_mp start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( bold_P ) with the Whittle index in the multi-dimensional case, assuming that both are well defined. Still, we will discuss in Section IV the results of a simulation study on the performance of the index policy based on the heuristic MP index mp*(𝐏)superscriptmp𝐏\mathrm{mp}^{*}(\mathbf{P})roman_mp start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( bold_P ).

III-F Scheduling scheme based on the Whittle index policy

In Sections III-D and III-E, we discussed the MP-based indices for a bandit process with real-valued and multi-dimensional state spaces, respectively. In general, the MP-based index policy prioritizes tracking specific targets based on their state-dependent MP indices, regardless of the dimensions of the state, in descending order. The pseudo-code of implementing the index policy with given indices is provided in Algorithm 1, which is applicable to both cases with one-dimensional and multi-dimensional state spaces.

 // Initialization
1 The model parameters of the problem, i.e., N𝑁Nitalic_N, K𝐾Kitalic_K, T𝑇Titalic_T, β𝛽\betaitalic_β, n(𝐐n)n=1Nsubscript𝑛subscriptsuperscriptsubscript𝐐𝑛𝑁𝑛1\bm{\mathbb{Q}}_{n}\triangleq\left(\bm{\mathrm{Q}}_{n}\right)^{N}_{n=1}blackboard_bold_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ≜ ( bold_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT, n(𝐑n)n=1Nsubscript𝑛subscriptsuperscriptsubscript𝐑𝑛𝑁𝑛1\bm{\mathbb{R}}_{n}\triangleq\left(\bm{\mathrm{R}}_{n}\right)^{N}_{n=1}blackboard_bold_R start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ≜ ( bold_R start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT, and the target parameters, i.e. 0subscript0\mathbf{\mathbb{P}}_{0}blackboard_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, 𝕌𝕌\bm{\mathbb{U}}blackboard_bold_U.
 // Main loop
2 for t1normal-←𝑡1t\leftarrow 1italic_t ← 1 to T𝑇Titalic_T do
3       for n1normal-←𝑛1n\leftarrow 1italic_n ← 1 to N𝑁Nitalic_N do
4             Calculate the index of the TEC state 𝐏n,tsubscript𝐏𝑛𝑡\mathbf{P}_{n,t}bold_P start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT of arm n𝑛nitalic_n in (32);
5       end for
6      Select the K𝐾Kitalic_K targets with the largest indexes and generate the an,tsubscript𝑎𝑛𝑡a_{n,t}italic_a start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT, n=1,,N𝑛1𝑁n=1,\ldots,Nitalic_n = 1 , … , italic_N, where for the multi-target with the same indexes, they are randomly selected; for n1normal-←𝑛1n\leftarrow 1italic_n ← 1 to N𝑁Nitalic_N do
7             Update the TEC state 𝐏n,t+1subscript𝐏𝑛𝑡1\mathbf{P}_{n,t+1}bold_P start_POSTSUBSCRIPT italic_n , italic_t + 1 end_POSTSUBSCRIPT of arm n𝑛nitalic_n in (3) and (7);
8       end for
9      Calculate the cost in time t𝑡titalic_t.
10 end for
Algorithm 1 The scheduling scheme based on the Whittle index policy

III-G Computational complexity analysis

For the case with real-valued bandit states, the complexity for computing the indices in (32) is only linear in the number of targets N𝑁Nitalic_N and the truncated time τ𝜏\tauitalic_τ. Nonetheless, for the case with multi-dimensional state space for each bandit process, the linearity does not hold in general due to the complex matrix operations in (3) and (7).

In the Whittle index policy, the truncated time horizon τ𝜏\tauitalic_τ is the recursive horizon in (29) and (30). If the actions of each target across τ𝜏\tauitalic_τ time horizon are all assumed as 0, the computational complexity of (31) is 2Nτ(nϕ0+4)+2+12𝑁𝜏subscript𝑛superscriptitalic-ϕ04212N\tau(n_{\phi^{0}}+4)+2+12 italic_N italic_τ ( italic_n start_POSTSUBSCRIPT italic_ϕ start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT + 4 ) + 2 + 1. Otherwise, the computational complexity is 2Nτ(nϕ1+4)+2+12𝑁𝜏subscript𝑛superscriptitalic-ϕ14212N\tau(n_{\phi^{1}}+4)+2+12 italic_N italic_τ ( italic_n start_POSTSUBSCRIPT italic_ϕ start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT + 4 ) + 2 + 1, where nϕ1subscript𝑛superscriptitalic-ϕ1n_{\phi^{1}}italic_n start_POSTSUBSCRIPT italic_ϕ start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT and nϕ0subscript𝑛superscriptitalic-ϕ0n_{\phi^{0}}italic_n start_POSTSUBSCRIPT italic_ϕ start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT (nϕ0<nϕ1subscript𝑛superscriptitalic-ϕ0subscript𝑛superscriptitalic-ϕ1n_{\phi^{0}}<n_{\phi^{1}}italic_n start_POSTSUBSCRIPT italic_ϕ start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT < italic_n start_POSTSUBSCRIPT italic_ϕ start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT) represent the complexity of updating matrix states when the corresponding action variable is set to 1111 and 00, respectively. Due to the diverse combination of actions across τ𝜏\tauitalic_τ time horizon, the computational complexity of the Whittle index is less than 2Nτ(nϕ1+4)+2+12𝑁𝜏subscript𝑛superscriptitalic-ϕ14212N\tau(n_{\phi^{1}}+4)+2+12 italic_N italic_τ ( italic_n start_POSTSUBSCRIPT italic_ϕ start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT + 4 ) + 2 + 1. That is, we have a complexity of O(Nτ)𝑂𝑁𝜏O(N\tau)italic_O ( italic_N italic_τ ).

We introduce two greedy approaches as baseline policies. One is the TEC index policy, which defines the indices by

λTEC(𝐏)=dtr(𝐏)/L.superscript𝜆TEC𝐏𝑑tr𝐏𝐿\lambda^{\text{TEC}}(\mathbf{P})=d\,\mathrm{tr}(\mathbf{P})/L.italic_λ start_POSTSUPERSCRIPT TEC end_POSTSUPERSCRIPT ( bold_P ) = italic_d roman_tr ( bold_P ) / italic_L . (33)

The computational complexity of the TEC index is O(N)𝑂𝑁O(N)italic_O ( italic_N ). The other prioritizes actions according to the MP indices with β=0𝛽0\beta=0italic_β = 0, namely, the myopic index policy with the indices given by

λmyopic=d[tr(ϕ0(𝐏))tr(ϕ1(𝐏))]/L.superscript𝜆myopic𝑑delimited-[]trsuperscriptitalic-ϕ0𝐏trsuperscriptitalic-ϕ1𝐏𝐿\lambda^{\text{myopic}}=d\left[\mathrm{tr}(\phi^{0}(\mathbf{P}))-\mathrm{tr}(% \phi^{1}(\mathbf{P}))\right]/L.italic_λ start_POSTSUPERSCRIPT myopic end_POSTSUPERSCRIPT = italic_d [ roman_tr ( italic_ϕ start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ( bold_P ) ) - roman_tr ( italic_ϕ start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT ( bold_P ) ) ] / italic_L . (34)

The computational complexity for mpmyopicsuperscriptmpmyopic\mathrm{mp}^{\text{myopic}}roman_mp start_POSTSUPERSCRIPT myopic end_POSTSUPERSCRIPT is O(N(nϕ1+nϕ0))𝑂𝑁subscript𝑛superscriptitalic-ϕ1subscript𝑛superscriptitalic-ϕ0O(N(n_{\phi^{1}}+n_{\phi^{0}}))italic_O ( italic_N ( italic_n start_POSTSUBSCRIPT italic_ϕ start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT + italic_n start_POSTSUBSCRIPT italic_ϕ start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ) ). The baseline policies exhibit lower computational complexity but, in general, fail to take consideration of long future effects of the employed actions. Based on numerical results in [46, 47, 38], with appropriate β𝛽\betaitalic_β (β0.9𝛽0.9\beta\leq 0.9italic_β ≤ 0.9), the MP indices mp*(𝑷)superscriptmp𝑷\mathrm{mp}^{*}(\bm{P})roman_mp start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( bold_italic_P ) (described in (32)) with truncated horizon τ=100𝜏100\tau=100italic_τ = 100 is sufficiently close to the analytical Whittle indices, for which the computational complexity is comparable to the greedy approaches.

IV Simulation and Analysis

In this section, we report on numerical experiments to assess the satisfaction of the PCL-indexability conditions and the effectiveness of the proposed Whittle (MP) index policy in the real-state case. We will compare the proposed index policy with the two greedy baselines, i.e., the TEC policy and the myopic policy, in the scalar and the multi-dimensional state cases.

IV-A PCL-indexability conditions and MP index evaluation for real-state smart targets

We start by considering the PCL-indexability conditions stated in Section III-D as they apply to scalar state targets. Recall that they are sufficient conditions for indexability (existence of the Whittle index) and further give an explicit expression of the Whittle index in the form of an MP index. Establishing analytically that the present model satisfies such conditions is beyond the scope of this work. Still, we present below a sample of numerical evidence supporting the conjecture that the model satisfies such conditions.

To test the conditions numerically, we consider targets with M=2𝑀2M=2italic_M = 2 dynamics models, which we refer to as CV (m=1𝑚1m=1italic_m = 1) and CT (m=2𝑚2m=2italic_m = 2), and parameter values FCV=1.1superscript𝐹CV1.1F^{\text{CV}}=1.1italic_F start_POSTSUPERSCRIPT CV end_POSTSUPERSCRIPT = 1.1, FCT=1.3superscript𝐹CT1.3F^{\text{CT}}=1.3italic_F start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT = 1.3, QCV=1superscript𝑄CV1Q^{\text{CV}}=1italic_Q start_POSTSUPERSCRIPT CV end_POSTSUPERSCRIPT = 1, QCT>QCVsuperscript𝑄CTsuperscript𝑄CVQ^{\text{CT}}>Q^{\text{CV}}italic_Q start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT > italic_Q start_POSTSUPERSCRIPT CV end_POSTSUPERSCRIPT, H=1𝐻1H=1italic_H = 1, and R=2𝑅2R=2italic_R = 2. There are no measurement costs (h=00h=0italic_h = 0) and the weight is d=1𝑑1d=1italic_d = 1. We assume two types of smart targets, called reckless and cautious, where reckless targets have a higher probability of maneuvering (using the CT model) when tracked than cautious targets. Hence, the dynamics model probability matrix 𝐔=[𝒖0,𝒖1]𝐔superscript𝒖0superscript𝒖1\bm{\mathrm{U}}=\left[\bm{u}^{0},\bm{u}^{1}\right]bold_U = [ bold_italic_u start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , bold_italic_u start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT ] differs with the target types. Specifically, we take

𝐔={[[0.90,0.10],[0.20,0.80]],for reckless targets[[0.95,0.05],[0.60,0.40]],for cautious targets.𝐔casessuperscript0.900.10superscript0.200.80for reckless targetssuperscript0.950.05superscript0.600.40for cautious targets.\bm{\mathrm{U}}=\begin{cases}\left[[0.90,0.10]^{\prime},[0.20,0.80]^{\prime}% \right],&\text{for reckless targets}\\ \left[[0.95,0.05]^{\prime},[0.60,0.40]^{\prime}\right],&\text{for cautious % targets.}\\ \end{cases}bold_U = { start_ROW start_CELL [ [ 0.90 , 0.10 ] start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , [ 0.20 , 0.80 ] start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ] , end_CELL start_CELL for reckless targets end_CELL end_ROW start_ROW start_CELL [ [ 0.95 , 0.05 ] start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , [ 0.60 , 0.40 ] start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ] , end_CELL start_CELL for cautious targets. end_CELL end_ROW (35)

where, if target n1subscript𝑛1n_{1}italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT belongs to reckless targets, target n2subscript𝑛2n_{2}italic_n start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT belongs to cautious targets, we can obtain un10,m>un20,msuperscriptsubscript𝑢subscript𝑛10𝑚superscriptsubscript𝑢subscript𝑛20𝑚u_{n_{1}}^{0,m}>u_{n_{2}}^{0,m}italic_u start_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 , italic_m end_POSTSUPERSCRIPT > italic_u start_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 , italic_m end_POSTSUPERSCRIPT and un11,m>un21,msuperscriptsubscript𝑢subscript𝑛11𝑚superscriptsubscript𝑢subscript𝑛21𝑚u_{n_{1}}^{1,m}>u_{n_{2}}^{1,m}italic_u start_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 , italic_m end_POSTSUPERSCRIPT > italic_u start_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 , italic_m end_POSTSUPERSCRIPT, m=2,,M𝑚2𝑀m=2,\ldots,Mitalic_m = 2 , … , italic_M.

In the following numerical experiments, the infinite series defining the metrics of interest have been approximately computed by using the truncated time horizon τ=100𝜏100\tau=100italic_τ = 100. The discount factor is β=0.9𝛽0.9\beta=0.9italic_β = 0.9, and the TEV state P𝑃Pitalic_P has been taken from a grid of values, each separated by a distance of 102superscript10210^{-2}10 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT.

Refer to caption (a)
Refer to caption (b)
Refer to caption (c)
Refer to caption (d)
Figure 2: Marginal work metric g(P,z)𝑔𝑃𝑧g(P,z)italic_g ( italic_P , italic_z ) and marginal cost metric f(P,z)𝑓𝑃𝑧f(P,z)italic_f ( italic_P , italic_z ) for different z𝑧zitalic_z and QCTsuperscript𝑄CTQ^{\text{CT}}italic_Q start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT for reckless targets: (a) z=4𝑧4z=4italic_z = 4 and QCT=4superscript𝑄CT4Q^{\text{CT}}=4italic_Q start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT = 4; (b) z=10𝑧10z=10italic_z = 10 and QCT=4superscript𝑄CT4Q^{\text{CT}}=4italic_Q start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT = 4; (c) z=4𝑧4z=4italic_z = 4 and QCT=10superscript𝑄CT10Q^{\text{CT}}=10italic_Q start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT = 10; (d) z=10𝑧10z=10italic_z = 10 and QCT=10superscript𝑄CT10Q^{\text{CT}}=10italic_Q start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT = 10.
Refer to caption (a)
Refer to caption (b)
Refer to caption (c)
Refer to caption (d)
Figure 3: Marginal work metric g(P,z)𝑔𝑃𝑧g(P,z)italic_g ( italic_P , italic_z ) and marginal cost metric f(P,z)𝑓𝑃𝑧f(P,z)italic_f ( italic_P , italic_z ) for different z𝑧zitalic_z and QCTsuperscript𝑄CTQ^{\text{CT}}italic_Q start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT for cautious targets: (a) z=4𝑧4z=4italic_z = 4 and QCT=4superscript𝑄CT4Q^{\text{CT}}=4italic_Q start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT = 4; (b) z=10𝑧10z=10italic_z = 10 and QCT=4superscript𝑄CT4Q^{\text{CT}}=4italic_Q start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT = 4; (c) z=4𝑧4z=4italic_z = 4 and QCT=10superscript𝑄CT10Q^{\text{CT}}=10italic_Q start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT = 10; (d) z=10𝑧10z=10italic_z = 10 and QCT=10superscript𝑄CT10Q^{\text{CT}}=10italic_Q start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT = 10.

Start with PCL-indexability condition (PCLI1), namely that marginal work metric g(P,z)𝑔𝑃𝑧g(P,z)italic_g ( italic_P , italic_z ) be positive for each TEV state P𝑃Pitalic_P and threshold z𝑧zitalic_z. Figs. 2 and 3 plot both g(P,z)𝑔𝑃𝑧g(P,z)italic_g ( italic_P , italic_z ) and the marginal cost metric f(P,z)𝑓𝑃𝑧f(P,z)italic_f ( italic_P , italic_z ) against P𝑃Pitalic_P for several choices of threshold z𝑧zitalic_z and of the QCTsuperscript𝑄CTQ^{\text{CT}}italic_Q start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT parameter, for reckless targets and cautious targets in Fig. 2 and Fig. 3, respectively. The plots support the validity of condition (PCLI1), as g(P,z)>0𝑔𝑃𝑧0g(P,z)>0italic_g ( italic_P , italic_z ) > 0 in each instance, a result that has been observed under other parameter choices not reported here.

Regarding PCL-indexability condition (PCLI2), namely that the MP index defined by mp*(P)f(P,P)/g(P,P)superscriptmp𝑃𝑓𝑃𝑃𝑔𝑃𝑃\mathrm{mp}^{*}(P)\triangleq f(P,P)/g(P,P)roman_mp start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_P ) ≜ italic_f ( italic_P , italic_P ) / italic_g ( italic_P , italic_P ) be bounded below, continuous and monotone non-decreasing, Fig. 4 plots the MP index with choices of QCTsuperscript𝑄CTQ^{\text{CT}}italic_Q start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT and P𝑃Pitalic_P for the two target types considered. Figs. 4 (c) and (d) show the MP index vs. P𝑃Pitalic_P for QCT=4,10superscript𝑄CT410Q^{\text{CT}}=4,10italic_Q start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT = 4 , 10, respectively. The plots are clearly consistent with condition (PCLI2), which has been observed in other instances not reported here.

Refer to caption (a)
Refer to caption (b)
Refer to caption (c)
Refer to caption (d)
Figure 4: MP index for different target types with different process noise variances QCTsuperscript𝑄CTQ^{\text{CT}}italic_Q start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT and states P𝑃Pitalic_P: (a) P=1𝑃1P=1italic_P = 1; (b) P=10𝑃10P=10italic_P = 10; (c) QCT=4superscript𝑄CT4Q^{\text{CT}}=4italic_Q start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT = 4; (d) QCT=10superscript𝑄CT10Q^{\text{CT}}=10italic_Q start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT = 10.

Fig. 4 (a) and (b) further show the MP index vs. QCT[0,40]superscript𝑄CT040Q^{\text{CT}}\in[0,40]italic_Q start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT ∈ [ 0 , 40 ] for states P=1,10𝑃110P=1,10italic_P = 1 , 10, respectively. When the state P𝑃Pitalic_P is small, the long-term cost is predominantly influenced by QCTsuperscript𝑄CTQ^{\text{CT}}italic_Q start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT. Consequently, with the same P=1𝑃1P=1italic_P = 1, the reckless target with a higher QCTsuperscript𝑄CTQ^{\text{CT}}italic_Q start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT will obtain a higher MP index value. Otherwise, with a large P=10𝑃10P=10italic_P = 10, prioritizing the long-term cost minimization, the MP index policy will sacrifice the immediate cost at the current time and track the target with higher QCTsuperscript𝑄CTQ^{\text{CT}}italic_Q start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT after a certain number of time steps. That is, the target with a lower QCTsuperscript𝑄CTQ^{\text{CT}}italic_Q start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT is tracked currently. Hence, the MP index will decrease over QCTsuperscript𝑄CTQ^{\text{CT}}italic_Q start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT. Due to the higher probability of a cautious target acquiring a lower cost with the same state P𝑃Pitalic_P and QCTsuperscript𝑄CTQ^{\text{CT}}italic_Q start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT under action 1, cautious targets obtain a higher index than reckless targets in all states, which indicates that the MP index policy takes the tracking action prudently for reckless targets.

In summary, the above partial numerical evidence that the model of concern satisfies PCL-indexability conditions (PCLI1) and (PCLI2), and hence we shall further use the MP index as a surrogate of the Whittle index, and refer to it as the Whittle index.

IV-B Performance results with real-valued states

In this subsection, N=8𝑁8N=8italic_N = 8 smart targets are tracked by the radars network with K=1,,3𝐾13K=1,\ldots,3italic_K = 1 , … , 3 radars. For each target, the radar tracking time horizon T=100𝑇100T=100italic_T = 100 s and β=0.9𝛽0.9\beta=0.9italic_β = 0.9. We set up the Whittle index policy with the truncated time τ=100𝜏100\tau=100italic_τ = 100, the TEV index policy where λTEV(P)=d*Psuperscript𝜆TEV𝑃𝑑𝑃\lambda^{\text{TEV}}(P)=d*Pitalic_λ start_POSTSUPERSCRIPT TEV end_POSTSUPERSCRIPT ( italic_P ) = italic_d * italic_P , and the myopic policy that degenerates to mpmyopic=d*(ϕ0(P,𝒖0)ϕ1(P,𝒖1))superscriptmpmyopic𝑑superscriptitalic-ϕ0𝑃superscript𝒖0superscriptitalic-ϕ1𝑃superscript𝒖1\mathrm{mp}^{\text{myopic}}=d*(\phi^{0}(P,\bm{u}^{0})-\phi^{1}(P,\bm{u}^{1}))roman_mp start_POSTSUPERSCRIPT myopic end_POSTSUPERSCRIPT = italic_d * ( italic_ϕ start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ( italic_P , bold_italic_u start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ) - italic_ϕ start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT ( italic_P , bold_italic_u start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT ) ) to assess tracking performances.

The parameters FnCVsuperscriptsubscript𝐹𝑛CVF_{n}^{\text{CV}}italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT CV end_POSTSUPERSCRIPT, FnCTsuperscriptsubscript𝐹𝑛CTF_{n}^{\text{CT}}italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT, QnCVsuperscriptsubscript𝑄𝑛CVQ_{n}^{\text{CV}}italic_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT CV end_POSTSUPERSCRIPT, hnsubscript𝑛h_{n}italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT of target n𝑛nitalic_n and H𝐻Hitalic_H, R𝑅Ritalic_R are assumed the same as Section IV-A, n=1,,N𝑛1𝑁n=1,\ldots,Nitalic_n = 1 , … , italic_N. The initial state Pn,0U(0,2)similar-tosubscript𝑃𝑛0U02P_{n,0}\sim\mathrm{U}(0,2)italic_P start_POSTSUBSCRIPT italic_n , 0 end_POSTSUBSCRIPT ∼ roman_U ( 0 , 2 ). Then, we define three scenarios with only reckless type, cautious type, and each type of target accounting for 4, where the process noise variance QnCT={2,2,,2}superscriptsubscript𝑄𝑛CT222Q_{n}^{\text{CT}}=\{2,2,\ldots,2\}italic_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT = { 2 , 2 , … , 2 } and {2,3,,9}239\{2,3,\ldots,9\}{ 2 , 3 , … , 9 } in the first two scenarios, respectively; Otherwise, QnCT={2,2,,2}superscriptsubscript𝑄𝑛CT222Q_{n}^{\text{CT}}=\{2,2,\ldots,2\}italic_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT = { 2 , 2 , … , 2 } and {2,3,4,5,2,3,4,5}23452345\{2,3,4,5,2,3,4,5\}{ 2 , 3 , 4 , 5 , 2 , 3 , 4 , 5 } in the third scenario. We assume dn=5subscript𝑑𝑛5d_{n}=5italic_d start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = 5, n=1𝑛1n=1italic_n = 1 and dn=1subscript𝑑𝑛1d_{n}=1italic_d start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = 1, n1𝑛1n\neq 1italic_n ≠ 1 in the first two scenarios. Moreover, dn=5subscript𝑑𝑛5d_{n}=5italic_d start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = 5 for reckless targets and dn=1subscript𝑑𝑛1d_{n}=1italic_d start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = 1 for cautious targets in the third scenario.

Table I, II, III show the discounted objective performance in three policies. Each policy is analyzed by varying the optional number K𝐾Kitalic_K of radars and Nmc=100subscript𝑁𝑚𝑐100N_{mc}=100italic_N start_POSTSUBSCRIPT italic_m italic_c end_POSTSUBSCRIPT = 100 Monte Carlo simulations. When all the targets are homogeneous and heterogeneous, the Whittle index policy can obtain the best scheduling performance and outperform the TEV index and myopic policies. The myopic policy considers the one-step cost optimization, which is beneficial to the different types of targets; While the TEV index policy is not insensitive to the variation of targets. Consequently, the performance of the myopic policy is better than the TEV policy. Hence, the simulation results of all the scenarios show the superiority of the Whittle index policies.

TABLE I: Results with different numbers of radars for all the same reckless targets
QnCTsubscriptsuperscript𝑄CT𝑛Q^{\text{CT}}_{n}italic_Q start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT Policies K=1𝐾1K=1italic_K = 1 K=2𝐾2K=2italic_K = 2 K=3𝐾3K=3italic_K = 3
{2,2,,2}222\{2,2,\ldots,2\}{ 2 , 2 , … , 2 } Whittle index 823.19 400.53 284.65
myopic 868.71 405.85 293.80
TEV 871.19 406.17 293.72
{2,3,,9}239\{2,3,\ldots,9\}{ 2 , 3 , … , 9 } Whittle index 961.25 458.49 319.84
myopic 993.93 464.43 326.04
TEV 1009.15 465.12 334.06
TABLE II: Results with different numbers of radars for all the same cautious targets
QnCTsubscriptsuperscript𝑄CT𝑛Q^{\text{CT}}_{n}italic_Q start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT Policies K=1𝐾1K=1italic_K = 1 K=2𝐾2K=2italic_K = 2 K=3𝐾3K=3italic_K = 3
{2,2,,2}222\{2,2,\ldots,2\}{ 2 , 2 , … , 2 } Whittle index 750.91 377.36 268.30
myopic 790.61 381.92 275.81
TEV 790.40 384.06 275.75
{2,3,,9}239\{2,3,\ldots,9\}{ 2 , 3 , … , 9 } Whittle index 817.23 406.26 285.67
myopic 849.91 409.88 296.35
TEV 861.88 410.56 296.19
TABLE III: Results with different numbers of radars for different type targets
QnCTsubscriptsuperscript𝑄CT𝑛Q^{\text{CT}}_{n}italic_Q start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT Policies K=1𝐾1K=1italic_K = 1 K=2𝐾2K=2italic_K = 2 K=3𝐾3K=3italic_K = 3
{2,2,,2}222\{2,2,\ldots,2\}{ 2 , 2 , … , 2 } Whittle index 1554.35 772.19 547.38
myopic 1614.41 807.35 567.14
TEV 1622.97 808.89 567.94
{2,3,4,5,\{2,3,4,5,{ 2 , 3 , 4 , 5 , 2,3,4,5}2,3,4,5\}2 , 3 , 4 , 5 } Whittle index 1664.83 821.05 581.47
myopic 1731.69 859.02 605.75
TEV 1733.95 860.42 605.73

IV-C Near optimality with real-valued states

Within this section, we validate the near optimality of the Whittle index policy discussed in Section III-C through different dnsubscript𝑑𝑛d_{n}italic_d start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT and QnCTsuperscriptsubscript𝑄𝑛CTQ_{n}^{\text{CT}}italic_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT, when the number of radars and targets increases with a fixed ratio ξ=K/N=1/4𝜉𝐾𝑁14\xi=K/N=1/4italic_ξ = italic_K / italic_N = 1 / 4. The parameters FnCVsuperscriptsubscript𝐹𝑛CVF_{n}^{\text{CV}}italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT CV end_POSTSUPERSCRIPT, FnCTsuperscriptsubscript𝐹𝑛CTF_{n}^{\text{CT}}italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT, QnCVsuperscriptsubscript𝑄𝑛CVQ_{n}^{\text{CV}}italic_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT CV end_POSTSUPERSCRIPT, hnsubscript𝑛h_{n}italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT of target n𝑛nitalic_n and H𝐻Hitalic_H, R𝑅Ritalic_R are assumed the same as Section IV-B.

We first assume three scenarios, where dn=1subscript𝑑𝑛1d_{n}=1italic_d start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = 1, n=1,,N𝑛1𝑁n=1,\ldots,Nitalic_n = 1 , … , italic_N. Three scenarios are divided by the types of targets, including all the reckless targets, all the cautious targets, and each type of target accounts for 50%percent5050\%50 %. For each target n𝑛nitalic_n, let Pn,0=0.01subscript𝑃𝑛00.01{{P}}_{n,0}=0.01italic_P start_POSTSUBSCRIPT italic_n , 0 end_POSTSUBSCRIPT = 0.01. In addition, the process noise variances of the three scenarios are QnCT={2,,N+1}superscriptsubscript𝑄𝑛CT2𝑁1Q_{n}^{\text{CT}}=\{2,\ldots,N+1\}italic_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT = { 2 , … , italic_N + 1 }, QnCT={2,,N+1}superscriptsubscript𝑄𝑛CT2𝑁1Q_{n}^{\text{CT}}=\{2,\ldots,N+1\}italic_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT = { 2 , … , italic_N + 1 }, QnCT={2,,N/2+1,2,,N/2+1}superscriptsubscript𝑄𝑛CT2𝑁212𝑁21Q_{n}^{\text{CT}}=\{2,\ldots,N/2+1,2,\ldots,N/2+1\}italic_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT = { 2 , … , italic_N / 2 + 1 , 2 , … , italic_N / 2 + 1 }, n=1,,N𝑛1𝑁n=1,\ldots,Nitalic_n = 1 , … , italic_N, respectively. We modify the simulation population by letting K𝐾Kitalic_K vary. For each scenario, the Whittle index truncates the corresponding infinite series to τ=100𝜏100\tau=100italic_τ = 100 terms with β=0.9𝛽0.9\beta=0.9italic_β = 0.9, and performances are evaluated by Nmc=100subscript𝑁𝑚𝑐100N_{mc}=100italic_N start_POSTSUBSCRIPT italic_m italic_c end_POSTSUBSCRIPT = 100 Monte Carlo simulations.

Through the calculations of the lower bound in Section III-B, we derive the simulation results about the suboptimality gap between the lower bound per target and the objective per target of index policies, as shown in Fig. 5, respectively.

Refer to caption (a)
Refer to caption (b)
Refer to caption (c)
Figure 5: The suboptimality gap of policies for heterogeneous targets. (a) Reckless targets. (b) Cautious targets. (c) Two types of targets.

Different from the above three scenarios, we assume another scenario with the ratio ξ=K/N=1/4𝜉𝐾𝑁14\xi=K/N=1/4italic_ξ = italic_K / italic_N = 1 / 4. Firstly, there are 50%percent5050\%50 % reckless targets with dn=5subscript𝑑𝑛5d_{n}=5italic_d start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = 5 in the population of N𝑁Nitalic_N and others are cautious with dn=1subscript𝑑𝑛1d_{n}=1italic_d start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = 1. Secondly, we assume that all of the targets share the identical process noise variance QnCT=4superscriptsubscript𝑄𝑛CT4Q_{n}^{\text{CT}}=4italic_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT = 4. Other simulation parameters of each target are the same as the above three scenarios. The initial state for each target n𝑛nitalic_n is assumed to be Pn,0U(0,2)similar-tosubscript𝑃𝑛0U02{{P}}_{n,0}\sim\mathrm{U}(0,2)italic_P start_POSTSUBSCRIPT italic_n , 0 end_POSTSUBSCRIPT ∼ roman_U ( 0 , 2 ). We modify the simulation population by letting K𝐾Kitalic_K vary. The results are shown in Fig. 6.

Refer to caption
Figure 6: The suboptimality gap in the constant ratio of Two types of targets.

Fig. 5 shows that the suboptimality gaps of the Whittle index policy and the TEV index policy decrease as the radar system scale increases. However, the suboptimality gap of the Whittle index policy converges to a lower level, which is lower than 10.5%percent10.510.5\%10.5 %, while the TEV index policy obtains a larger suboptimality gap. Unfortunately, the myopic policy has the worst performance and the suboptimality gap may increase as K𝐾Kitalic_K grows. In Fig. 6, the Whittle index policy also achieves the lowest suboptimality gap, which is stable at about 3.0%percent3.03.0\%3.0 %. While the TEV index policy and the myopic policy obtain 8.0%percent8.08.0\%8.0 % and 7.8%percent7.87.8\%7.8 %, respectively. Consequently, the Whittle index policy obtains lower objectives per target than other policies and the lowest suboptimality gap in all four scenarios.

Above all, the near optimality of the Whittle index policy is validated in this beam scheduling problem. When all the targets are homogeneous in different groups related to the target type and target parameters in Fig. 6, except Pn,0U(0,2)similar-tosubscript𝑃𝑛0U02{{P}}_{n,0}\sim\mathrm{U}(0,2)italic_P start_POSTSUBSCRIPT italic_n , 0 end_POSTSUBSCRIPT ∼ roman_U ( 0 , 2 ), and the suboptimality gap of the Whittle index policy is approximately stable at the lowest level. Accordingly, the Whittle index policy can efficiently solve the beam scheduling problem with real-valued TEV states and achieve better non-myopic performance than other greedy policies.

IV-D Performance results with multi-dimensional states

In this subsection, we attempt to expand the application of the MP index policy to the multi-dimensional states case. N=8𝑁8N=8italic_N = 8 targets are tracked by the radar network with K=1,,3𝐾13K=1,\ldots,3italic_K = 1 , … , 3 radars. For each target, the radar tracking time horizon T=100𝑇100T=100italic_T = 100 s. The MP index policy, the TEC index policy, and the myopic policy are described in Section III-G to assess the tracking performance.

Let 𝒙n,t=[xn,t,x˙n,t,yn,t,y˙n,t]subscript𝒙𝑛𝑡superscriptsubscript𝑥𝑛𝑡subscript˙𝑥𝑛𝑡subscript𝑦𝑛𝑡subscript˙𝑦𝑛𝑡\bm{x}_{n,t}=\left[x_{n,t},\dot{x}_{n,t},y_{n,t},\dot{y}_{n,t}\right]^{\prime}bold_italic_x start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT = [ italic_x start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT , over˙ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT , over˙ start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT ] start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT be the dynamic state of target n𝑛nitalic_n at time t𝑡titalic_t, where [xn,t,yn,t]subscript𝑥𝑛𝑡subscript𝑦𝑛𝑡[x_{n,t},y_{n,t}][ italic_x start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT ] represents the position of target n𝑛nitalic_n at time t𝑡titalic_t, [x˙n,t,y˙n,t]subscript˙𝑥𝑛𝑡subscript˙𝑦𝑛𝑡[\dot{x}_{n,t},\dot{y}_{n,t}][ over˙ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT , over˙ start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT ] is the velocity of target n𝑛nitalic_n at time t𝑡titalic_t. The state transition matrices corresponding to the CV and CT model are given by

𝑭nCV=𝐈2[1Ts01],superscriptsubscript𝑭𝑛CVtensor-productsubscript𝐈2matrix1subscript𝑇𝑠01\bm{F}_{n}^{\text{CV}}=\bm{\mathrm{I}}_{2}\otimes\begin{bmatrix}1&T_{s}\\ 0&1\\ \end{bmatrix},bold_italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT CV end_POSTSUPERSCRIPT = bold_I start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ⊗ [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL italic_T start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW end_ARG ] , (36)
𝑭nCT=[1sin(ωTs)ω0cos(ωTs)1ω0cos(ωTs)0sin(ωTs)01cos(ωTs)ω1sin(ωTs)ω0sin(ωTs)0cos(ωTs)],superscriptsubscript𝑭𝑛CTmatrix1𝜔subscript𝑇𝑠𝜔0𝜔subscript𝑇𝑠1𝜔0𝜔subscript𝑇𝑠0𝜔subscript𝑇𝑠01𝜔subscript𝑇𝑠𝜔1𝜔subscript𝑇𝑠𝜔0𝜔subscript𝑇𝑠0𝜔subscript𝑇𝑠\bm{F}_{n}^{\text{CT}}=\begin{bmatrix}1&\dfrac{\sin\left(\omega T_{s}\right)}{% \omega}&0&\dfrac{\cos\left(\omega T_{s}\right)-1}{\omega}\\ 0&\cos\left(\omega T_{s}\right)&0&\sin\left(\omega T_{s}\right)\\ 0&\dfrac{1-\cos\left(\omega T_{s}\right)}{\omega}&1&\dfrac{\sin\left(\omega T_% {s}\right)}{\omega}\\ 0&-\sin\left(\omega T_{s}\right)&0&\cos\left(\omega T_{s}\right)\\ \end{bmatrix},bold_italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT = [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL divide start_ARG roman_sin ( italic_ω italic_T start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) end_ARG start_ARG italic_ω end_ARG end_CELL start_CELL 0 end_CELL start_CELL divide start_ARG roman_cos ( italic_ω italic_T start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) - 1 end_ARG start_ARG italic_ω end_ARG end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL roman_cos ( italic_ω italic_T start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) end_CELL start_CELL 0 end_CELL start_CELL roman_sin ( italic_ω italic_T start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL divide start_ARG 1 - roman_cos ( italic_ω italic_T start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) end_ARG start_ARG italic_ω end_ARG end_CELL start_CELL 1 end_CELL start_CELL divide start_ARG roman_sin ( italic_ω italic_T start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) end_ARG start_ARG italic_ω end_ARG end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL - roman_sin ( italic_ω italic_T start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) end_CELL start_CELL 0 end_CELL start_CELL roman_cos ( italic_ω italic_T start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) end_CELL end_ROW end_ARG ] , (37)

where Ts=1subscript𝑇𝑠1T_{s}=1italic_T start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT = 1 denotes the tracking time interval, ω𝜔\omegaitalic_ω denotes the turn rate, and tensor-product\otimes represents the Kronecker product.

The process noise variance matrix

𝐐nm=qnm𝐈2[Ts3/3Ts2/2Ts2/2Ts],superscriptsubscript𝐐𝑛𝑚tensor-productsuperscriptsubscript𝑞𝑛𝑚subscript𝐈2matrixsuperscriptsubscript𝑇𝑠33superscriptsubscript𝑇𝑠22superscriptsubscript𝑇𝑠22subscript𝑇𝑠\bm{\mathrm{Q}}_{n}^{m}=q_{n}^{m}\bm{\mathrm{I}}_{2}\otimes\begin{bmatrix}T_{s% }^{3}/3&T_{s}^{2}/2\\ T_{s}^{2}/2&T_{s}\\ \end{bmatrix},bold_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT = italic_q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT bold_I start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ⊗ [ start_ARG start_ROW start_CELL italic_T start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT / 3 end_CELL start_CELL italic_T start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / 2 end_CELL end_ROW start_ROW start_CELL italic_T start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / 2 end_CELL start_CELL italic_T start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] , (38)

where qnmsuperscriptsubscript𝑞𝑛𝑚q_{n}^{m}italic_q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT denotes the amplitude of the process noise. 𝐈2subscript𝐈2\bm{\mathrm{I}}_{2}bold_I start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT denotes the 2×2222\times 22 × 2 identity matrix.

The measurement matrix 𝐇nsubscript𝐇𝑛\bm{\mathrm{H}}_{n}bold_H start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT is

𝐇n=[10000010].subscript𝐇𝑛matrix10000010\bm{\mathrm{H}}_{n}=\begin{bmatrix}1&0&0&0\\ 0&0&1&0\\ \end{bmatrix}.bold_H start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL end_ROW end_ARG ] . (39)

We assume that ω=3𝜔superscript3\omega=3^{\circ}italic_ω = 3 start_POSTSUPERSCRIPT ∘ end_POSTSUPERSCRIPT, qnCV=1superscriptsubscript𝑞𝑛CV1q_{n}^{\text{CV}}=1italic_q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT CV end_POSTSUPERSCRIPT = 1, 𝐑n=2*𝐈2subscript𝐑𝑛2subscript𝐈2\bm{\mathrm{R}}_{n}=2*\bm{\mathrm{I}}_{2}bold_R start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = 2 * bold_I start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, and measurement cost hn=0subscript𝑛0h_{n}=0italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = 0, n=1,,N𝑛1𝑁n=1,\ldots,Nitalic_n = 1 , … , italic_N. The initial state 𝐏n,0subscript𝐏𝑛0\mathbf{P}_{n,0}bold_P start_POSTSUBSCRIPT italic_n , 0 end_POSTSUBSCRIPT is generated by 𝐑0𝐑0superscriptsubscript𝐑0subscript𝐑0\bm{\mathrm{R}}_{0}^{\prime}\bm{\mathrm{R}}_{0}bold_R start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT bold_R start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, where each element of 𝐑0subscript𝐑0\bm{\mathrm{R}}_{0}bold_R start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT follows the uniform distribution U(0,1)U01\mathrm{U}(0,1)roman_U ( 0 , 1 ), ensuring that 𝐏n,0+4×4subscript𝐏𝑛0subscriptsuperscript44\mathbf{P}_{n,0}\in\mathbb{R}^{4\times 4}_{+}bold_P start_POSTSUBSCRIPT italic_n , 0 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 4 × 4 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT + end_POSTSUBSCRIPT. Then, we define two scenarios with the reckless type and the cautious type of targets with qnCT={2,3,,9}superscriptsubscript𝑞𝑛CT239q_{n}^{\text{CT}}=\{2,3,\ldots,9\}italic_q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT = { 2 , 3 , … , 9 }, respectively, and another scenario with two types of targets with qnCT={2,3,4,5,2,3,4,5}superscriptsubscript𝑞𝑛CT23452345q_{n}^{\text{CT}}=\{2,3,4,5,2,3,4,5\}italic_q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT = { 2 , 3 , 4 , 5 , 2 , 3 , 4 , 5 }, where the number of each type of targets is 4, respectively. In the first two scenarios, the weight is assumed as dn=5subscript𝑑𝑛5d_{n}=5italic_d start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = 5, n=1𝑛1n=1italic_n = 1, and dn=1subscript𝑑𝑛1d_{n}=1italic_d start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = 1, n1𝑛1n\neq 1italic_n ≠ 1. Otherwise, dn=5subscript𝑑𝑛5d_{n}=5italic_d start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = 5 for reckless targets and dn=1subscript𝑑𝑛1d_{n}=1italic_d start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = 1 for cautious targets in the third scenario.

The results with Nmc=100subscript𝑁𝑚𝑐100N_{mc}=100italic_N start_POSTSUBSCRIPT italic_m italic_c end_POSTSUBSCRIPT = 100 Monte Carlo simulations are shown in Table IV. As the value of K𝐾Kitalic_K varies from 1 to 3, While the number of targets N𝑁Nitalic_N remains constant at 8, the number of targets tracked will increase. Consequently, the tracking costs obtained by all policies naturally decrease, and the difference between the MP index policy and other greedy policies decreases. The MP index policy based on the trace of TEC matrices always obtains lower tracking errors and outperforms the TEC index policy and myopic policy.

TABLE IV: Results with different numbers of radars for different type targets with multi-dimensional states
Type Policies K=1𝐾1K=1italic_K = 1 K=2𝐾2K=2italic_K = 2 K=3𝐾3K=3italic_K = 3
Reckless MP index 4364.18 1142.87 610.10
myopic 4480.28 1153.19 613.15
TEC 4436.27 1153.43 633.80
Cautious MP index 3468.00 902.06 492.69
myopic 3584.63 931.63 500.40
TEC 3534.40 917.27 504.08
Reckless and Cautious MP index 6777.79 1824.24 990.12
myopic 7014.92 1895.25 1022.45
TEC 6879.65 1860.21 1040.56

Here, we also analyze the performance of policies changing over the number of targets and radars. The initial state 𝐏n,0subscript𝐏𝑛0\mathbf{P}_{n,0}bold_P start_POSTSUBSCRIPT italic_n , 0 end_POSTSUBSCRIPT is generated by 𝐑0𝐑0superscriptsubscript𝐑0subscript𝐑0\bm{\mathrm{R}}_{0}^{\prime}\bm{\mathrm{R}}_{0}bold_R start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT bold_R start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, where each element of 𝐑0subscript𝐑0\bm{\mathrm{R}}_{0}bold_R start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT follows the uniform distribution U(0,1)U01\mathrm{U}(0,1)roman_U ( 0 , 1 ), ensuring that 𝐏n,0𝕊++4subscript𝐏𝑛0subscriptsuperscript𝕊4absent\mathbf{P}_{n,0}\in\mathbb{S}^{4}_{++}bold_P start_POSTSUBSCRIPT italic_n , 0 end_POSTSUBSCRIPT ∈ blackboard_S start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT + + end_POSTSUBSCRIPT. The two scenarios are considered, where there are 50%percent5050\%50 % reckless targets and 50%percent5050\%50 % cautious targets, respectively. In the first scenario, qnCT={2,,N/2+1}superscriptsubscript𝑞𝑛CT2𝑁21q_{n}^{\text{CT}}=\{2,\ldots,N/2+1\}italic_q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT = { 2 , … , italic_N / 2 + 1 } in each type of targets, and dn=1subscript𝑑𝑛1d_{n}=1italic_d start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = 1 for all of the targets. In the second scenario, the process noise variance qnCT=4superscriptsubscript𝑞𝑛CT4q_{n}^{\text{CT}}=4italic_q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT = 4 for all of the targets, but dn=5subscript𝑑𝑛5d_{n}=5italic_d start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = 5 particularly for reckless targets. The performance results under Nmc=100subscript𝑁𝑚𝑐100N_{mc}=100italic_N start_POSTSUBSCRIPT italic_m italic_c end_POSTSUBSCRIPT = 100 Monte Carlo simulations are shown in Fig. 7 and Fig. 8. As the number of radars and targets increases, the MP index policy obtains better tracking performance in all of the cases.

Refer to caption
Figure 7: Performance for two types of targets with qnCT={2,,N/2+1,2,,N/2+1}superscriptsubscript𝑞𝑛CT2𝑁212𝑁21q_{n}^{\text{CT}}=\{2,\ldots,N/2+1,2,\ldots,N/2+1\}italic_q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT = { 2 , … , italic_N / 2 + 1 , 2 , … , italic_N / 2 + 1 } and dn1subscript𝑑𝑛1d_{n}\equiv 1italic_d start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ≡ 1.
Refer to caption
Figure 8: Performance for two types of targets with qnCT4superscriptsubscript𝑞𝑛CT4q_{n}^{\text{CT}}\equiv 4italic_q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT CT end_POSTSUPERSCRIPT ≡ 4 and dn=5subscript𝑑𝑛5d_{n}=5italic_d start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = 5 only for reckless targets.

These simulation results of targets with multi-dimensional TEC states show that the MP index policy also can efficiently solve the beam scheduling problem. However, the indexability validation and the lower bound need further calculations and verifications.

Above all, the adapted MP index policy improves the non-myopic beam scheduling performance and outperforms other greedy policies. Numerical simulation results with different parameters of the problem model indicate the superiority of this policy.

V Conclusion

Considering the beam scheduling problem for multiple smart target tracking, we have formulated each target as an MDP, where the state transition is established based on the Kalman filter and the dynamics model probability parameters. Then the trade-off, which is between more observation of smart targets for better tracking performance and the maneuvering reaction of smart targets to elude themselves, was solved by the RMABP problem model based on the Whittle index policy. Through numerical simulation results for both one-dimensional and multi-dimensional TEC states, we have demonstrated the better performance of the MP index policy by comparing it with the baseline greedy policies. For future research, we plan to prompt the resource scheduling for smart targets tacking with multi-dimensional states in the RMABP model and prove the validity of the Whittle index policy by theoretical analysis.

References

  • [1] A. M. Haimovich, R. S. Blum, and L. J. Cimini, “MIMO radar with widely separated antennas,” IEEE Signal Processing Magazine, vol. 25, no. 1, pp. 116–129, 2007.
  • [2] W. Zhang, C. Shi, J. Zhou, and R. Lv, “Joint aperture and transmit resource allocation strategy for multi-target localization in phased array radar network,” IEEE Transactions on Aerospace and Electronic Systems, vol. 59, no. 2, pp. 1551–1565, 2023.
  • [3] J. Yan, H. Jiao, W. Pu, C. Shi, J. Dai, and H. Liu, “Radar sensor network resource allocation for fused target tracking: a brief review,” Information Fusion, vol. 86, pp. 104–115, 2022.
  • [4] C. Kreucher, D. Blatt, A. Hero, and K. Kastella, “Adaptive multi-modality sensor scheduling for detection and tracking of smart targets,” Digital Signal Processing, vol. 16, no. 5, pp. 546–567, 2006.
  • [5] C. O. Savage and B. F. La Scala, “Sensor management for tracking smart targets,” Digital Signal Processing, vol. 19, no. 6, pp. 968–977, 2009.
  • [6] Z. Zhang and G. Shan, “Non-myopic sensor scheduling to track multiple reactive targets,” IET Signal Processing, vol. 9, no. 1, pp. 37–47, 2015.
  • [7] A. O. Hero, D. Castañón, D. Cochran, and K. Kastella, Foundations and applications of sensor management.   Springer Science & Business Media, 2007.
  • [8] M. R. Taner, O. E. Karasan, and E. Yavuzturk, “Scheduling beams with different priorities on a military surveillance radar,” IEEE Transactions on Aerospace and Electronic Systems, vol. 48, no. 2, pp. 1725–1739, 2012.
  • [9] H. Godrich, A. P. Petropulu, and H. V. Poor, “Sensor selection in distributed multiple-radar architectures for localization: A knapsack problem formulation,” IEEE Transactions on Signal Processing, vol. 60, no. 1, pp. 247–260, 2011.
  • [10] G. Zhang, J. Xie, H. Zhang, Z. Li, and C. Qi, “Dynamic antenna selection for colocated MIMO radar,” Remote Sensing, vol. 14, no. 12, p. 2912, 2022.
  • [11] H. Zhang, J. Shi, Q. Zhang, B. Zong, and J. Xie, “Antenna selection for target tracking in collocated MIMO radar,” IEEE Transactions on Aerospace and Electronic Systems, vol. 57, no. 1, pp. 423–436, 2020.
  • [12] J. Dai, J. Yan, W. Pu, H. Liu, and M. S. Greco, “Adaptive channel assignment for maneuvering target tracking in multistatic passive radar,” IEEE Transactions on Aerospace and Electronic Systems, vol. 59, no. 3, pp. 2780–2793, 2023.
  • [13] W. Yi, Y. Yuan, R. Hoseinnezhad, and L. Kong, “Resource scheduling for distributed multi-target tracking in netted colocated MIMO radar systems,” IEEE Transactions on Signal Processing, vol. 68, pp. 1602–1617, 2020.
  • [14] H. Zhang, W. Liu, J. Xie, Z. Zhang, and W. Lu, “Joint subarray selection and power allocation for cognitive target tracking in large-scale MIMO radar networks,” IEEE Systems Journal, vol. 14, no. 2, pp. 2569–2580, 2020.
  • [15] M. Xie, W. Yi, T. Kirubarajan, and L. Kong, “Joint node selection and power allocation strategy for multitarget tracking in decentralized radar networks,” IEEE Transactions on Signal Processing, vol. 66, no. 3, pp. 729–743, 2017.
  • [16] C. Shi, Y. Wang, S. Salous, J. Zhou, and J. Yan, “Joint transmit resource management and waveform selection strategy for target tracking in distributed phased array radar network,” IEEE Transactions on Aerospace and Electronic Systems, vol. 58, no. 4, pp. 2762–2778, 2021.
  • [17] J. Yan, W. Pu, S. Zhou, H. Liu, and M. S. Greco, “Optimal resource allocation for asynchronous multiple targets tracking in heterogeneous radar networks,” IEEE Transactions on Signal Processing, vol. 68, pp. 4055–4068, 2020.
  • [18] J. Yan, W. Pu, S. Zhou, H. Liu, and Z. Bao, “Collaborative detection and power allocation framework for target tracking in multiple radar system,” Information Fusion, vol. 55, pp. 173–183, 2020.
  • [19] C. Pang and G. Shan, “Sensor scheduling based on risk for target tracking,” IEEE Sensors Journal, vol. 19, no. 18, pp. 8224–8232, 2019.
  • [20] V. Krishnamurthy, Partially Observed Markov Decision Processes: From Filtering to Controlled Sensing.   Cambridge, UK: Cambridge University Press, 2016.
  • [21] C. Kreucher, A. O. Hero, K. Kastella, and D. Chang, “Efficient methods of non-myopic sensor management for multitarget tracking,” in 2004 43rd IEEE Conference on Decision and Control (CDC)(IEEE Cat. No. 04CH37601), vol. 1.   IEEE, 2004, pp. 722–727.
  • [22] C. M. Kreucher and A. O. Hero, “Monte Carlo methods for sensor management in target tracking,” in 2006 IEEE Nonlinear Statistical Signal Processing Workshop.   IEEE, 2006, pp. 232–237.
  • [23] X. Gongguo, S. Ganlin, and D. Xiusheng, “Non-myopic scheduling method of mobile sensors for manoeuvring target tracking,” IET Radar, Sonar & Navigation, vol. 13, no. 11, pp. 1899–1908, 2019.
  • [24] G. Shan, G. Xu, and C. Qiao, “A non-myopic scheduling method of radar sensors for maneuvering target tracking and radiation control,” Defence Technology, vol. 16, no. 1, pp. 242–250, 2020.
  • [25] Q. Dong and C. Pang, “Risk-based non-myopic sensor scheduling in target threat level assessment,” IEEE Access, vol. 9, pp. 76 379–76 394, 2021.
  • [26] G. Shan and Z. Zhang, “Non-myopic sensor scheduling for low radiation risk tracking using mixed POMDP,” Transactions of the Institute of Measurement and Control, vol. 39, no. 2, pp. 230–243, 2017.
  • [27] S. Howard, S. Suvorova, and B. Moran, “Optimal policy for scheduling of Gauss–Markov systems,” in Proceedings of the Seventh International Conference on Information Fusion.   Citeseer, 2004, pp. 888–892.
  • [28] P. Whittle, “Restless bandits: Activity allocation in a changing world,” Journal of Applied Probability, vol. 25, no. A, pp. 287–298, 1988.
  • [29] J. C. Gittins, “Bandit processes and dynamic allocation indices,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 41, no. 2, pp. 148–164, 1979.
  • [30] V. Krishnamurthy and R. J. Evans, “Hidden Markov model multiarm bandits: A methodology for beam scheduling in multitarget tracking,” IEEE Transactions on Signal Processing, vol. 49, no. 12, pp. 2893–2908, 2001.
  • [31] J. Gittins, K. Glazebrook, and R. Weber, Multi-armed bandit allocation indices.   John Wiley & Sons, 2011.
  • [32] C. H. Papadimitriou and J. N. Tsitsiklis, “The complexity of optimal queuing network control,” Mathematics of Operations Research, vol. 24, no. 2, pp. 293–305, 1999.
  • [33] J. Niño-Mora, “Markovian restless bandits and index policies: A review,” Mathematics, vol. 11, no. 7, p. 1639, 2023.
  • [34] R. R. Weber and G. Weiss, “On an index policy for restless bandits,” Journal of applied probability, vol. 27, no. 3, pp. 637–648, 1990.
  • [35] B. F. La Scala and B. Moran, “Optimal target tracking with restless bandits,” Digital Signal Processing, vol. 16, no. 5, pp. 479–487, 2006.
  • [36] C. R. Dance and T. Silander, “When are Kalman-filter restless bandits indexable?” in Proceedings of the 28th Conference on Neural Information Processing Systems (NIPS), Montreal, Canada.   Cambridge, MA: MIT Press, 2015, pp. 1711–1719.
  • [37] K. Liu and Q. Zhao, “Indexability of restless bandit problems and optimality of whittle index for dynamic multichannel access,” IEEE Transactions on Information Theory, vol. 56, no. 11, pp. 5547–5567, 2010.
  • [38] J. Niño-Mora and S. S. Villar, “Sensor scheduling for hunting elusive hiding targets via Whittle’s restless bandit index policy,” in International Conference on NETwork Games, Control and Optimization (NetGCooP 2011).   IEEE, 2011, pp. 1–8.
  • [39] J. Wang, X. Ren, Y. Mo, and L. Shi, “Whittle index policy for dynamic multichannel allocation in remote state estimation,” IEEE Transactions on Automatic Control, vol. 65, no. 2, pp. 591–603, 2019.
  • [40] J. Niño-Mora, “Restless bandits, partial conservation laws and indexability,” Advances in Applied Probability, vol. 33, no. 1, pp. 76–98, 2001.
  • [41] ——, “Dynamic allocation indices for restless projects and queueing admission control: A polyhedral approach,” Mathematical programming, vol. 93, no. 3, pp. 361–413, 2002.
  • [42] ——, “Restless bandit marginal productivity indices, diminishing returns and optimal control of make-to-order/make-to-stock M/G/1 queues,” Mathematics of Operations Research, vol. 31, no. 1, pp. 50–84, 2006.
  • [43] ——, “A verification theorem for threshold-indexability of real-state discounted restless bandits,” Mathematics of Operations Research, vol. 45, no. 2, pp. 465–496, 2020.
  • [44] ——, “Conservation laws and related applications,” in Wiley Encyclopedia of Operations Research and Management Science.   New York, NY, USA: Wiley Online Library, 2011.
  • [45] ——, “An index policy for dynamic fading-channel allocation to heterogeneous mobile users with partial observations,” in Proceedings of the 4th EuroNGI Conference on Next Generation Internet Networks (NGI), Krakow, Poland.   New York, NY, USA: IEEE, 2008, pp. 231–238.
  • [46] J. Niño-Mora and S. S. Villar, “Multitarget tracking via restless bandit marginal productivity indices and Kalman filter in discrete time,” in Proceedings of the Joint 48th IEEE Conference on Decision and Control (CDC) / 28th Chinese Control Conference (CCC), Shanghai, China.   New York, NY, USA: IEEE, 2009, pp. 2905–2910.
  • [47] J. Niño-Mora, “Whittle’s index policy for multi-target tracking with jamming and nondetections,” in Proceedings of the 23rd International Conference on Analytical and Stochastic Modelling Techniques and Applications (ASMTA), Cardiff, Wales.   Cham, Switzerland: Springer, 2016, pp. 210–222.
  • [48] ——, “A verification theorem for indexability of discrete time real state discounted restless bandits,” arXiv:1512.04403, 2015.
  • [49] C. R. Dance and T. Silander, “Optimal policies for observing time series and related restless bandit problems,” The Journal of Machine Learning Research, vol. 20, no. 1, pp. 1218–1310, 2019.
  • [50] F. Yang and X. Luo, “A restless MAB-based index policy for UL pilot allocation in massive MIMO over Gauss–Markov fading channels,” IEEE Transactions on Vehicular Technology, vol. 69, no. 3, pp. 3034–3047, 2020.
  • [51] R. Visina, Y. Bar-Shalom, and P. Willett, “Multiple-model estimators for tracking sharply maneuvering ground targets,” IEEE Transactions on Aerospace and Electronic Systems, vol. 54, no. 3, pp. 1404–1414, 2018.
  • [52] W. D. Blair and Y. Bar-Shalom, “MSE design of nearly constant velocity Kalman filters for tracking targets with deterministic maneuvers,” IEEE Transactions on Aerospace and Electronic Systems, vol. 59, no. 4, pp. 4180–4191, 2023.
  • [53] D. B. Brown and J. E. Smith, “Index policies and performance bounds for dynamic selection problems,” Management Science, vol. 66, no. 7, pp. 3029–3050, 2020.
Yuhang Hao received the B.Sc. degree and the M.Sc. degree in control theory and control engineering from Northwestern Polytechnical University, Xi’an, China, in 2016 and 2019 respectively. He is currently pursuing the Ph.D. degree with the Key Laboratory of information Fusion Technology, Ministry of Education, Northwestern Polytechnical University, Xi’an, China. His research interests include resource scheduling, multi-armed bandits, target tracking, and information fusion.
Zengfu Wang received the B.Sc. degree in applied mathematics, the M.Sc. degree in control theory and control engineering, the Ph.D. degree in control science and engineering from Northwestern Polytechnical University, Xi’an, China, in 2005, 2008, and 2013 respectively. From 2014 to 2017, he was a Lecturer with Northwestern Polytechnical University, where he is currently an Associate Professor. From 2014 to 2015, he was a Postdoctoral Research Fellow with the Department of Electronic Engineering, City University of Hong Kong, Hong Kong, China. Since December 2019, he has been a Visiting Researcher with the Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Delft, The Netherlands. His research interests include path planning, discrete optimization, and information fusion.
José Niño-Mora earned a Lic. (B.Sc./M.Sc.) degree in Mathematical Sciences from Complutense University in Madrid, Spain, and a Ph.D. degree in Operations Research from Massachusetts Institute of Technology (MIT) on a Fulbright fellowship, in 1989 and 1995, respectively. After postdoc stints at MIT and at the Catholic University of Louvain at Louvain-la-Neuve, Belgium (as a Marie Curie fellow), he was a Visiting Professor at the Economics & Business Department of Pompeu Fabra University in Barcelona, and then joined in 2003 the Statistics Department at Carlos III University of Madrid, Spain, where he is Full Professor of Operations Research & Statistics. He was the recipient of the 1st Award for Best Methodological Contribution in Operations Research in 2020 by the Spanish Statistics and Operations Research Society and the BBVA Foundation. His research interests include dynamic resource allocation in stochastic systems, Markov decision models, restless bandit problems, and index policies.
**g Fu received the B.Eng. degree in computer science from Shanghai Jiao Tong University, Shanghai, China, in 2011, and the Ph.D. degree in electronic engineering from the City University of Hong Kong in 2016. She has been with the School of Mathematics and Statistics, the University of Melbourne as a Post-Doctoral Research Associate from 2016 to 2019. She has been a lecturer in the discipline of Electronic & Telecommunications Engineering, RMIT University, since 2020. Her research interests now include energy-efficient networking/scheduling, resource allocation in large-scale networks, semi-Markov/Markov decision processes, restless multi-armed bandit problems, and stochastic optimization.
Min Yang received the B.Sc. degree from Shandong Agricultural University in 2015, Taian, China. She received the M.Sc. degree in Control science and engineering from University of Science and Technology Liaoning, Liaoning, China, in 2019. She is currently pursuing the Ph.D.degree with the Key Laboratory of information Fusion Technology, Ministry of Education, Northwestern Polytechnical University, Xi’an, China. Her research interests include multi-armed bandits, resource allocation, and target tracking.
Quan Pan received the B.Sc. degree from Huazhong Institute of Technology in 1991, and the M.Sc. and Ph.D. degrees from Northwestern Polytechnical University in 1991 and 1997, respectively. Since 1997, he has been a professor with the School of Automation, Northwestern Polytechnical University. His research interests include information fusion, hybrid system estimation theory, multi-scale estimation theory, target tracking and image processing. He is a Member of the International Society of Information Fusion, a Board Member of the Chinese Association of Automation, and a Member of the Chinese Association of Aeronautics and Astronautics. He was the recipient of the 6th Chinese National Youth Award for Outstanding Contribution to Science and Technology in 1998 and the Chinese National New Century Excellent Professional Talent in 2000.
zjwfZ85lAGg2+06hmGgXq+j3+/DsixYlgVN03a9Xu8jgCNCyIegIAgx13Vfd7vdu+FweG8YRkjXdWy329+dTgeSJD3ieZ7RNO0VAXAPwDEAO5VKndi2fWrb9jWl9Esul6PZbDY9Go1OZ7PZ9z/lyuD3OozU2wAAAABJRU5ErkJggg==" alt="[LOGO]">