A Batch Sequential Halving Algorithm without
Performance Degradation

Sotetsu Koyamada (ATR, Kyoto University)
Soichiro Nishimori (The University of Tokyo)
Shin Ishii (Kyoto University, ATR)
Abstract

In this paper, we investigate the problem of pure exploration in the context of multi-armed bandits, with a specific focus on scenarios where arms are pulled in fixed-size batches. Batching has been shown to enhance computational efficiency, but it can degrade performance relative to the original sequential algorithm because of delayed feedback and reduced adaptability. We introduce a simple batch version of the Sequential Halving (SH) algorithm (Karnin et al., 2013) and provide theoretical evidence that batching does not degrade the performance of the original algorithm under practical conditions. Furthermore, we empirically validate our claim through experiments, demonstrating the robust nature of the SH algorithm in fixed-size batch settings.

1 Introduction

In this study, we consider the pure exploration problem in the field of stochastic multi-armed bandits, which aims to identify the best arm within a given budget (Audibert et al., 2010). Specifically, we concentrate on the fixed-size batch pulls setting, where we pull a fixed number of arms simultaneously. Batch computation plays a crucial role in improving computational efficiency, especially in large-scale bandit applications where reward computation can be expensive. For instance, consider applying this to tree search algorithms like Monte Carlo tree search (Tolpin & Shimony, 2012). The reward computation here typically involves the value network evaluation (Silver et al., 2016; 2017), which can be computationally expensive. By leveraging batch computation and hardware accelerators (e.g., GPUs), we can significantly reduce the computational cost of the reward computation. However, while batch computation enhances computational efficiency, its performance (e.g., simple regret) may not match that of sequential computation with the same total budget, due to delayed feedback reducing adaptability. Therefore, the objective of this study is to develop a pure exploration algorithm that maintains its performance regardless of the batch size.

We focus on the Sequential Halving (SH) algorithm (Karnin et al., 2013), a popular and well-analyzed pure exploration algorithm. Due to its simplicity, efficiency, and lack of task-dependent hyperparameters, SH has found practical applications in areas such as hyperparameter tuning (Jamieson & Talwalkar, 2016), recommendation systems (Aziz et al., 2022), and the state-of-the-art AlphaZero (Silver et al., 2018) and MuZero (Schrittwieser et al., 2020) family (Danihelka et al., 2022). In this study, we aim to extend SH to a batched version that matches the original SH algorithm's performance, even with large batch sizes. To date, Jun et al. (2016) have introduced a simple batched extension of SH and reported that it performed well in their experiments. However, the theoretical properties of batched SH have not yet been well studied in the setting of fixed-size batch pulls.

We consider two simple and natural batched variants of SH (Sec. 3): Breadth-first Sequential Halving (BSH) and Advance-first Sequential Halving (ASH). We introduce BSH as an intermediate step to understanding ASH, which is our main focus. Our main contribution is providing a theoretical guarantee for ASH (Sec. 4), showing that it is algorithmically equivalent to SH as long as the batch budget is not extremely small. For example, in a 32-armed stochastic bandit problem, ASH can match SH's choice with 100K sequential pulls using just 20 batch pulls, each of size 5K. This means that ASH can achieve the same performance as SH with far fewer pull operations (20 batch pulls instead of 100K sequential pulls) when the batch size is reasonably large. Moreover, one can understand the theoretical properties of ASH using the theoretical properties of SH, which have been well studied (Karnin et al., 2013; Zhao et al., 2023). In our experiments, we validate our claim by comparing the behavior of ASH and SH (Sec. 5.1) and also analyze the behavior of ASH with an extremely small batch budget (Sec. 5.2).

2 Preliminary

Pure Exploration Problem.

Consider a pure exploration problem involving $n$ arms and a budget $T$. We define a reward matrix $\mathcal{R} \in [0,1]^{n \times T}$, where each element $\mathcal{R}_{i,j} \in [0,1]$ represents the reward of the $j$-th pull of arm $i \in [n] \coloneqq \{1, \ldots, n\}$, with $j$ being counted independently for each arm. Each element in the $i$-th row is an i.i.d. sample from an unknown reward distribution of the $i$-th arm with mean $\mu_i$. Without loss of generality, we assume that $1 \geq \mu_1 \geq \mu_2 \geq \ldots \geq \mu_n \geq 0$. In the standard sequential setting, a pure exploration algorithm sequentially observes $T$ elements from $\mathcal{R}$ by pulling arms one by one $T$ times. The algorithm then selects one arm as the best arm candidate. Note that we only consider deterministic pure exploration algorithms in this study. Such an algorithm can be characterized by a mapping $\pi : [0,1]^{n \times T} \to [n]$ that takes $\mathcal{R}$ as input and outputs the selected arm $a_T$.
The natural performance measure in pure exploration is the simple regret, defined as $\mathbb{E}_{\mathcal{R}}[\mu_1 - \mu_{a_T}]$ (Bubeck et al., 2009), which compares the performance of the selected arm $a_T$ with the best arm $1$.

Sequential Halving (SH; Karnin et al. (2013)) is a sequential elimination algorithm designed for the pure exploration problem. It begins by initializing the set of best arm candidates as $\mathcal{S}_0 \coloneqq [n]$. In each of the $\lceil \log_2 n \rceil$ rounds, the algorithm halves the set of candidates (i.e., $|\mathcal{S}_{r+1}| = \lceil |\mathcal{S}_r| / 2 \rceil$) until it narrows down the candidates to a single arm in $\mathcal{S}_{\lceil \log_2 n \rceil}$. During each round $r \in \{0, \ldots, \lceil \log_2 n \rceil - 1\}$, the arms in the active arm set $\mathcal{S}_r$ are pulled equally $J_r \coloneqq \bigl\lfloor \frac{T}{|\mathcal{S}_r| \lceil \log_2 n \rceil} \bigr\rfloor$ times, and the total budget consumed for round $r$ is $T_r \coloneqq J_r \times |\mathcal{S}_r|$.
The SH algorithm is described in Algorithm 1. We denote the mapping induced by the SH algorithm as $\pi_{\text{SH}}$. It has been shown that the simple regret of SH satisfies $\mathbb{E}_{\mathcal{R}}[\mu_1 - \mu_{a_T}] \leq \tilde{\mathcal{O}}(\sqrt{n/T})$, where $\tilde{\mathcal{O}}(\cdot)$ ignores the logarithmic factors of $n$ (Zhao et al., 2023). Note that the consumed budget $\sum_{r < \lceil \log_2 n \rceil} T_r$ might be less than $T$. In this study, we assume that the remaining budget is consumed equally by the last two arms in the final round.
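
The budget arithmetic above can be checked with a short sketch that computes $|\mathcal{S}_r|$, $J_r$, $T_r$, and the budget left unscheduled. The function name `sh_schedule` is an illustrative assumption and does not come from the paper.

```python
def sh_schedule(n: int, T: int):
    """Return per-round (|S_r|, J_r, T_r) tuples and the leftover budget for SH."""
    rounds = (n - 1).bit_length()      # equals ceil(log2(n)) for n >= 1
    schedule, size = [], n
    for _ in range(rounds):
        J_r = T // (size * rounds)     # pulls per active arm in this round
        schedule.append((size, J_r, J_r * size))
        size = (size + 1) // 2         # ceil(size / 2) arms survive the round
    leftover = T - sum(T_r for _, _, T_r in schedule)
    return schedule, leftover


if __name__ == "__main__":
    for n, T in [(32, 100_000), (8, 100)]:
        print(n, T, *sh_schedule(n, T))
```

For $n = 32$ and $T = 100{,}000$, each of the five rounds consumes 20,000 pulls and nothing is left over; for $n = 8$ and $T = 100$, each round consumes 32 pulls and 4 pulls remain unscheduled.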

Algorithm 1 SH: Sequential Halving (Karnin et al., 2013)
1: input number of arms: $n$, budget: $T$
2: initialize best arm candidates $\mathcal{S}_0 \coloneqq [n]$
3: for round $r = 0, \ldots, \lceil \log_2 n \rceil - 1$ do
4:     pull each arm $a \in \mathcal{S}_r$ for $J_r = \left\lfloor \frac{T}{|\mathcal{S}_r| \lceil \log_2 n \rceil} \right\rfloor$ times
5:     $\mathcal{S}_{r+1} \leftarrow$ top-$\lceil |\mathcal{S}_r| / 2 \rceil$ arms in $\mathcal{S}_r$ w.r.t. the empirical rewards
6: return the only arm in $\mathcal{S}_{\lceil \log_2 n \rceil}$
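
Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the name `sequential_halving`, the `pull` callback, and the Bernoulli means in the demo are hypothetical, and the sketch assumes that arms are ranked by the mean over all of their pulls so far, with the leftover budget ignored.

```python
import random


def sequential_halving(pull, n: int, T: int) -> int:
    """Run SH (Algorithm 1); `pull(a)` returns one reward in [0, 1] for arm a."""
    rounds = (n - 1).bit_length()              # ceil(log2(n))
    candidates = list(range(n))
    total = [0.0] * n                          # cumulative reward per arm
    count = [0] * n                            # cumulative pulls per arm

    def mean(a):
        # Assumption: cumulative empirical mean; unpulled arms count as 0.
        return total[a] / count[a] if count[a] else 0.0

    for _ in range(rounds):
        J_r = T // (len(candidates) * rounds)  # pulls per active arm
        for a in candidates:
            for _ in range(J_r):
                total[a] += pull(a)
                count[a] += 1
        keep = (len(candidates) + 1) // 2      # ceil(|S_r| / 2)
        candidates = sorted(candidates, key=mean, reverse=True)[:keep]
    return candidates[0]


if __name__ == "__main__":
    rng = random.Random(0)
    mus = [0.2, 0.8, 0.5, 0.4]                 # hypothetical Bernoulli means
    print(sequential_halving(lambda a: float(rng.random() < mus[a]), len(mus), 400))
```

Arms are indexed from 0 here, whereas the text indexes them from 1.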

3 Batch Sequential Halving Algorithms

In this study, we consider the fixed-size batch pulls setting, where we simultaneously pull $b$ arms for $B$ times, with $b$ being the fixed batch size and $B$ being the batch budget (Jun et al., 2016). The standard sequential case corresponds to $b = 1$ and $B = T$. Our interest is to compare the performance of the batch SH algorithms with a large batch size $b$ and a small batch budget $B$ to that of the standard SH algorithm when pulling sequentially $T$ times. Therefore, we compare the performance of the batch SH algorithms under the assumption that $T = b \times B$ holds, so that the total budget is the same in both the sequential and batch settings. In this section, we first reconstruct the SH algorithm so that it can be easily extended to the batched setting (Sec. 3.1). Then, we consider Breadth-first Sequential Halving (BSH), one of the simplest batched extensions of SH, as an intermediate step (Sec. 3.2). Finally, we introduce Advance-first Sequential Halving (ASH) as a further extension (Sec. 3.3).

3.1 SH implementation with target pulls

Algorithm 2 SH (Karnin et al., 2013) implementation with target pulls $L^{\text{B}}$/$L^{\text{A}}$
1: input number of arms: $n$, budget: $T$
2: initialize empirical mean $\bar{\mu}_a \coloneqq 0$ and arm pulls $N_a \coloneqq 0$ for all $a \in [n]$
3: for $t = 0, \ldots, T - 1$ do
4:     let $\mathcal{A}_t$ be $\{a \in [n] \mid N_a = L_t\}$ $\triangleright$ $L_t$ is either $L_t^{\text{B}}$ (2) or $L_t^{\text{A}}$ (3)
5:     pull arm $a_t \coloneqq \text{argmax}_{a \in \mathcal{A}_t} \bar{\mu}_a$
6:     update $\bar{\mu}_{a_t}$ and $N_{a_t} \leftarrow N_{a_t} + 1$
7: return $\text{argmax}_{a \in [n]} (N_a, \bar{\mu}_a)$
Algorithm 3 Breadth-first target pulls $L^{\text{B}}$
1: input number of arms: $n$, budget: $T$
2: initialize empty $L^{\text{B}}$, $K \coloneqq n$, $J \coloneqq 0$
3: for $r = 0, \ldots, \lceil \log_2 n \rceil - 1$ do
4:     for $\vartriangleright$ $j = 0, \ldots, J_r - 1$ do
5:         for $\blacktriangleright$ $k = 0, \ldots, K - 1$ do
6:             append $J + j$ to $L^{\text{B}}$
7:     $K \leftarrow \lceil K/2 \rceil$ and $J \leftarrow J + J_r$
8: return $L^{\text{B}}$ $\triangleright$ (0,0,0,...)
Algorithm 4 Advance-first target pulls $L^{\text{A}}$
1: input number of arms: $n$, budget: $T$
2: initialize empty $L^{\text{A}}$, $K \coloneqq n$, $J \coloneqq 0$
3: for $r = 0, \ldots, \lceil \log_2 n \rceil - 1$ do
4:     for $\blacktriangleright$ $k = 0, \ldots, K - 1$ do
5:         for $\vartriangleright$ $j = 0, \ldots, J_r - 1$ do
6:             append $J + j$ to $L^{\text{A}}$
7:     $K \leftarrow \lceil K/2 \rceil$ and $J \leftarrow J + J_r$
8: return $L^{\text{A}}$ $\triangleright$ (0,1,2,...)
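
Algorithms 3 and 4 differ only in the order of their two inner loops, which a single Python function can make explicit. This is a sketch under the assumption that $J_r$ is recomputed from the current $K$ as $\lfloor T / (K \lceil \log_2 n \rceil) \rfloor$; the name `target_pulls` and the `advance_first` flag are illustrative.

```python
def target_pulls(n: int, T: int, advance_first: bool) -> list:
    """Build L^B (advance_first=False, Algorithm 3) or L^A (True, Algorithm 4)."""
    rounds = (n - 1).bit_length()   # ceil(log2(n))
    L, K, J = [], n, 0
    for _ in range(rounds):
        J_r = T // (K * rounds)     # pulls per active arm in this round
        if advance_first:
            # Outer loop over arms, inner loop over pulls:
            # the run J, J+1, ..., J+J_r-1 is repeated K times.
            L.extend(J + j for _ in range(K) for j in range(J_r))
        else:
            # Outer loop over pulls, inner loop over arms:
            # each value J+j is repeated K times before moving on.
            L.extend(J + j for j in range(J_r) for _ in range(K))
        K, J = (K + 1) // 2, J + J_r
    return L


if __name__ == "__main__":
    print(target_pulls(8, 192, advance_first=False)[:18])
    print(target_pulls(8, 192, advance_first=True)[:18])
```

Both sequences contain the same multiset of values, so every arm that survives a round ends up with the same number of pulls under either ordering.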

Since BSH and ASH are natural batched extensions of SH, we first reconstruct the implementation of the SH algorithm as Algorithm 2 so that it can be easily extended to BSH and ASH. Note that, in this study, the operation $\text{argmax}_{x \in \mathcal{X}} (\ell_x, m_x)$ selects the element $x \in \mathcal{X}$ that maximizes $\ell_x$ first. If multiple elements achieve this maximum, it then selects among these the one that maximizes $m_x$. At the $t$-th arm pull, SH selects the arm $a_t$ that has the highest empirical reward $\bar{\mu}_a$ among the candidates $\mathcal{A}_t$:

$$a_t \coloneqq \text{argmax}_{a \in \mathcal{A}_t} \bar{\mu}_a, \tag{1}$$

where $\mathcal{A}_t \coloneqq \{a \in [n] \mid N_a = L_t\}$ are the candidates at the $t$-th arm pull, $N_a$ is the total number of pulls of arm $a$, and $L_t$ is the number of target pulls at $t$, defined in either a breadth-first manner,

$$L_t^{\text{B}} \coloneqq \underbrace{\sum_{r' < r(t)} J_{r'}}_{\text{pulls before } r(t)} + \underbrace{\left\lfloor \frac{t - \sum_{r' < r(t)} T_{r'}}{|\mathcal{S}_{r(t)}|} \right\rfloor}_{\text{pulls in } r(t)}, \tag{2}$$

or an advance-first manner,

$$L_t^{\text{A}} \coloneqq \underbrace{\sum_{r' < r(t)} J_{r'}}_{\text{pulls before } r(t)} + \underbrace{\left( \left( t - \sum_{r' < r(t)} T_{r'} \right) \bmod J_{r(t)} \right)}_{\text{pulls in } r(t)}, \tag{3}$$

where $r(t)$ is the round of the $t$-th arm pull. This $L_t^{\text{B}}$/$L_t^{\text{A}}$ represents the cumulative number of pulls of the arm selected at the $t$-th pull before the $t$-th arm pull. We omit the dependency on $n$ and $T$ for simplicity. The definition of $L_t^{\text{B}}$/$L_t^{\text{A}}$ is somewhat complicated, and it may be more straightforward to write down the algorithms that construct $L^{\text{B}} \coloneqq (L_0^{\text{B}}, \ldots, L_T^{\text{B}})$ and $L^{\text{A}} \coloneqq (L_0^{\text{A}}, \ldots, L_T^{\text{A}})$, as shown in Algorithm 3 and Algorithm 4, respectively. Note that the choice between $L^{\text{B}}$ and $L^{\text{A}}$ is arbitrary and does not affect the behavior of SH as long as the arm pulls are sequential (not batched). Python code for this SH implementation is available in App. A. Note that using target pulls to implement SH is natural and not new. For example, Mctx (https://github.com/google-deepmind/mctx; Babuschkin et al., 2020) has a similar implementation.
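
To illustrate the claim that the ordering does not matter for sequential pulls, the sketch below evaluates Eqs. (2) and (3) in closed form and runs Algorithm 2 on a fixed reward matrix under both. It is not the code from App. A. The names `schedule`, `target_pull`, and `sh_with_target_pulls` are hypothetical, arms are 0-indexed, ties go to the lowest index, and only the scheduled $\sum_r T_r$ pulls are executed (the leftover budget is ignored).

```python
import random


def schedule(n, T):
    """Active-set sizes |S_r| and per-arm pulls J_r for each SH round."""
    rounds = (n - 1).bit_length()              # ceil(log2(n))
    sizes, K = [], n
    for _ in range(rounds):
        sizes.append(K)
        K = (K + 1) // 2
    return sizes, [T // (size * rounds) for size in sizes]


def target_pull(t, sizes, J, advance_first):
    """Closed-form L_t: Eq. (2) if advance_first is False, Eq. (3) otherwise."""
    pulls_before, t_before = 0, 0              # sums of J_r' and T_r' over r' < r(t)
    for K, J_r in zip(sizes, J):
        if t < t_before + K * J_r:             # t falls in this round, i.e. r(t)
            offset = t - t_before
            return pulls_before + (offset % J_r if advance_first else offset // K)
        pulls_before, t_before = pulls_before + J_r, t_before + K * J_r
    raise ValueError("t exceeds the scheduled pulls")


def sh_with_target_pulls(R, T, advance_first):
    """Algorithm 2 on a reward matrix R, where R[a][j] is the j-th reward of arm a."""
    n = len(R)
    sizes, J = schedule(n, T)
    N, S = [0] * n, [0.0] * n                  # pull counts and reward sums

    def mean(a):
        return S[a] / N[a] if N[a] else 0.0

    for t in range(sum(K * J_r for K, J_r in zip(sizes, J))):
        L_t = target_pull(t, sizes, J, advance_first)
        candidates = [a for a in range(n) if N[a] == L_t]
        a_t = max(candidates, key=mean)        # ties go to the lowest index
        S[a_t] += R[a_t][N[a_t]]
        N[a_t] += 1
    return max(range(n), key=lambda a: (N[a], mean(a)))


if __name__ == "__main__":
    rng = random.Random(0)
    n, T = 8, 192
    R = [[rng.random() for _ in range(T)] for _ in range(n)]
    print(sh_with_target_pulls(R, T, False) == sh_with_target_pulls(R, T, True))
```

Because each arm always consumes its own row of `R` in order, the empirical means at the end of every round are the same under both orderings, so the two runs should promote the same arms whenever there are no exact ties.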

3.2 BSH: Breadth-first Sequential Halving

Now, we extend SH to BSH, in which we select arms so that the number of pulls of each arm becomes as equal as possible using $L^{\text{B}}$. Note that $L^{\text{B}}$ uses $T = b \times B$ as the scheduled total budget. When pulling arms in a batch, we need to consider not only the number of pulls of the arms but also the number of scheduled pulls in the current batch. Therefore, we introduce virtual arm pulls $M_a$, the number of scheduled pulls of arm $a$ in the current batch. For each batch pull, we sequentially select $b$ arms with the highest empirical rewards from the candidates $\{a \in [n] \mid N_a + M_a = L_t^{\text{B}}\}$ and pull them as a batch. The BSH algorithm is described in App. B. BSH is similar to a batched extension of SH introduced in Jun et al. (2016) in the sense that it selects arms so that the number of pulls of each arm becomes as equal as possible.
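
The role of the virtual pulls $M_a$ can be seen by filling a single batch. The sketch below is an illustration of the selection step described above, not the BSH listing from App. B; the function name `select_batch_bsh` is hypothetical, arms are 0-indexed, and ties go to the lowest index.

```python
from collections import Counter


def select_batch_bsh(N, means, targets):
    """Fill one batch: for each target L_t, take the best arm with N_a + M_a == L_t."""
    M = [0] * len(N)                       # virtual pulls scheduled in this batch
    batch = []
    for L_t in targets:
        candidates = [a for a in range(len(N)) if N[a] + M[a] == L_t]
        a_t = max(candidates, key=lambda a: means[a])   # ties go to the lowest index
        batch.append(a_t)
        M[a_t] += 1                        # the arm is now "used up" for this target
    return batch


if __name__ == "__main__":
    n = 8
    # First b = 24 breadth-first targets for n = 8, assuming J_0 >= 3.
    first_targets = [0] * 8 + [1] * 8 + [2] * 8
    print(Counter(select_batch_bsh([0] * n, [0.0] * n, first_targets)))
```

Without $M_a$, every selection inside the batch would see the same stale counts $N_a$ and could keep picking the same arm; with it, the first batch spreads evenly, three pulls per arm in this setting.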

Algorithm 5 ASH: Advance-first Sequential Halving
1: input number of arms: $n$, batch size: $b$, batch budget: $B$
2: initialize counter $t \coloneqq 0$, empirical mean $\bar{\mu}_a \coloneqq 0$, and arm pulls $N_a \coloneqq 0$ for all $a \in [n]$
3: for $B$ times do
4:     initialize empty batch $\mathcal{B}$ and virtual arm pulls $M_a = 0$ for all $a \in [n]$
5:     for $b$ times do
6:         let $\mathcal{A}_t$ be $\{a \in [n] \mid N_a + M_a = L_t^{\text{A}}\}$ $\triangleright$ BSH uses $L_t^{\text{B}}$ instead
7:         push $a_t \coloneqq \text{argmax}_{a \in \mathcal{A}_t} (N_a, \bar{\mu}_a)$ to $\mathcal{B}$ $\triangleright$ BSH uses $\text{argmax}_{a \in \mathcal{A}_t} \bar{\mu}_a$ instead
8:         update $t \leftarrow t + 1$ and $M_{a_t} \leftarrow M_{a_t} + 1$
9:     batch pull arms in $\mathcal{B}$
10:     update $\bar{\mu}_a$ and $N_a \leftarrow N_a + M_a$ for all $a \in \mathcal{B}$
11: return $\text{argmax}_{a \in [n]} (N_a, \bar{\mu}_a)$
Figure 1: Pictorial representation of breadth-first SH (BSH; Sec. 3.2) and advance-first SH (ASH; Sec. 3.3) for an 8-armed bandit problem. Batch size b𝑏bitalic_b is 24242424 and batch budget B𝐵Bitalic_B is 8888. The same color indicates the same batch pull — For example, in the first batch pull (blue), BSH pulls each of the 8 arms 3 times, while ASH pulls 3 arms 8 times each. BSH selects arms so that the number of pulls of each active arm becomes as equal as possible, while ASH selects arms so that once an arm is selected, it is pulled until the budget for the arm in the round is exhausted. These pull sequences are characterized by the target pulls LBsuperscript𝐿BL^{\text{\color[rgb]{0,0,0.80}{{B}}}}italic_L start_POSTSUPERSCRIPT B end_POSTSUPERSCRIPT and LAsuperscript𝐿AL^{\text{\color[rgb]{0.80,0,0}{{A}}}}italic_L start_POSTSUPERSCRIPT A end_POSTSUPERSCRIPT:
LB=superscript𝐿BabsentL^{\text{\color[rgb]{0,0,0.80}{{B}}}}=italic_L start_POSTSUPERSCRIPT B end_POSTSUPERSCRIPT = (0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,5,...)
LA=superscript𝐿AabsentL^{\text{\color[rgb]{0.80,0,0}{{A}}}}=italic_L start_POSTSUPERSCRIPT A end_POSTSUPERSCRIPT = (0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7,0,1,2,3,4,5,...)
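The two target-pull sequences in the caption of Fig. 1 can be reproduced with a short sketch. It assumes the SH schedule used in Sec. 4: $\lceil\log_2 n\rceil$ rounds, $|\mathcal{S}_{r+1}| = \lceil|\mathcal{S}_r|/2\rceil$, and a per-arm round budget $J_r = \lfloor T / (|\mathcal{S}_r|\lceil\log_2 n\rceil)\rfloor$ with $T = b \times B$. The function name `target_pulls` and its `order` argument are illustrative and are not taken from the authors' code.

```python
def target_pulls(n, T, order):
    """Return the target-pull sequence for n arms and total budget T.

    order="breadth" gives L^B, order="advance" gives L^A.
    Illustrative helper (an assumption, not the authors' implementation).
    """
    rounds = (n - 1).bit_length()      # equals ceil(log2(n)) for n >= 2
    seq, k, done = [], n, 0            # k = |S_r|, done = pulls per surviving arm so far
    for _ in range(rounds):
        J = T // (k * rounds)          # per-arm budget J_r in round r
        if order == "advance":
            # finish one arm (done, ..., done + J - 1) before starting the next
            seq += [done + j for _ in range(k) for j in range(J)]
        else:
            # give every active arm its j-th pull before any arm gets the (j+1)-th
            seq += [done + j for j in range(J) for _ in range(k)]
        done += J
        k = (k + 1) // 2               # ceil(k / 2) arms survive
    return seq


# Setting of Fig. 1: n = 8 arms, b = 24, B = 8, so T = 192.
L_B = target_pulls(8, 24 * 8, "breadth")
L_A = target_pulls(8, 24 * 8, "advance")
print(L_B[:12])   # [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
print(L_A[:12])   # [0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3]
```

Both sequences contain the same multiset of targets per round; only the order in which the targets are visited differs.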

3.3 ASH: Advance-first Sequential Halving

We further extend SH to ASH in a manner similar to BSH. The ASH algorithm is described in Algorithm 5. Fig. 1 shows the pictorial representation of BSH and ASH. Python code for this ASH implementation is available in App. A. The differences between BSH and ASH are that:

  1. ASH selects arms in an advance-first manner using $L^{\text{A}}$ instead of $L^{\text{B}}$ (line 6), and

  2. ASH considers not only the empirical rewards $\bar{\mu}_a$ but also the number of actual pulls $N_a$ when selecting arms in a batch (line 7).

The second difference ensures that, when the batch spans two rounds, the arm to be promoted is selected from the arms that have completed pulling (e.g., see the 3rd batch pull in Fig. 1). Note that this second modification is not useful for BSH. Let $\pi_{\text{ASH}} : [0,1]^{n \times T} \to [n]$ be the mapping induced by the ASH algorithm. In Sec. 4, we will show that ASH is algorithmically equivalent to SH with the same total budget $T = b \times B$; that is, $\pi_{\text{ASH}}$ is identical to $\pi_{\text{SH}}$.
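The selection rule described above can be sketched in a few lines of NumPy. This is a minimal sketch and not the authors' code from App. A. It rests on several assumptions that are labeled in the comments: rewards are read from a fixed table `rewards[a, i]` (the reward of the $(i+1)$-th pull of arm $a$), ties are broken toward the lowest arm index, any budget left after the last round is not used, and sequential SH is represented by running the same procedure with batch size 1.

```python
import numpy as np


def advance_first_targets(n, T):
    """Advance-first target pulls L^A for n arms and total budget T."""
    rounds = (n - 1).bit_length()          # ceil(log2 n) for n >= 2
    targets, k, done = [], n, 0            # k = |S_r|, done = pulls per surviving arm
    for _ in range(rounds):
        J = T // (k * rounds)              # per-arm budget J_r
        targets += [done + j for _ in range(k) for j in range(J)]
        done, k = done + J, (k + 1) // 2
    return targets


def ash(rewards, b, B):
    """ASH sketch; rewards has shape (n, T) with T >= b * B (assumed reward table)."""
    n = rewards.shape[0]
    L = advance_first_targets(n, b * B)
    N = np.zeros(n, dtype=int)             # actual pulls N_a
    mu = np.zeros(n)                       # empirical means
    t = 0
    for _ in range(B):
        M = np.zeros(n, dtype=int)         # virtual pulls M_a within this batch
        for _ in range(b):
            if t >= len(L):                # assumption: leftover budget is not used
                break
            candidates = np.flatnonzero(N + M == L[t])
            # lexicographic argmax over (N_a, mean_a); ties go to the lowest index
            a = max(candidates, key=lambda i: (N[i], mu[i]))
            M[a] += 1
            t += 1
        for a in np.flatnonzero(M):        # "batch pull": reveal the next M_a rewards
            N[a] += M[a]
            mu[a] = rewards[a, :N[a]].mean()
    return max(range(n), key=lambda i: (N[i], mu[i]))


rng = np.random.default_rng(0)
rewards = rng.random((6, 8 * 12))          # n = 6 arms, T = b * B = 96
print(ash(rewards, b=8, B=12), ash(rewards, b=1, B=96))
```

In this setting ($n = 6$, $b = 8$, $B = 12$) two batches straddle a round boundary, so the comparison against the batch-size-1 run exercises the case that the $(N_a, \bar{\mu}_a)$ ordering is meant to handle.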

4 Algorithmic Equivalence of SH and ASH

This section presents a theoretical guarantee for the ASH algorithm.

Theorem 1

Given a stochastic bandit problem with $n \geq 2$ arms, let $b \geq 2$ be the batch size and $B$ be the batch budget satisfying $B \geq \max\{4, \frac{n}{b}\}\lceil\log_2 n\rceil$. Then, the ASH algorithm (Algorithm 5) is algorithmically equivalent to the SH algorithm (Algorithm 2) with the same total budget $T = b \times B$; that is, the mapping $\pi_{\text{ASH}}$ is identical to $\pi_{\text{SH}}$.
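As a quick aid for checking whether a given configuration falls under Theorem 1, the condition on $B$ can be evaluated with integer arithmetic. The helper names below are hypothetical and only restate the inequality $B \geq \max\{4, \frac{n}{b}\}\lceil\log_2 n\rceil$.

```python
def min_batch_budget(n, b):
    """Smallest integer B with B >= max(4, n / b) * ceil(log2 n)."""
    rounds = (n - 1).bit_length()      # ceil(log2 n) for n >= 2
    c1 = -(-n * rounds // b)           # ceil(n * rounds / b), from (n / b) * ceil(log2 n)
    c2 = 4 * rounds                    # from 4 * ceil(log2 n)
    return max(c1, c2)


def theorem1_applies(n, b, B):
    """True if (n, b, B) satisfies the assumptions of Theorem 1."""
    return n >= 2 and b >= 2 and B >= min_batch_budget(n, b)


print(min_batch_budget(8, 24))         # 12
print(theorem1_applies(8, 24, 8))      # False
print(min_batch_budget(1024, 4))       # 2560
print(min_batch_budget(1024, 1024))    # 40
```

Note that the illustrative setting of Fig. 1 ($n = 8$, $b = 24$, $B = 8$) lies below the threshold of 12, so the figure shows the mechanics of the pull order rather than a configuration covered by the guarantee.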

Proof sketch
Figure 2: Inequality (4).

A key observation is that ASH and SH differ only when a batch pull spans two rounds, like the 3rd batch pull in Fig. 1. In this case, ASH may promote an incorrect arm to the next round that would not have been promoted in SH. We can prove that such incorrect promotion does not occur under the condition $B \geq \max\{4, \frac{n}{b}\}\lceil\log_2 n\rceil$. This is done by demonstrating that the inequality (4) holds for any $z < b$, the number of pulls for the current round $r$ in the batch. Fig. 2 illustrates (4).

Proof. The condition $B \geq \max\{4, \frac{n}{b}\}\lceil\log_2 n\rceil$ is divided into two separate conditions:

$$B \geq \frac{n}{b}\lceil\log_2 n\rceil, \tag{C1}$$

and

$$B \geq 4\lceil\log_2 n\rceil. \tag{C2}$$

We focus on the scenario where a batch pull spans two rounds. In this case, let $z < b$ be the number of pulls that consume the budget for round $r$, and $b - z$ be the number of pulls that consume the budget for round $r + 1$. The following proposition is demonstrated: $\forall n \geq 2$, $\forall b \geq 2$, $\forall r < \lceil\log_2 n\rceil - 1$, $\forall z < b$, if (C1) and (C2) hold, then

$$|\mathcal{S}_{r+1}| - \left\lceil \frac{b - z}{J_{r+1}} \right\rceil \geq \left\lceil \frac{z}{J_r} \right\rceil. \tag{4}$$

The left-hand side (LHS) of (4) is the number of arms promoted to the next round after the batch pull, whereas the right-hand side (RHS) is the number of arms whose pulls are still pending at the time of the batch pull. If this inequality is satisfied, then even when a batch spans two rounds, arms that are supposed to advance to the next round in SH are not left behind in ASH, i.e., no incorrect promotion occurs. It suffices to consider the case $z = b - 1$, as it is the worst case. Let $x \coloneqq |\mathcal{S}_r| \geq 3$ for the given $r < \lceil\log_2 n\rceil - 1$. Two cases are considered.

Case 1: $n \leq 4b$. Given that $J_r = \left\lfloor \frac{b \times B}{x \lceil\log_2 n\rceil} \right\rfloor \geq \lfloor 4b/x \rfloor$ as derived from (C2), it is sufficient to show

$$\left\lceil \frac{x}{2} \right\rceil - 1 \geq \left\lceil \frac{b - 1}{\lfloor 4b/x \rfloor} \right\rceil \tag{5}$$

for $x \in [3, 4b]$. This assertion is directly supported by Lemma 1.

Case 2: $4b < n$. Given that $J_r = \left\lfloor \frac{b \times B}{x \lceil\log_2 n\rceil} \right\rfloor \geq \lfloor n/x \rfloor$ as derived from (C1), it is sufficient to show $\left\lceil \frac{x}{2} \right\rceil - 1 \geq \left\lceil \frac{n/4 - 1}{\lfloor n/x \rfloor} \right\rceil$ for $x \in [3, n]$. This conclusion follows by the same reasoning applied in Case 1. $\square$
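Inequality (4) is a purely arithmetic statement, so it can be checked exhaustively for a given triple $(n, b, B)$. The sketch below assumes $|\mathcal{S}_0| = n$, $|\mathcal{S}_{r+1}| = \lceil |\mathcal{S}_r| / 2 \rceil$, and $J_r = \lfloor bB / (|\mathcal{S}_r| \lceil\log_2 n\rceil) \rfloor$, and it takes $1 \leq z < b$; the function name is hypothetical.

```python
def ceil_div(a, b):
    """Ceiling of a / b for positive integers."""
    return -(-a // b)


def inequality4_holds(n, b, B):
    """Check (4) for every round r < ceil(log2 n) - 1 and every 1 <= z < b.

    Assumes (C1) holds so that every J_r is at least 1.
    """
    rounds = (n - 1).bit_length()                   # ceil(log2 n)
    sizes = [n]
    for _ in range(rounds - 1):
        sizes.append((sizes[-1] + 1) // 2)          # |S_{r+1}| = ceil(|S_r| / 2)
    J = [(b * B) // (s * rounds) for s in sizes]    # J_r for each round
    for r in range(rounds - 1):
        for z in range(1, b):
            lhs = sizes[r + 1] - ceil_div(b - z, J[r + 1])
            rhs = ceil_div(z, J[r])
            if lhs < rhs:
                return False
    return True


print(inequality4_holds(6, 8, 12))    # True  (B meets the bound of Theorem 1)
print(inequality4_holds(8, 24, 12))   # True  (B meets the bound of Theorem 1)
print(inequality4_holds(8, 24, 8))    # False (B is below 4 * ceil(log2 n) = 12)
```

The last call fails at $r = 1$ with $z \geq 17$, where $\lceil z / J_1 \rceil = 2$ exceeds $|\mathcal{S}_2| - 1 = 1$.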

Lemma 1

For any integer $b \geq 2$, the inequality $\left\lceil \frac{x}{2} \right\rceil - 1 \geq \left\lceil \frac{b - 1}{\lfloor 4b/x \rfloor} \right\rceil$ holds for all integers $x \in [3, 4b]$.

Figure 3: Lemma 1.

The proof of Lemma 1 is in App. C. Here, we visualize (5) in Fig. 3 to show intuitively that Lemma 1 holds. Each colored line represents the RHS for a different $b \leq 32$. One can see that the LHS is never smaller than the RHS for any $x \in [3, 4b]$.
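The same range of $b$ can also be checked numerically. The following brute-force sketch evaluates the gap between the two sides of (5) for every integer $b \in [2, 32]$ and $x \in [3, 4b]$; `lemma1_gap` is a hypothetical helper name.

```python
def lemma1_gap(b, x):
    """LHS minus RHS of (5) for integers b >= 2 and 3 <= x <= 4b."""
    lhs = -(-x // 2) - 1                      # ceil(x / 2) - 1
    rhs = -(-(b - 1) // ((4 * b) // x))       # ceil((b - 1) / floor(4b / x))
    return lhs - rhs


gaps = {(b, x): lemma1_gap(b, x)
        for b in range(2, 33)
        for x in range(3, 4 * b + 1)}
print(min(gaps.values()))                     # 0
```

The minimum gap of 0 is attained, for example, at $b = 2$, $x = 3$, so the inequality holds with equality in some cases.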

Remark 1

The condition (C1) is common to both SH and ASH: SH implicitly assumes $T \geq n\lceil\log_2 n\rceil$ as the minimum condition to execute. This is because we need to pull each arm at least once in the first round (i.e., $J_1 \geq 1$). By the same argument, the batch budget $B$ must satisfy (C1). On the other hand, (C2) is specific to ASH and is required to ensure the equivalence. As we discuss in Sec. 4.1, this additional condition (C2) is not practically problematic.

Remark 2

Note that the condition (C2) is tight; Theorem 1 does not hold if (C2) is weakened to $B \geq \alpha\lceil\log_2 n\rceil$ for any positive value $\alpha < 4$.

Proof. We aim to demonstrate the existence of a value $x$ such that $\left\lceil \frac{x}{2} \right\rceil - 1 - \left\lceil \frac{b - 1}{\lfloor \alpha b / x \rfloor} \right\rceil < 0$ when $n \leq \alpha b$. Consider the case when $x = 4$. In this scenario, the LHS of the inequality can be rewritten as $1 - \left\lceil \frac{b - 1}{\lfloor \alpha b / 4 \rfloor} \right\rceil \leq 1 - \frac{b - 1}{\lfloor \alpha b / 4 \rfloor} \leq 1 - \frac{4}{\alpha} \frac{b - 1}{b} \to 1 - \frac{4}{\alpha}$ as $b \to \infty$. As $\alpha < 4$, it follows that $\text{LHS} < 0$ for sufficiently large values of $b$. $\square$
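The limiting argument can be made concrete by searching for the smallest batch size at which the $x = 4$ inequality fails for a given $\alpha$. The sketch below uses exact rational arithmetic so that $\lfloor \alpha b / 4 \rfloor$ is not affected by floating-point rounding; the function name and the search bound `b_max` are illustrative assumptions.

```python
from fractions import Fraction


def first_violation(alpha, b_max=1000):
    """Smallest b in [2, b_max] for which the x = 4 inequality fails, else None."""
    for b in range(2, b_max + 1):
        J = (alpha * b) // 4                  # floor(alpha * b / 4), exact for Fraction
        if J == 0:
            continue                          # no per-arm budget; skipped in this sketch
        lhs = 1 - (-(-(b - 1) // J))          # ceil(4 / 2) - 1 - ceil((b - 1) / J)
        if lhs < 0:
            return b
    return None


print(first_violation(Fraction(3)))           # 5
print(first_violation(Fraction(39, 10)))      # 41
print(first_violation(Fraction(4)))           # None
```

As $\alpha$ approaches 4 from below, the first failing batch size grows, which matches the "sufficiently large $b$" qualifier in the proof.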

Remark 3

When $b$ is sufficiently large, the minimum $B$ that satisfies both (C1) and (C2) is $4\lceil\log_2 n\rceil$. Theorem 1 implies that, for an arbitrarily large target budget $T$, ASH can achieve the same performance as SH by increasing the batch size $b$ without increasing the batch budget $B$ from $4\lceil\log_2 n\rceil$. In this sense, ASH guarantees its scalability in batch computation.

Remark 4

Theorem 1 allows us to understand the properties of ASH based on existing theoretical research on SH, such as the simple regret bound (Zhao et al., 2023).

4.1 Discussion on the conditions

To show that SH and ASH are algorithmically equivalent, we used an additional condition (C2) of $\mathcal{O}(\log n)$. However, we argue that this condition is not practically problematic because the condition (C1), the minimum condition required to execute (unbatched) SH, is dominant ($\mathcal{O}(n \log n)$), as shown in Fig. 4. We can see that the condition (C2) only affects the algorithm when the batch size is sufficiently larger than the number of arms ($b \gg n$). This is a reasonable result, meaning that we cannot guarantee behavior equivalent to SH with an extremely small batch budget, such as $B = 1$. On the other hand, if the user secures the minimum budget $B = 4\lceil\log_2 n\rceil$, which depends only on the number of arms $n$ and increases only logarithmically, then, regardless of the batch size $b$, they can increase the batch size arbitrarily and achieve the same result as when SH is executed sequentially with the same total budget, with high computational efficiency.

Figure 4: Visualization of conditions (C1) and (C2) for $n \leq 1024$, $B \leq 1024$, and $b \in \{4, 64, 1024\}$.

5 Empirical Validation

Figure 5: Polynomial($\alpha$)

We conducted experiments to empirically demonstrate that ASH maintains its performance for large batch size $b$, in comparison to its sequential counterpart SH. To evaluate this, we utilized a polynomial family parameterized by $\alpha$ as a representative batch problem instance, where the reward gap $\Delta_a \coloneqq \mu_1 - \mu_a$ follows a polynomial distribution with parameter $\alpha$: $\Delta_a \propto (a/n)^{\alpha}$ (Jamieson et al., 2013; Zhao et al., 2023). This choice is motivated by the observation that real-world applications exhibit polynomially distributed reward gaps, as mentioned in Zhao et al. (2023). In our study, we considered three different values of $\alpha$ ($0.5$, $1.0$, and $2.0$) to capture various reward distributions (see Fig. 5). Additionally, we characterized each bandit problem instance by specifying the minimum and maximum rewards, denoted as $\mu_{\text{min}}$ and $\mu_{\text{max}}$ respectively. Hence, we denote a bandit problem instance as $\mathcal{T}(n, \alpha, \mu_{\text{min}}, \mu_{\text{max}})$.
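One possible way to instantiate $\mathcal{T}(n, \alpha, \mu_{\text{min}}, \mu_{\text{max}})$ is sketched below. The text only states $\Delta_a \propto (a/n)^{\alpha}$, so the normalization used here (the best arm has mean $\mu_{\text{max}}$, the worst arm has mean $\mu_{\text{min}}$, and ranks are scaled to $[0, 1]$) is an assumption made for illustration, as are the function names.

```python
import numpy as np


def polynomial_means(n, alpha, mu_min, mu_max):
    """Mean rewards whose gaps to the best arm grow polynomially in the arm rank.

    Assumption: gaps are normalized so that the best arm has mean mu_max
    and the worst arm has mean mu_min (requires n >= 2).
    """
    ranks = np.arange(n) / (n - 1)            # 0 for the best arm, 1 for the worst
    return mu_max - (mu_max - mu_min) * ranks ** alpha


def simple_regret(means, selected_arm):
    """Gap between the best mean and the mean of the selected arm."""
    return float(means.max() - means[selected_arm])


means = polynomial_means(16, 2.0, 0.1, 0.9)
print(means[:3], simple_regret(means, 0))
```

With $\alpha < 1$ the gaps open up quickly near the best arm, while with $\alpha > 1$ many arms stay close to the best one, which makes identification harder.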

We also implemented a simple batched extension of SH introduced by Jun et al. (2016) as a baseline for comparison. We refer to this algorithm as Jun+16. The implementation of Jun+16 is described in App. D. Jun et al. (2016) did not provide a theoretical guarantee for Jun+16, but it has shown performance comparable to or better than their proposed algorithm in their experiments.

5.1 Large batch budget scenario: $B \geq 4\lceil\log_2 n\rceil$

First, we empirically confirm that, as we claimed in Sec. 4, ASH is indeed equivalent to SH under the condition (C2). We generated 10K instances of bandit problems and applied ASH and SH to each instance with 100 different seeds. We randomly sampled $n$ from $\{2, \ldots, 1024\}$, $\alpha$ from $\{0.5, 1.0, 2.0\}$, and $\mu_{\text{min}}$ and $\mu_{\text{max}}$ from $\{0.1, 0.2, \ldots, 0.9\}$. For each instance $\mathcal{T}(n, \alpha, \mu_{\text{min}}, \mu_{\text{max}})$, we randomly sampled the batch budget $B \leq 10\lceil\log_2 n\rceil$ and the batch size $b \leq 5n$ so that the conditions (C1) and (C2) are satisfied. As a result, we confirmed that the selected arms of ASH and SH are identical in all 10K instances and 100 seeds for each instance. We also conducted the same experiment for BSH and Jun+16. We plotted the simple regret of BSH, ASH, and Jun+16 against SH in Fig. 6. There are 10K instances, and each point represents the average simple regret of 100 seeds for each instance. To compare the performance, we fitted a linear regression model to the simple regret of BSH, ASH, and Jun+16 against SH as $y = \beta x$, where $y$ is the simple regret of BSH, ASH, or Jun+16, and $x$ is the simple regret of SH. The slope $\beta$ is estimated by the least squares method. The estimated slope $\beta$ is 1.008 for BSH, 1.000 for ASH, and 0.971 for Jun+16, which indicates that the simple regret of ASH, BSH, and Jun+16 is comparable to SH on average.
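For a model $y = \beta x$ without an intercept, the least-squares estimate has the closed form $\beta = \sum_i x_i y_i / \sum_i x_i^2$. The sketch below shows this computation on synthetic numbers chosen purely for illustration; they are not the paper's data, and the function name is hypothetical.

```python
import numpy as np


def slope_through_origin(x, y):
    """Least-squares estimate of beta in y = beta * x (no intercept)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(x @ y / (x @ x))


# Synthetic regrets for illustration only (not taken from the experiments).
x = np.array([0.01, 0.02, 0.04, 0.08])
print(slope_through_origin(x, 1.05 * x))
```

A slope close to 1 means the batched variant's simple regret tracks that of SH; values above 1 indicate higher regret on average.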

Figure 6: Simple regret comparison of BSH, ASH, and Jun+16 against SH when $B \geq 4\lceil\log_2 n\rceil$.

5.2 Small batch budget scenario: $B < 4\lceil\log_2 n\rceil$

Figure 7: Simple regret comparison of BSH, ASH, and Jun+16 against SH when $B < 4\lceil\log_2 n\rceil$.

Next, we examined the performance of BSH, ASH, and Jun+16 against SH when the additional condition (C2) is not satisfied, i.e., when the batch budget is extremely small, $B < 4\lceil\log_2 n\rceil$, and thus Theorem 1 does not apply. We conducted the same experiment as in Sec. 5.1 except that the batch budget satisfies $B < 4\lceil\log_2 n\rceil$. We sampled $B$ so that $B$ is larger than the number of rounds. The results are shown in Fig. 7. The slope $\beta$ is estimated as 1.059 for BSH, 1.011 for ASH, and 1.017 for Jun+16. All the estimated slopes are worse than when $B \geq 4\lceil\log_2 n\rceil$. However, the estimated slopes are still close to 1, which indicates that while we do not have a theoretical guarantee, the performance of BSH, ASH, and Jun+16 is comparable to SH on average.

6 Related Work

Sequential Halving.

Among the algorithms for the pure exploration problem in multi-armed bandits (Audibert et al., 2010), Sequential Halving (SH; Karnin et al. (2013)) is one of the most popular. The theoretical properties of SH have been well studied (Karnin et al., 2013; Zhao et al., 2023). Due to its simplicity, SH has been widely used in applications including, but not limited to, the following. In the context of tree-search algorithms, as the root node selection of Monte Carlo tree search can be regarded as a pure exploration problem (Tolpin & Shimony, 2012), Danihelka et al. (2022) incorporated SH into the root node selection and significantly reduced the number of simulations to improve the performance during AlphaZero/MuZero training. From the min-max search perspective, some studies recursively applied SH to the internal nodes of the search tree (Cazenave, 2014; Pepels et al., 2014). SH is also used for hyperparameter optimization; Jamieson & Talwalkar (2016) formalized the hyperparameter optimization problem in machine learning as a non-stochastic multi-armed bandit problem, where the reward signal does not come from stochastic stationary distributions but from a deterministic function that changes over training steps. Li et al. (2018; 2020) applied SH to hyperparameter optimization in asynchronous parallel settings, which is similar to our batch setting. Their asynchronous approach may make incorrect promotions to the next rounds but is more efficient than the synchronous approach. Aziz et al. (2022) applied SH to recommendation systems, which identify appealing podcasts for users.

Batched bandit algorithms.

Batched bandit algorithms have been studied in various contexts (Perchet et al., 2016; Gao et al., 2019; Esfandiari et al., 2021; Jin et al., 2021a; b; Kalkanli & Ozgur, 2021; Karbasi et al., 2021; Provodin et al., 2022). Among the batched bandit studies for the pure exploration problem (Agarwal et al., 2017; Grover et al., 2018; Jun et al., 2016), Jun et al. (2016) is the most relevant to our work as they also consider the fixed-size batch pulls setting. To the best of our knowledge, the first batched SH variant with a fixed batch size $b$ was introduced by Jun et al. (2016) as a baseline algorithm in their study (Jun+16). It is similar to BSH in that it pulls arms so that the number of pulls of the arms is as equal as possible (breadth-first manner). They reported that Jun+16 experimentally performs comparably to or better than their proposed method but did not provide a theoretical guarantee for Jun+16. Our ASH is different from their batch variant in that ASH pulls arms in an advance-first manner instead of a breadth-first manner.

7 Limitation and Future Work

Our batched variants of SH assume that the rewards of each arm are drawn i.i.d. from a fixed distribution. This property is essential to allow batch pulls. One limitation is that it may be difficult to apply our algorithms to bandit problems where the reward distribution is non-stationary. For example, Jamieson & Talwalkar (2016) applied SH to hyperparameter tuning, where rewards are time-series losses during model training. We cannot apply our batched variants to this problem because we cannot observe “future losses” in a batch.

Our batched variants of SH are suitable for tasks where arms can be evaluated efficiently in batches rather than sequentially. For instance, when the evaluation of arms depends on the output of neural networks, the process can be efficiently conducted in batches using accelerators like GPUs. An example of this scenario is provided by Danihelka et al. (2022), where value networks are used in Monte Carlo tree search. Applying our batched variants to such algorithms is a possible future direction. Additionally, combining them with reinforcement learning environments that run on GPU/TPU accelerators (Freeman et al., 2021; Lange, 2022; Koyamada et al., 2023; Gulino et al., 2023; Nikulin et al., 2023; Bonnet et al., 2024; Rutherford et al., 2024; Matthews et al., 2024) for efficient batch evaluation is also promising.

8 Conclusion

In this paper, we proposed ASH as a simple and natural extension of the SH algorithm. We theoretically showed that ASH is algorithmically equivalent to SH as long as the batch budget is not excessively small. This allows ASH to inherit the well-studied theoretical properties of SH, including the simple regret bound. Our experimental results confirmed this claim and demonstrated that ASH and other batched variants of SH, like Jun+16, perform comparably to SH in terms of simple regret. These findings suggest that we can utilize simple batched variants of SH for efficient evaluation of arms with large batch sizes while avoiding performance degradation compared to the sequential execution of SH. By providing a practical solution for efficient arm evaluation, our study opens up new possibilities for applications that require large budgets. Overall, our work highlights the batch robust nature of SH and its potential for large-scale bandit problems.

Broader Impact Statement

The findings in this work on the bandit problem are focused on theoretical results and do not involve direct human or ethical implications. Therefore, concerns related to broader ethical, humanitarian, and societal issues are not applicable to this research. However, if our approach is applied to large-scale bandit problems, especially when batch evaluation involves large neural networks, there could be an indirect impact on energy consumption due to the computational resources required.

Acknowledgments

This paper is based on results obtained from a project, JPNP20006, subsidized by the New Energy and Industrial Technology Development Organization (NEDO), and partly supported by KAKENHI (No. 22H04998 and 23H04676) from Japan Society for the Promotion of Science (JSPS). We sincerely thank the reviewers for their invaluable feedback and constructive comments, which have significantly enhanced the quality of this research. We would also like to express our gratitude to the developers of the software libraries utilized in this research, including NumPy (Harris et al., 2020), SciPy (Virtanen et al., 2020), Matplotlib (Hunter, 2007), JAX (Bradbury et al., 2018), and Mctx (Babuschkin et al., 2020).

References

Appendix A Python code

For the sake of reproducibility and a better understanding, we provide Python code for the Sequential Halving (SH) algorithm using advance-first target pulls and the Advance-first Sequential Halving (ASH) algorithm in Fig. 8.

Figure 8: Python implementation of the SH algorithm using advance-first target pulls (Algorithm 2) and the ASH algorithm (Algorithm 5).

Appendix B BSH algorithm

Algorithm 6 shows the detailed BSH algorithm (see Sec. 3.2).

Algorithm 6 BSH: Breadth-first Sequential Halving
1: input number of arms: $n$, batch size: $b$, batch budget: $B$
2: initialize counter $t \coloneqq 0$, empirical mean $\bar{\mu}_a \coloneqq 0$, and arm pulls $N_a \coloneqq 0$ for all $a \in [n]$
3: for $B$ times do
4:     initialize empty batch $\mathcal{B}$ and virtual arm pulls $M_a = 0$ for all $a \in [n]$
5:     for $b$ times do
6:         let $\mathcal{A}_t$ be $\{a \in [n] \mid N_a + M_a = L_t^{\text{B}}\}$
7:         push $a_t \coloneqq \text{argmax}_{a \in \mathcal{A}_t} \bar{\mu}_a$ to $\mathcal{B}$
8:         update $t \leftarrow t + 1$ and $M_{a_t} \leftarrow M_{a_t} + 1$
9:     batch pull arms in $\mathcal{B}$
10:     update $\bar{\mu}_a$ and $N_a \leftarrow N_a + M_a$ for all $a \in \mathcal{B}$
11: return $\text{argmax}_{a \in [n]} (N_a, \bar{\mu}_a)$

Appendix C Proof of Lemma 1

Lemma 1

For any integer $b \geq 2$, the inequality

$$\left\lceil \frac{x}{2} \right\rceil - 1 \geq \left\lceil \frac{b - 1}{\lfloor 4b/x \rfloor} \right\rceil \tag{6}$$

holds for all integers $x \in [3, 4b]$.

Proof. We show that for any integer $b \geq 2$ and $x \in [3, 4b]$, the inequality (6) is satisfied. Since $z \geq c \implies z \geq \lceil c \rceil$ for any integer $z$ and real number $c$, it suffices to demonstrate that

$$\left\lceil \frac{x}{2} \right\rceil - 1 \geq \frac{b - 1}{\lfloor 4b/x \rfloor} \iff \left\lceil \frac{x}{2} \right\rceil - 1 - \frac{b - 1}{\lfloor 4b/x \rfloor} \geq 0. \tag{7}$$

Given that $\left\lfloor \frac{4b}{x} \right\rfloor > 0$, (7) is equivalent to

$$\left( \left\lceil \frac{x}{2} \right\rceil - 1 \right) \left\lfloor \frac{4b}{x} \right\rfloor - (b - 1) \geq 0, \tag{8}$$

so it suffices to show (8) for any integer $b \geq 2$ and $x \in [3, 4b]$. Two cases are considered:

Case 1: x𝑥xitalic_x is even. Suppose x=2y𝑥2𝑦x=2yitalic_x = 2 italic_y, with y[2,2b]𝑦22𝑏y\in[2,2b]italic_y ∈ [ 2 , 2 italic_b ]. We aim to show that

(y1)2by(b1)0.𝑦12𝑏𝑦𝑏10\displaystyle\left(y-1\right)\left\lfloor\frac{2b}{y}\right\rfloor-(b-1)\geq 0.( italic_y - 1 ) ⌊ divide start_ARG 2 italic_b end_ARG start_ARG italic_y end_ARG ⌋ - ( italic_b - 1 ) ≥ 0 . (9)

Two sub-cases are considered:

  1. For $y \in [b+1, 2b]$, we have $1 \leq \frac{2b}{y} < 2$ and hence $\left\lfloor \frac{2b}{y} \right\rfloor = 1$, so $\text{LHS} = (y-1) - (b-1) \geq 0$.

  2. For $y \in [2, b]$, as $\lfloor c \rfloor > c - 1$ for any real number $c$, we have $\text{LHS} > \left( y - 1 \right) \left( \frac{2b}{y} - 1 \right) - (b-1) = -\frac{(y-2)(y-b)}{y}$. As $y > 0$ and $-(y-2)(y-b) \geq 0$ for $y \in [2, b]$ (since $y - 2 \geq 0$ and $y - b \leq 0$), we have $\text{LHS} \geq 0$.

Consequently, inequality (9) holds for all even values of $x$.
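For reference, the algebra behind the second sub-case of Case 1 can be written out step by step. This is only an expansion of the lower bound stated there and introduces no new assumptions.

```latex
\begin{align*}
(y-1)\left(\frac{2b}{y}-1\right)-(b-1)
  &= \frac{(y-1)(2b-y) - (b-1)\,y}{y} \\
  &= \frac{-y^{2} + (b+2)\,y - 2b}{y} \\
  &= -\frac{(y-2)(y-b)}{y}.
\end{align*}
```

As a quick sanity check, take $b = 3$ and $y = 2$ (that is, $x = 4$): the lower bound evaluates to $0$, while the left-hand side of (9) is $(2-1)\lfloor 6/2 \rfloor - 2 = 1$, consistent with the strict inequality.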

Case 2: $x$ is odd. Suppose $x = 2y + 1$, with $y \in [1, 2b-1]$. Then $\lceil x/2 \rceil - 1 = y$, so we aim to show that

$$y \left\lfloor \frac{4b}{2y+1} \right\rfloor - (b-1) \geq 0. \tag{10}$$

Two sub-cases are considered:

  1. For $y \in [b, 2b-1]$, we have $1 < \frac{4b}{2y+1} < 2$ and hence $\left\lfloor \frac{4b}{2y+1} \right\rfloor = 1$, so $\text{LHS} = y - (b-1) \geq 0$.

  2. For $y \in [1, b-1]$, as $\lfloor c \rfloor > c - 1$ for any real number $c$, we have $\text{LHS} > y \left( \frac{4b}{2y+1} - 1 \right) - (b-1) = \frac{2by - b - 2y^2 + y + 1}{2y+1} = \frac{-2y\left(y - \left(b + \frac{1}{2}\right)\right) - (b-1)}{2y+1}$. As $2y + 1 > 0$ and the numerator $-2y\left(y - \left(b + \frac{1}{2}\right)\right) - (b-1)$ is concave in $y$ and nonnegative at both endpoints of $[1, b-1]$, it is nonnegative on the whole interval, and we have $\text{LHS} \geq 0$.

Similarly, inequality (10) holds for all odd values of $x$.
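The nonnegativity of the numerator in the second sub-case of Case 2 can be checked at the endpoints of $[1, b-1]$. The name $g$ below is introduced only for this illustration and does not appear in the original proof.

```latex
\begin{align*}
g(y) &\coloneqq -2y\left(y - \left(b + \tfrac{1}{2}\right)\right) - (b-1)
      = -2y^{2} + (2b+1)\,y - (b-1), \\
g(1) &= -2 + (2b+1) - (b-1) = b, \\
g(b-1) &= (b-1)\bigl(-2(b-1) + (2b+1) - 1\bigr) = 2(b-1).
\end{align*}
```

Both values are nonnegative for $b \geq 2$, and since $g$ is a concave quadratic in $y$, it stays above the smaller of its two endpoint values throughout the interval.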

Therefore, through the analysis of these two cases, it is proven that for any integer $b \geq 2$ and $x \in [3, 4b]$, the inequality (8) is satisfied, thereby confirming the validity of (6). $\square$
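The inequality (8) can also be checked numerically over a finite range, as a sanity check on the case analysis rather than a substitute for it. The following is a minimal sketch; the function names `inequality_8_holds` and `find_counterexamples` and the upper limit on $b$ are illustrative choices, not part of the original proof.

```python
def inequality_8_holds(b: int, x: int) -> bool:
    """Return True if (ceil(x/2) - 1) * floor(4b/x) - (b - 1) >= 0."""
    ceil_half = (x + 1) // 2    # ceil(x / 2) for a positive integer x
    floor_ratio = (4 * b) // x  # floor(4b / x)
    return (ceil_half - 1) * floor_ratio - (b - 1) >= 0


def find_counterexamples(max_b: int) -> list:
    """Collect all (b, x) with 2 <= b <= max_b and 3 <= x <= 4b violating (8)."""
    return [
        (b, x)
        for b in range(2, max_b + 1)
        for x in range(3, 4 * b + 1)
        if not inequality_8_holds(b, x)
    ]


# Illustrative range only; the proof covers all integers b >= 2.
print(find_counterexamples(200))
```

The lower limit $x \geq 3$ matters: for $x = 2$ the first factor is $\lceil 2/2 \rceil - 1 = 0$, so the left-hand side of (8) equals $-(b-1) < 0$.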

Appendix D Batch Sequential Halving introduced in Jun et al. (2016)

Algorithm 7 shows the detailed batched version of the Sequential Halving algorithm introduced in Jun et al. (2016).

Algorithm 7 Batched Sequential Halving introduced in Jun et al. (2016)
input number of arms: $n$, batch budget: $B$, batch size: $b$
initialize best arm candidates $\mathcal{S}_0 \coloneqq [n]$
for round $r = 0, \ldots, \lceil \log_2 n \rceil - 1$ do
     for $\bigl\lfloor B / \lceil \log_2 n \rceil \bigr\rfloor$ times do
         select batch actions $\mathcal{B}$ so that the number of pulls of each arm in $\mathcal{S}_r$ is as equal as possible
         pull arms $\mathcal{B}$ in the batch
     $\mathcal{S}_{r+1} \leftarrow$ top-$\lceil |\mathcal{S}_r| / 2 \rceil$ arms in $\mathcal{S}_r$ w.r.t. the empirical rewards
return the only arm in $\mathcal{S}_{\lceil \log_2 n \rceil}$
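The steps of Algorithm 7 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of Jun et al. (2016): the reward oracle `pull`, the function name, the tie-breaking by arm index, and the choice to accumulate empirical rewards across rounds are all assumptions made here, since the pseudocode does not fix them.

```python
import random


def batched_sequential_halving(pull, n, B, b):
    """Sketch of Algorithm 7; `pull(arm)` is a hypothetical reward oracle."""
    if n < 2:
        raise ValueError("need at least two arms")
    num_rounds = (n - 1).bit_length()    # equals ceil(log2(n)) for n >= 2
    batches_per_round = B // num_rounds  # floor(B / ceil(log2(n)))
    candidates = list(range(n))          # S_0 = [n]
    sums = [0.0] * n
    counts = [0] * n

    def mean(arm):
        # Assumption: rewards accumulate across rounds; unpulled arms rank last.
        return sums[arm] / counts[arm] if counts[arm] else float("-inf")

    for _round in range(num_rounds):
        for _batch in range(batches_per_round):
            # Fill the batch by repeatedly taking the least-pulled candidate,
            # which keeps pull counts within S_r as equal as possible.
            planned = {arm: counts[arm] for arm in candidates}
            batch = []
            for _slot in range(b):
                arm = min(candidates, key=planned.__getitem__)
                planned[arm] += 1
                batch.append(arm)
            # In practice the batch would be evaluated in one vectorized call.
            for arm in batch:
                sums[arm] += pull(arm)
                counts[arm] += 1
        keep = (len(candidates) + 1) // 2  # ceil(|S_r| / 2)
        candidates = sorted(candidates, key=mean, reverse=True)[:keep]
    return candidates[0]


# Usage with hypothetical Bernoulli arms (the means are made up for illustration).
rng = random.Random(0)
means = [0.2, 0.5, 0.9, 0.4, 0.1, 0.3, 0.6, 0.7]
best = batched_sequential_halving(
    lambda arm: float(rng.random() < means[arm]), n=8, B=60, b=4
)
print(best)
```

Note that the per-round budget here is counted in batches, so each round performs $\lfloor B / \lceil \log_2 n \rceil \rfloor \cdot b$ pulls regardless of how many candidates remain.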