A Class of Computational Methods to Reduce Selection Bias when Designing Phase 3 Clinical Trials

Tianyu Zhan
Data and Statistical Sciences, AbbVie Inc., North Chicago, IL, USA. Tianyu Zhan is an employee of AbbVie Inc. Corresponding author email address: [email protected].

Abstract

When designing confirmatory Phase 3 studies, one usually evaluates one or more efficacious and safe treatment option(s) based on data from previous studies. However, several retrospective research articles reported the phenomenon of “diminished treatment effect in Phase 3” based on many case studies. Even under basic assumptions, it was shown that the commonly used estimator could substantially overestimate the efficacy of selected group(s). As alternatives, we propose a class of computational methods to reduce estimation bias and mean squared error (MSE) with a broader scope of multiple treatment groups and flexibility to accommodate summary results by group as input. Based on simulation studies and a real data example, we provide practical implementation guidance for this class of methods under different scenarios. For more complicated problems, our framework can serve as a starting point with additional layers built in. Proposed methods can also be widely applied to other selection problems.

Keywords: Bias correction; Estimation; Higher-order Bootstrap; Jackknife.

1 Introduction

In clinical drug development, confirmatory Phase 3 studies are usually conducted to comprehensively evaluate the safety and efficacy of the study drug after exploratory Phase 2 studies (ICH Guideline E8, 2022). To properly design Phase 3 studies, one needs to accurately characterize the efficacy profile of one or more selected efficacious and safe treatment option(s) to inform many key decisions, for example, Go/No-Go and sample size calculation. However, quite a few retrospective research articles reported the phenomenon of “diminished treatment effect in Phase 3”: FDA studied 22 recent cases in which promising Phase 2 results were not confirmed in Phase 3 and found 21 of them were due to lack of efficacy (Food and Drug Administration, 2017); treatment effect sizes of progression-free survival (PFS) were on average $26\%$ larger in Phase 2 as compared with Phase 3 in 57 pairs of oncology studies (Liang et al., 2019); 35 out of 43 Phase 3 studies of chemotherapy in advanced solid malignancies had lower response rates than preceding Phase 2 studies (Zia et al., 2005).

This question concerning the efficacy gap between previous studies and Phase 3 studies is important, but also challenging to resolve. Several factors may contribute to this gap, for example, temporal drift due to improvements in the standard of care or other factors (Saville et al., 2022), patient heterogeneity across studies (Liang et al., 2019), and variability in results due to limited sample size. As a starting point, we consider a typical approach of directly choosing the treatment group(s) with the best outcome(s) based on previous studies, and using the corresponding results as assumptions for the Phase 3 design. Under a basic scenario where the true response mean of a selected group is the same across studies, this estimator may substantially overestimate the response mean, which can leave the Phase 3 study with insufficient power, as further discussed in later sections, including a toy simulation in Table 1. As discussed in Section 6, this framework can be extended with additional layers to handle more complicated problems, e.g., temporal drift, and other base estimators, e.g., the minimum efficacious dose (MED) modeled from MCP-Mod (Bretz et al., 2005).

Several previous theoretical works studied the problem of estimating the larger of two means for specific distributions. Blumenthal and Cohen (1968) and Dahiya (1974) investigated this under two Normal distributions with a common known variance, and showed that no unbiased estimator could exist (Blumenthal and Cohen, 1968). This non-existence of unbiased estimators was further studied for more general distributions from two groups, for example, Normal distributions with common but unknown variance (Hsieh, 1981), a general class of distributions (Ishwaei D et al., 1985), and double exponential distributions with unknown locations (Kumar and Sharma, 1993). On the other hand, unbiased estimators may exist under some special settings, e.g., two gamma distributions with a common and known shape parameter (Vellaisamy and Sharma, 1988). As a general approach, Rosenkranz (2014) proposed to correct bias using non-parametric Bootstrap (Efron and Tibshirani, 1994; Davison and Hinkley, 1997; Kosmidis, 2014) to accommodate general distribution assumptions from two groups. However, patient-level data are needed to implement this method. In this article, we consider a more general scope of “previous studies”, in the sense that it can be in-house Phase 2 studies with patient-level data under multiple doses and/or multiple compounds, or external studies with only summary data available to characterize assumptions of the active comparator(s) in the new Phase 3 study. Additionally, it is common to have more than two treatment options to select from in the design of Phase 3 trials.

Additionally, several methods have been proposed to correct selection bias in clinical trials with two or more stages. Based on Whitehead (1986), Stallard and Todd (2005) developed an iterative approach to reduce the estimation bias conditional on the selection of a treatment group. This method requires analytic derivation of the conditional bias given a specific setting, e.g., the equal-variance setting considered in Stallard and Todd (2005). The single and double Bootstrap methods introduced in Section 3.1 share the idea of correcting bias iteratively, but utilize empirical Bootstrap distributions to estimate the bias. Our proposed approaches are also more general and cover settings with unequal variances. Bauer et al. (2010) investigated the bias and MSE when estimating the efficacy of the best treatment in multi-stage trials with sample size adaptation and homogeneous variance. They emphasized that the quantification of the bias is possible only in designs with planned adaptivity (Bauer et al., 2010). Hwang (1993) and Lindley (1962) proposed a shrinkage estimator with superior performance in terms of Bayes risk as compared with the typical maximum-likelihood estimator (MLE). This shrinkage estimator is briefly reviewed in Section 3.3 with comparison results in Section 4. Two recent papers nicely reviewed point estimation for adaptive trial designs, including bias reduction in multi-arm multi-stage designs with treatment selection (Robertson et al., 2023a, b).

The motivation of our proposed methods is to empirically estimate the bias for correction with computational approaches, such as Bootstrap (Efron and Tibshirani, 1994) or Jackknife (Quenouille, 1949). This framework can naturally accommodate general settings, such as multiple (more than two) groups based on either subject-level data or group-level summary data, homogeneous or heterogeneous variance between treatment groups. Our scope is broader to cover typical Phase 2 studies with patient-level data, and external studies with only summary data available based on literature. As compared with single Bootstrap methods, double Bootstrap methods can further reduce bias with slightly larger mean squared error (MSE) and an additional cost of computation, with results in Section 4. We further propose hybrid estimators based on double Bootstrap estimators and shrinkage estimators (Hwang, 1993; Lindley, 1962) to balance the reductions in both bias and MSE.

The remainder of this article is organized as follows. In Section 2, we introduce the setup of this problem and notations. In Section 3, a class of computational methods is proposed, and an existing shrinkage estimator (Hwang, 1993; Lindley, 1962) is reviewed. Simulation studies are performed in Section 4 to demonstrate the potential gains of those proposed methods in terms of bias and MSE under different settings. We apply our methods to a Phase 2/3 seamless trial in Section 5. Discussions are provided in Section 6.

2 Setup

Consider a previous study with $I$ active treatment groups and $n_{i}$ patients randomized to the $i$th treatment group, for $i=1,\cdots,I$ . We assume that the response $X_{i,j}$ of subject $j$ , for $j=1,\cdots,n_{i}$ , in treatment group $i$ , for $i=1,\cdots,I$ , follows a Normal distribution,

X_{i,j}\sim\mathcal{N}\left(\theta_{i},\sigma_{i}^{2}\right),

(1)

where $\theta_{i}$ is the mean and $\sigma_{i}$ is the standard deviation of the treatment group $i$ . We assume that a larger value of $X_{i,j}$ corresponds to a better outcome, and that $\sigma_{i}$ is unknown. Denote $\bm{X}_{i}=\left(X_{i,1},\cdots,X_{i,n_{i}}\right)$ as the vector of responses from group $i$ , and $\bm{X}=(\bm{X}_{1},\cdots,\bm{X}_{I})$ .

After obtaining data from multiple treatment groups, the study team will usually select one or two treatment group(s) to confirm findings in Phase 3 studies. We consider a motivating scenario where all treatment groups have similar safety profiles, and the most efficacious group will be moved to Phase 3. A key question is how to accurately characterize the efficacy of this selected group for sample size calculation.

The corresponding statistical question is to use observed data $\left(\bm{X}_{1},\cdots,\bm{X}_{I}\right)$ to estimate the parameter of interest $\theta_{max}$ , defined as,

\theta_{max}=\max(\theta_{1},\cdots,\theta_{I}).

(2)

A traditional estimator $\widehat{\theta}$ is commonly used in practice to estimate $\theta_{max}$ :

\widehat{\theta}(\bm{X})=\max\left[\widetilde{\theta}(\bm{X}_{1}),\cdots,\widetilde{\theta}(\bm{X}_{I})\right],

(3)

where $\widetilde{\theta}(x)$ is the sample mean of $x$ , and $\widetilde{\theta}(\bm{X}_{i})$ is an unbiased estimator of $\theta_{i}$ . However, $\widehat{\theta}(\bm{X})$ may overestimate $\theta_{max}$ in finite samples. Even though $\widetilde{\theta}(\bm{X}_{i})$ estimates $\theta_{i}$ with no bias for each treatment group $i$ , one does not know from observed data which treatment group has the highest true response mean $\theta_{i}$ in (2). Taking the maximum of the sample means therefore tends to pick up groups whose sample means happen to exceed their true means.

To provide a numerical illustration of this bias, we conduct a toy simulation with $I=2$ treatment groups, $\theta_{1}=0.9$ , $\theta_{2}=1$ , and $\sigma_{1}=\sigma_{2}=5$ under three values of the per-group sample size $n$ . Table 1 shows that the traditional estimator $\widehat{\theta}$ overestimates $\theta_{max}$ by $40\%$ on average under a moderate sample size $n=40$ , but the bias shrinks as $n$ increases. When $n=40000$ , the probability of selecting the correct treatment group $i=2$ is nearly $100\%$ , and therefore $\widehat{\theta}(\bm{X})$ is close to $\widetilde{\theta}(\bm{X}_{2})$ , an unbiased estimator of $\theta_{max}=\theta_{2}$ .

$\theta_{max}$	$n$	$E\big(\widehat{\theta}\big)$	$E\big(\widehat{\theta}\big)-\theta_{max}$	Prob of correctly selecting $i=2$
1	40	1.40	0.40	0.53
	4000	1.01	0.01	0.82
	40000	1.00	0.00	1.00

Table 1: A toy simulation to evaluate the bias of $\widehat{\theta}$ when estimating $\theta_{max}$ .
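The pattern in Table 1 can be approximated with a short Monte Carlo sketch. This is an illustration rather than the code used for the table: it assumes a common per-group sample size, draws each group's sample mean directly from $\mathcal{N}(\theta_{i},\sigma_{i}^{2}/n)$ instead of simulating individual patients, and the names `toy_bias`, `n_sim` and `seed` are hypothetical.

```python
import math
import random


def toy_bias(thetas, sigmas, n, n_sim=20000, seed=1):
    """Monte Carlo estimate of E[max of sample means] - max(thetas).

    Assumption: each group's sample mean is drawn directly from
    N(theta_i, sigma_i^2 / n) rather than from n simulated patients.
    Returns (bias, probability that the truly best group is selected).
    """
    rng = random.Random(seed)
    best = max(range(len(thetas)), key=lambda i: thetas[i])
    total, correct = 0.0, 0
    for _ in range(n_sim):
        means = [rng.gauss(t, s / math.sqrt(n)) for t, s in zip(thetas, sigmas)]
        picked = max(range(len(means)), key=lambda i: means[i])
        total += means[picked]          # value of the traditional estimator
        correct += picked == best       # was the truly best group selected?
    return total / n_sim - max(thetas), correct / n_sim


if __name__ == "__main__":
    for n in (40, 4000, 40000):
        bias, prob = toy_bias([0.9, 1.0], [5.0, 5.0], n)
        print(f"n={n}: bias={bias:.2f}, P(correct selection)={prob:.2f}")
```

Because the sketch relies on random draws, its output should match Table 1 only up to Monte Carlo error.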

3 Proposed Methods

In this section, we introduce a class of computational methods based on Bootstrap or Jackknife principles to reduce estimation bias.

3.1 Single and Double Bootstrap

Suppose that we have $\widehat{\theta}(\bm{X})$ in (3) as an initial estimator of $\theta_{max}$ . Its bias at $\theta_{0}$ is denoted as $A(\theta_{0})$ ,

A(\theta_{0})=E\left[\widehat{\theta}(\bm{X})\right]-\theta_{0}.

(4)

Since the true value $\theta_{0}$ of $\theta_{max}$ is to be estimated and the functional form of $A(\cdot)$ is usually unknown, one can use $\widehat{A}\left[\widehat{\theta}(\bm{X})\right]$ to approximate $A(\theta_{0})$ ,

\widehat{A}\left[\widehat{\theta}(\bm{X})\right]=\widehat{E}\left[\widehat{\theta}(\bm{X}_{B})\right]-\widehat{\theta}(\bm{X}),

(5)

where $\widehat{E}$ is the empirical expectation based on Monte Carlo Bootstrap data $\bm{X}_{B}$ with size $B$ . The single Bootstrap estimator $\widehat{\theta}^{(1)}(\bm{X})$ (Efron and Tibshirani, 1994; Davison and Hinkley, 1997; Kosmidis, 2014) can then be constructed as,

\widehat{\theta}^{(1)}(\bm{X})=\widehat{\theta}(\bm{X})-\widehat{A}\left[\widehat{\theta}(\bm{X})\right].

(6)

The left-hand side of Figure 1 provides a graphical illustration of the construction above. Algorithm 1 summarizes the workflow to compute $\widehat{\theta}^{(1)}(\bm{X})$ based on $B$ Bootstrap samples.

To further reduce bias, we can iteratively apply the above approach with $\widehat{\theta}^{(1)}(\bm{X})$ as the initial estimator to obtain the double Bootstrap estimator $\widehat{\theta}^{(2)}(\bm{X})$ , as shown on the right-hand side of Figure 1. Algorithm 2 shows that the computation of $\widehat{\theta}^{(2)}(\bm{X})$ requires $B^{2}$ Bootstrap samples. This strategy is analogous to the calibration of Bootstrap to obtain second-order accurate confidence intervals (Efron and Tibshirani, 1994). Based on our simulation studies in Section 4, $\widehat{\theta}^{(2)}(\bm{X})$ has a satisfactory finite-sample performance in terms of bias and MSE. The triple (or even a higher-order) Bootstrap estimator can also be implemented to seek potential improvements, but at the cost of a much heavier computational burden. Section 4.2 provides more discussion on higher-order Bootstrap estimators.

Next we provide more details on simulating Bootstrap samples from data. Taking the single Bootstrap as an example, our strategy is to resample $\bm{X}_{b}^{\ast}$ from observed data $\bm{X}$ blocked by groups. To be more specific, for each treatment group $i$ , we generate a Bootstrap sample $\bm{Y}_{b,i}$ of size $n_{i}$ from $\bm{X}_{i}$ , and then obtain $\bm{X}_{b}^{\ast}=\left(\bm{Y}_{b,1},\cdots,\bm{Y}_{b,I}\right)$ . For sampling methods, one can adopt the Nonparametric Bootstrap (NB) to sample $n_{i}$ observations from $\bm{X}_{i}$ with replacement to get $\bm{Y}_{b,i}$ , as considered in Rosenkranz (2014). An alternative approach is the Parametric Bootstrap (PB) with distribution assumptions, for example, a Normal distribution with the sample mean $\widetilde{\theta}(\bm{X}_{i})$ as the mean parameter and the empirical standard deviation $\widetilde{\sigma}(\bm{X}_{i})$ as the standard deviation parameter. PB also covers scenarios where only summary statistics (e.g., $n$ , $\widetilde{\theta}$ , $\widetilde{\sigma}$ ) for each group are reported in the literature or other external sources. In Section 4, we include scenarios with mixture distributions to evaluate the robustness of PB.

Now we have four Bootstrap estimators: $\widehat{\theta}^{(1)}_{PB}$ , $\widehat{\theta}^{(2)}_{PB}$ , $\widehat{\theta}^{(1)}_{NB}$ , $\widehat{\theta}^{(2)}_{NB}$ , where the superscript “ $(1)$ ” corresponds to the single Bootstrap, “ $(2)$ ” to the double Bootstrap, the subscript “ $PB$ ” to Parametric Bootstrap, and “ $NB$ ” to Nonparametric Bootstrap.
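The Parametric Bootstrap needs only per-group summaries, which can be sketched as follows. This is a minimal illustration under the Normal resampling assumption described above, not the author's implementation; the function name `pb_single_from_summary` and the `(n, mean, sd)` tuple layout are hypothetical.

```python
import random
import statistics


def pb_single_from_summary(summary, B=1000, seed=1):
    """Single Parametric Bootstrap estimate from per-group summaries.

    summary: list of (n_i, mean_i, sd_i) tuples, one per treatment group.
    Assumption: Normal resampling distribution for every group.
    """
    rng = random.Random(seed)
    theta_hat = max(mean for _, mean, _ in summary)  # traditional estimator (3)
    boot_max = []
    for _ in range(B):
        # Resample each group separately (blocked by group), keeping size n_i.
        boot_means = [
            statistics.fmean([rng.gauss(mean, sd) for _ in range(n)])
            for n, mean, sd in summary
        ]
        boot_max.append(max(boot_means))
    bias_hat = statistics.fmean(boot_max) - theta_hat  # equation (5)
    return theta_hat - bias_hat                        # equation (6)


if __name__ == "__main__":
    # Hypothetical summaries: three groups with n = 40 and SD = 5.
    print(pb_single_from_summary([(40, 1.0, 5.0), (40, 1.1, 5.0), (40, 1.2, 5.0)]))
```

Patient-level data can be reduced to the same input by computing the sample size, sample mean and empirical standard deviation of each group first.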

Figure 1: Graphical illustration of the single Bootstrap estimator $\widehat{\theta}^{(1)}$ (left) and the double Bootstrap estimator $\widehat{\theta}^{(2)}$ (right).

Algorithm 1 Single Bootstrap

Input: $\bm{X}$

Procedure:
For Bootstrap index $b$ from $1$ to $B$ , Do:
  Simulate Bootstrap sample $\bm{X}_{b}^{\ast}$ based on $\bm{X}$
End
Compute $\widehat{A}\left[\widehat{\theta}(\bm{X})\right]=\sum_{b=1}^{B}\widehat{\theta}(\bm{X}_{b}^{\ast})/B-\widehat{\theta}(\bm{X})$

Output: $\widehat{\theta}^{(1)}(\bm{X})=\widehat{\theta}(\bm{X})-\widehat{A}\left[\widehat{\theta}(\bm{X})\right]$
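The steps of Algorithm 1 can be sketched for patient-level data with nonparametric resampling. The code is an illustrative sketch, not the author's implementation; `theta_hat` and `single_bootstrap_nb` are hypothetical names, and `groups` is assumed to be a list of per-group response lists.

```python
import random
import statistics


def theta_hat(groups):
    """Traditional estimator (3): the largest sample mean across groups."""
    return max(statistics.fmean(g) for g in groups)


def single_bootstrap_nb(groups, B=1000, seed=1):
    """Algorithm 1 with nonparametric resampling blocked by group."""
    rng = random.Random(seed)
    boot = []
    for _ in range(B):
        # Draw n_i observations with replacement within each group.
        resampled = [rng.choices(g, k=len(g)) for g in groups]
        boot.append(theta_hat(resampled))
    bias_hat = statistics.fmean(boot) - theta_hat(groups)  # estimated bias
    return theta_hat(groups) - bias_hat                     # theta^(1)(X)


if __name__ == "__main__":
    data_rng = random.Random(0)
    # Simulated example data: three groups, n = 40 each, SD = 5.
    groups = [[data_rng.gauss(m, 5.0) for _ in range(40)] for m in (1.0, 1.1, 1.2)]
    print(theta_hat(groups), single_bootstrap_nb(groups))
```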

Algorithm 2 Double Bootstrap

Input: $\bm{X}$

Procedure:
For Bootstrap index $b$ from $1$ to $B$ , Do:
  Simulate Bootstrap sample $\bm{X}_{b}^{\ast}$ based on $\bm{X}$
  For Bootstrap index $c$ from $1$ to $B$ , Do:
    Simulate Bootstrap sample $\bm{X}_{b,c}^{\ast\ast}$ based on $\bm{X}_{b}^{\ast}$
  End
  Compute $\widehat{A}\left[\bm{X}_{b}^{\ast}\right]=\sum_{c=1}^{B}\widehat{\theta}(\bm{X}_{b,c}^{\ast\ast})/B-\widehat{\theta}(\bm{X}_{b}^{\ast})$
  Obtain $\widehat{\theta}^{(1)}(\bm{X}_{b}^{\ast})=\widehat{\theta}(\bm{X}_{b}^{\ast})-\widehat{A}\left[\bm{X}_{b}^{\ast}\right]$
End
Compute $\widehat{A}\left[\widehat{\theta}^{(1)}(\bm{X})\right]=\sum_{b=1}^{B}\widehat{\theta}^{(1)}(\bm{X}_{b}^{\ast})/B-\widehat{\theta}^{(1)}(\bm{X})$

Output: $\widehat{\theta}^{(2)}(\bm{X})=\widehat{\theta}^{(1)}(\bm{X})-\widehat{A}\left[\widehat{\theta}^{(1)}(\bm{X})\right]$
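A compact sketch of Algorithm 2 with nonparametric resampling is given below. It is an illustration under stated assumptions rather than the author's code: the first-level samples $\bm{X}_{b}^{\ast}$ are reused to compute $\widehat{\theta}^{(1)}(\bm{X})$, which the algorithm leaves unspecified, and all function names are hypothetical.

```python
import random
import statistics


def theta_hat(groups):
    """Traditional estimator (3): the largest sample mean across groups."""
    return max(statistics.fmean(g) for g in groups)


def resample(groups, rng):
    """Nonparametric resampling with replacement, blocked by group."""
    return [rng.choices(g, k=len(g)) for g in groups]


def double_bootstrap(groups, B=80, seed=1):
    """Return (theta^(1)(X), theta^(2)(X)) following Algorithm 2."""
    rng = random.Random(seed)
    first_level = []   # theta-hat evaluated on each X_b^*
    theta1_star = []   # theta^(1) evaluated on each X_b^*
    for _ in range(B):
        x_b = resample(groups, rng)
        inner = [theta_hat(resample(x_b, rng)) for _ in range(B)]
        a_b = statistics.fmean(inner) - theta_hat(x_b)   # A-hat[X_b^*]
        first_level.append(theta_hat(x_b))
        theta1_star.append(theta_hat(x_b) - a_b)
    # Assumption: reuse the first-level samples for theta^(1)(X).
    theta1 = theta_hat(groups) - (statistics.fmean(first_level) - theta_hat(groups))
    bias_hat = statistics.fmean(theta1_star) - theta1    # A-hat[theta^(1)(X)]
    return theta1, theta1 - bias_hat


if __name__ == "__main__":
    data_rng = random.Random(0)
    groups = [[data_rng.gauss(m, 5.0) for _ in range(40)] for m in (1.0, 1.1, 1.2)]
    print(double_bootstrap(groups))
```

The nested loop makes the cost grow with $B^{2}$, which is why a moderate $B$ is attractive for the double Bootstrap.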

3.2 Jackknife

The Jackknife is a well-established technique to correct bias (Quenouille, 1949; Miller, 1974). Suppose that the bias of $\widehat{\theta}(\bm{X})$ under sample size $n$ can be expressed as,

E\left[\widehat{\theta}(\bm{X});n\right]-\theta_{0}=\frac{a_{1}}{n}+\frac{a_{2}}{n^{2}}+\mathcal{O}(n^{-3}),

(7)

where $a_{1}$ and $a_{2}$ are unknown and do not depend on $n$ . The bias with a sample size of $n-1$ is,

E\left[\widehat{\theta}(\bm{X});n-1\right]-\theta_{0}=\frac{a_{1}}{n-1}+\frac{a_{2}}{(n-1)^{2}}+\mathcal{O}(n^{-3}).

(8)

In order to correct the order $1/n$ term, one can construct the following Jackknife estimator $\widehat{\theta}_{JK}(\bm{X})$ ,

\widehat{\theta}_{JK}(\bm{X})=n\widehat{\theta}(\bm{X})-(n-1)\widehat{\theta}_{(\bullet)}(\bm{X}),

(9)

where $\widehat{\theta}_{(\bullet)}(\bm{X})={\sum_{j=1}^{n}\widehat{\theta}\left(\bm{X}_{-j}\right)}/{n}$ , and $\bm{X}_{-j}$ is $\bm{X}$ with the $j$th observation deleted.
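The estimator in (9) can be sketched in a few lines. The text does not spell out how $\bm{X}_{-j}$ is formed with several groups, so this sketch makes a loud assumption: all groups share the same size $n$, and $\bm{X}_{-j}$ removes the $j$th observation from every group at once. The names `theta_hat` and `jackknife_estimate` are hypothetical.

```python
import statistics


def theta_hat(groups):
    """Traditional estimator (3): the largest sample mean across groups."""
    return max(statistics.fmean(g) for g in groups)


def jackknife_estimate(groups):
    """Jackknife estimator (9), assuming a common group size n.

    Assumption of this sketch: X_{-j} drops the j-th observation from
    every group, so each leave-one-out data set has n - 1 per group.
    """
    n = len(groups[0])
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes equal group sizes")
    groups = [list(g) for g in groups]
    # Leave-one-out values of the traditional estimator.
    loo = [theta_hat([g[:j] + g[j + 1:] for g in groups]) for j in range(n)]
    theta_dot = statistics.fmean(loo)
    return n * theta_hat(groups) - (n - 1) * theta_dot


if __name__ == "__main__":
    print(jackknife_estimate([[0.0, 0.0, 3.0], [1.0, 1.0, 1.0]]))
```

No random numbers are involved, so repeated calls on the same data return the same value.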

It can be shown that the bias of $\widehat{\theta}_{JK}(\bm{X})$ is then of order $1/n^{2}$ (Miller, 1974). A graphical illustration of this bias reduction is provided in Figure 2. Compared with Bootstrap methods, $\widehat{\theta}_{JK}(\bm{X})$ is computationally cheap and deterministic, so its results are exactly reproducible without specifying random seeds.
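The order of the remaining bias follows from (7) and (8). Each leave-one-out estimate uses $n-1$ observations, so $\widehat{\theta}_{(\bullet)}(\bm{X})$ has the bias in (8). A short worked calculation, valid only under the expansion assumed in (7):

```latex
\begin{aligned}
E\left[\widehat{\theta}_{JK}(\bm{X})\right]-\theta_{0}
&= n\left[\frac{a_{1}}{n}+\frac{a_{2}}{n^{2}}\right]
  -(n-1)\left[\frac{a_{1}}{n-1}+\frac{a_{2}}{(n-1)^{2}}\right]
  +\mathcal{O}(n^{-2}) \\
&= \frac{a_{2}}{n}-\frac{a_{2}}{n-1}+\mathcal{O}(n^{-2})
 = -\frac{a_{2}}{n(n-1)}+\mathcal{O}(n^{-2}).
\end{aligned}
```

The $a_{1}$ terms cancel exactly, and the $\theta_{0}$ contributions combine as $n\theta_{0}-(n-1)\theta_{0}=\theta_{0}$, leaving a bias of order $1/n^{2}$.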

3.3 Shrinkage Estimators

Hwang (1993) and Lindley (1962) considered the following shrinkage estimator $\widehat{\theta}_{S}(\bm{X})$ for $\theta_{max}$ to reduce MSE.

\widehat{\theta}_{S}(\bm{X})=C_{+}\widehat{\theta}(\bm{X})+\left(1-C_{+}\right)\widetilde{\theta}(\bm{X}),

(10)

C_{+}=\max(0,C),\quad C=1-\frac{(I-1)\sigma^{2}}{\sum_{i=1}^{I}n_{i}\left[\widetilde{\theta}(\bm{X}_{i})-\widetilde{\theta}(\bm{X})\right]^{2}}.

(11)

Their original estimator is based on a setting of common and known variance $\sigma^{2}$ for each treatment group $i$ in (1). For evaluation in this article, we replace $\sigma^{2}$ in (11) by the average of empirical variance estimators from all $I$ groups. Moreover, the constant $(I-1)$ is modified from $(I-3)$ for $I\geq 4$ in Hwang (1993) and Lindley (1962) to accommodate a general setting with $I\geq 2$ as suggested by Carreras and Brannath (2013). Intuitively, when $\theta_{1},\cdots,\theta_{I}$ are far from each other, $C$ in (11) will be close to $1$ , because $\sum_{i=1}^{I}n_{i}\left[\widetilde{\theta}(\bm{X}_{i})-\widetilde{\theta}(\bm{X})\right]^{2}$ is relatively large. The estimator $\widehat{\theta}_{S}(\bm{X})$ will be close to $\widehat{\theta}(\bm{X})$ with small bias under this setting (Carreras and Brannath, 2013). Otherwise, when $\theta_{1},\cdots,\theta_{I}$ are close to each other, $\widehat{\theta}_{S}(\bm{X})$ will be close to $\widetilde{\theta}(\bm{X})$ as the overall mean. Superior performance of $\widehat{\theta}_{S}(\bm{X})$ in terms of Bayes risk was studied in Hwang (1993) and Carreras and Brannath (2013).
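Equations (10) and (11), with the variance substitution described above, can be sketched as follows. This is an illustration, not the author's code: the function name is hypothetical, the empirical variance is assumed to use the $n_{i}-1$ denominator, and $\widetilde{\theta}(\bm{X})$ is taken as the mean of all pooled observations.

```python
import statistics


def shrinkage_estimate(groups):
    """Shrinkage estimator (10)-(11) for patient-level data.

    Assumptions: sigma^2 is replaced by the average of per-group sample
    variances (n_i - 1 denominator), and the overall mean pools all
    observations across groups.
    """
    means = [statistics.fmean(g) for g in groups]
    sizes = [len(g) for g in groups]
    overall = statistics.fmean([x for g in groups for x in g])
    sigma2 = statistics.fmean([statistics.variance(g) for g in groups])
    spread = sum(n * (m - overall) ** 2 for n, m in zip(sizes, means))
    # If all group means coincide, C tends to minus infinity, so C_+ = 0.
    c = 1.0 - (len(groups) - 1) * sigma2 / spread if spread > 0 else 0.0
    c_plus = max(0.0, c)
    return c_plus * max(means) + (1.0 - c_plus) * overall


if __name__ == "__main__":
    # Well-separated groups: the result stays close to the largest mean.
    print(shrinkage_estimate([[0.0, 2.0], [10.0, 12.0]]))
```

With well-separated group means the weight $C_{+}$ is near $1$ and the estimate stays near the largest sample mean. With similar group means it falls back to the overall mean.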

3.4 Hybrid Estimators based on Double Bootstrap and Shrinkage

The double Bootstrap estimators $\widehat{\theta}^{(2)}_{PB}$ and $\widehat{\theta}^{(2)}_{NB}$ may further reduce estimation bias as compared with their single Bootstrap versions, but with a potential cost of increased MSE. This phenomenon was observed in some previous works (Hsu et al., 1986; MacKinnon and Smith Jr, 1998; Ouysse, 2011), and in our simulation results in Section 4.

In this article, we also consider a natural generalization that combines the double Bootstrap estimators with the shrinkage estimator in Section 3.3, with the goal of balancing reductions in bias and MSE. The proposed hybrid estimators $\widehat{\theta}^{(2)}_{PB,S}$ and $\widehat{\theta}^{(2)}_{NB,S}$ substitute $\widehat{\theta}$ in (10) by $\widehat{\theta}^{(2)}_{PB}$ and $\widehat{\theta}^{(2)}_{NB}$ , respectively.
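The substitution in (10) can be sketched as below. The double Bootstrap estimate `theta2` is assumed to be computed elsewhere, for example following Algorithm 2, and is simply passed in. The helper names are hypothetical, and the variance and overall-mean conventions are assumptions of this sketch.

```python
import statistics


def shrinkage_weight(groups):
    """Return (C_plus, overall mean) from equation (11).

    Assumptions: sigma^2 is the average per-group sample variance and the
    overall mean pools all observations.
    """
    means = [statistics.fmean(g) for g in groups]
    overall = statistics.fmean([x for g in groups for x in g])
    sigma2 = statistics.fmean([statistics.variance(g) for g in groups])
    spread = sum(len(g) * (m - overall) ** 2 for g, m in zip(groups, means))
    c = 1.0 - (len(groups) - 1) * sigma2 / spread if spread > 0 else 0.0
    return max(0.0, c), overall


def hybrid_estimate(groups, theta2):
    """Equation (10) with theta-hat replaced by a double Bootstrap estimate."""
    c_plus, overall = shrinkage_weight(groups)
    return c_plus * theta2 + (1.0 - c_plus) * overall


if __name__ == "__main__":
    # theta2 = 10.0 is a made-up double Bootstrap estimate for illustration.
    print(hybrid_estimate([[0.0, 2.0], [10.0, 12.0]], theta2=10.0))
```

The weight $C_{+}$ is still computed from the observed group means, so only the quantity being shrunk toward the overall mean changes.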

4 Simulation

4.1 Main Study

In this section, we conduct simulations with $I=3$ treatment groups and $n=40$ per group to evaluate the performance of several existing estimators $\widehat{\theta}$ , $\widehat{\theta}_{S}$ (Hwang, 1993; Lindley, 1962; Carreras and Brannath, 2013) and $\widehat{\theta}^{(1)}_{NB}$ (Rosenkranz, 2014) and our proposed estimators $\widehat{\theta}^{(2)}_{NB}$ , $\widehat{\theta}^{(2)}_{NB,S}$ , $\widehat{\theta}^{(1)}_{PB}$ , $\widehat{\theta}^{(2)}_{PB}$ , $\widehat{\theta}^{(2)}_{PB,S}$ and $\widehat{\theta}_{JK}$ . An additional setting of $I=4$ is considered in Table 4. Denote $\bm{\theta}=(\theta_{1},\theta_{2},\theta_{3})$ as the response mean vector, and $\bm{\sigma}=(\sigma_{1},\sigma_{2},\sigma_{3})$ as the standard deviation vector. The number of Bootstrap samples is set at $B=80$ , and the number of simulation iterations is $10,000$ . Discussion on how to choose the value of $B$ is provided in Section 4.2.

We consider the following four simulation scenarios.

• S1: Varying mean vector $\bm{\theta}$ at $(1,1,1)$ , $(1,1,1.2)$ , $(1,1.1,1.2)$ and $(1,1.2,1.2)$ with $\bm{\sigma}=(5,5,5)$ and Normal distribution of $\bm{X}$

• S2: Varying mean vector $\bm{\theta}$ as in S1 but with $\bm{\sigma}=(3,4,5)$ and Normal distribution of $\bm{X}$

• S3: Varying $w$ at $0.1$ , $0.2$ , $0.3$ and $0.5$ with $\bm{\sigma}=(5,5,5)$ , $\bm{\theta}=(1,1.1,1.2)$ and a mixture of Gamma and Normal distributions of $\bm{X}$

• S4: Varying $w$ at $0.1$ , $0.2$ , $0.3$ and $0.5$ with $\bm{\sigma}=(5,5,5)$ , $\bm{\theta}=(1,1.1,1.2)$ and a mixture of Uniform and Normal distributions of $\bm{X}$

S1 considers different values of $\bm{\theta}$ under homogeneous $\bm{\sigma}=(5,5,5)$ , while S2 is for heterogeneous $\bm{\sigma}=(3,4,5)$ . S3 evaluates estimators based on a mixture distribution with $w$ ( $w\in[0,1]$ ) proportion of Gamma distribution and $1-w$ proportion of Normal distribution for each treatment group. S4 studies Uniform distribution as the outlier distribution. The shape and scale parameters of Gamma distributions or the minimum and maximum parameters of Uniform distributions are specified to match the mean and standard deviation for each treatment group. Those Parametric Bootstrap estimators (i.e., $\widehat{\theta}^{(1)}_{PB}$ , $\widehat{\theta}^{(2)}_{PB}$ , $\widehat{\theta}^{(2)}_{PB,S}$ ) still use Normal distribution as the re-sampling assumption. S3 and S4 essentially evaluate the robustness of different estimators based on data sampling distributions deviating from the Normal assumption.
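The moment matching used in S3 and S4 can be sketched as follows. For a Gamma distribution with mean $\mu$ and standard deviation $\sigma$, the shape is $\mu^{2}/\sigma^{2}$ and the scale is $\sigma^{2}/\mu$. For a Uniform distribution the endpoints are $\mu\pm\sqrt{3}\sigma$. The function name is hypothetical, and drawing each observation from the outlier component with probability $w$ is an assumption about how the mixture is formed.

```python
import math
import random


def draw_mixture(n, mean, sd, w, outlier="gamma", rng=random):
    """Draw n values from a two-component mixture with matched moments.

    With probability w a value comes from the outlier distribution
    (Gamma or Uniform), otherwise from Normal(mean, sd). Both components
    have the same mean and standard deviation. Gamma requires mean > 0.
    """
    out = []
    for _ in range(n):
        if rng.random() < w:
            if outlier == "gamma":
                shape = (mean / sd) ** 2   # shape * scale = mean
                scale = sd ** 2 / mean     # shape * scale^2 = sd^2
                out.append(rng.gammavariate(shape, scale))
            else:
                half_width = math.sqrt(3.0) * sd  # Uniform SD = width / sqrt(12)
                out.append(rng.uniform(mean - half_width, mean + half_width))
        else:
            out.append(rng.gauss(mean, sd))
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    # One simulated data set in the style of S3 with w = 0.3.
    groups = [draw_mixture(40, m, 5.0, 0.3, "gamma", rng) for m in (1.0, 1.1, 1.2)]
    print([round(sum(g) / len(g), 2) for g in groups])
```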

Table 2 evaluates the unconditional or marginal bias and MSE of those estimators when estimating $\theta_{max}$ . Among existing estimators, $\widehat{\theta}_{S}$ and $\widehat{\theta}^{(1)}_{NB}$ can generally reduce bias and MSE as compared with the traditional estimator $\widehat{\theta}$ . For our proposed estimators, both $\widehat{\theta}^{(2)}_{PB}$ and $\widehat{\theta}^{(2)}_{NB}$ can substantially reduce estimation bias but with increased MSE. As a better balance between bias reduction and MSE reduction, our hybrid estimators $\widehat{\theta}^{(2)}_{PB,S}$ and $\widehat{\theta}^{(2)}_{NB,S}$ have the smallest MSE, and also smaller bias than the three existing estimators. The Normal assumption for PB is usually reasonable for problems with the response mean as the parameter of interest and a moderate sample size (Efron and Tibshirani, 1994), with supporting results in S3 and S4. The Jackknife estimator $\widehat{\theta}_{JK}$ has a bias similar to that of $\widehat{\theta}^{(1)}_{NB}$ and $\widehat{\theta}^{(1)}_{PB}$ or somewhat smaller, but with increased MSE.

In Table 3, we also evaluate the conditional bias and MSE given that the third treatment group $i=3$ is selected. The true response mean of this group is larger than or equal to those of the other two groups under the four simulation scenarios specified above. The marginal bias and MSE are evaluated under some additional settings with $I=4$ treatment groups in Table 4. Results and conclusions of these two additional analyses are consistent with Table 2.

The overall recommendation is that the double Bootstrap estimators $\widehat{\theta}^{(2)}_{PB}$ and $\widehat{\theta}^{(2)}_{NB}$ can be applied to achieve the smallest bias, but with slightly larger MSE as compared with $\widehat{\theta}$ . The hybrid estimators $\widehat{\theta}^{(2)}_{PB,S}$ and $\widehat{\theta}^{(2)}_{NB,S}$ achieve a better balance between bias reduction and MSE reduction.

		Existing Estimators			Proposed Estimators
Scenario	$\bm{\theta}$	$\widehat{\theta}$	$\widehat{\theta}_{S}$	$\widehat{\theta}^{(1)}_{NB}$	$\widehat{\theta}^{(2)}_{NB}$	$\widehat{\theta}^{(2)}_{NB,S}$	$\widehat{\theta}^{(1)}_{PB}$	$\widehat{\theta}^{(2)}_{PB}$	$\widehat{\theta}^{(2)}_{PB,S}$	$\widehat{\theta}_{JK}$
S1	(1, 1, 1)	0.67 (0.80)	0.18 (0.35)	0.41 (0.65)	0.07 (0.83)	0.14 ( $\underline{0.33}$ )	0.40 (0.65)	$\bm{0.06}$ (0.83)	0.14 ( $\underline{0.33}$ )	0.35 (1.15)
	(1, 1, 1.2)	0.54 (0.64)	0.05 (0.32)	0.27 (0.55)	-0.06 (0.83)	$\bm{0.01}$ ( $\underline{0.31}$ )	0.27 (0.55)	-0.07 (0.84)	$\bm{0.01}$ ( $\underline{0.31}$ )	0.22 (1.07)
	(1, 1.1, 1.2)	0.58 (0.69)	0.09 (0.32)	0.32 (0.58)	$-\bm{0.02}$ (0.83)	0.05 ( $\underline{0.31}$ )	0.31 (0.58)	-0.03 (0.84)	0.05 ( $\underline{0.31}$ )	0.27 (1.09)
	(1, 1.2, 1.2)	0.60 (0.72)	0.11 (0.33)	0.33 (0.60)	$\bm{0.00}$ (0.84)	0.07 (0.32)	0.33 (0.60)	-0.01 (0.85)	0.07 ( $\underline{0.31}$ )	0.27 (1.14)
S2	(1, 1, 1)	0.55 (0.55)	0.26 (0.32)	0.33 (0.44)	$\bm{0.05}$ (0.57)	0.18 ( $\underline{0.29}$ )	0.32 (0.44)	$\bm{0.05}$ (0.58)	0.18 ( $\underline{0.29}$ )	0.28 (0.80)
	(1, 1, 1.2)	0.42 (0.45)	0.13 (0.30)	0.21 (0.41)	-0.06 (0.62)	$\bm{0.05}$ ( $\underline{0.28}$ )	0.20 (0.41)	-0.07 (0.62)	$\bm{0.05}$ (0.29)	0.17 (0.76)
	(1, 1.1, 1.2)	0.46 (0.49)	0.17 (0.31)	0.25 (0.43)	$-\bm{0.02}$ (0.62)	0.09 ( $\underline{0.29}$ )	0.24 (0.43)	-0.03 (0.62)	0.09 ( $\underline{0.29}$ )	0.20 (0.78)
	(1, 1.2, 1.2)	0.49 (0.51)	0.20 (0.32)	0.27 (0.44)	$\bm{0.00}$ (0.62)	0.12 ( $\underline{0.30}$ )	0.27 (0.44)	-0.01 (0.63)	0.11 ( $\underline{0.30}$ )	0.21 (0.83)
Scenario	$w$	$\widehat{\theta}$	$\widehat{\theta}_{S}$	$\widehat{\theta}^{(1)}_{NB}$	$\widehat{\theta}^{(2)}_{NB}$	$\widehat{\theta}^{(2)}_{NB,S}$	$\widehat{\theta}^{(1)}_{PB}$	$\widehat{\theta}^{(2)}_{PB}$	$\widehat{\theta}^{(2)}_{PB,S}$	$\widehat{\theta}_{JK}$
S3	0.1	0.52 (0.56)	0.07 (0.27)	0.27 (0.48)	$-\bm{0.03}$ (0.68)	$\bm{0.03}$ ( $\underline{0.26}$ )	0.27 (0.48)	-0.04 (0.69)	$\bm{0.03}$ ( $\underline{0.26}$ )	0.22 (0.91)
	0.2	0.46 (0.45)	0.05 (0.22)	0.24 (0.39)	-0.04 (0.57)	$\bm{0.02}$ ( $\underline{0.21}$ )	0.23 (0.39)	-0.05 (0.57)	$\bm{0.02}$ ( $\underline{0.21}$ )	0.19 (0.73)
	0.3	0.43 (0.40)	0.06 (0.20)	0.22 (0.35)	-0.03 (0.49)	$\bm{0.02}$ ( $\underline{0.19}$ )	0.22 (0.34)	-0.04 (0.49)	$\bm{0.02}$ ( $\underline{0.19}$ )	0.19 (0.62)
	0.5	0.39 (0.39)	0.07 (0.21)	0.21 (0.33)	$-\bm{0.02}$ (0.45)	$\bm{0.02}$ ( $\underline{0.20}$ )	0.20 (0.32)	-0.03 (0.44)	$\bm{0.02}$ ( $\underline{0.20}$ )	0.16 (0.51)
S4	0.1	0.51 (0.54)	0.06 (0.26)	0.26 (0.46)	-0.04 (0.68)	$\bm{0.03}$ ( $\underline{0.25}$ )	0.26 (0.46)	-0.05 (0.68)	$\bm{0.03}$ ( $\underline{0.25}$ )	0.21 (0.90)
	0.2	0.45 (0.45)	0.05 (0.22)	0.23 (0.38)	-0.04 (0.56)	$\bm{0.01}$ ( $\underline{0.21}$ )	0.23 (0.38)	-0.05 (0.57)	$\bm{0.01}$ ( $\underline{0.21}$ )	0.19 (0.74)
	0.3	0.41 (0.37)	0.04 (0.18)	0.21 (0.32)	-0.05 (0.48)	$\bm{0.01}$ ( $\underline{0.17}$ )	0.20 (0.32)	-0.06 (0.49)	$\bm{0.01}$ ( $\underline{0.17}$ )	0.16 (0.64)
	0.5	0.38 (0.32)	0.03 (0.16)	0.19 (0.27)	-0.05 (0.41)	$\bm{0.00}$ ( $\underline{0.15}$ )	0.18 (0.27)	-0.05 (0.41)	$\bm{0.00}$ ( $\underline{0.15}$ )	0.15 (0.54)

Table 2: Marginal bias and MSE (in parentheses) of three existing estimators and six proposed estimators. Within each row, the bias with the smallest absolute value is in bold, and the smallest MSE is underlined.

		Existing Estimators			Proposed Estimators
Scenario	$\bm{\theta}$	$\widehat{\theta}$	$\widehat{\theta}_{S}$	$\widehat{\theta}^{(1)}_{NB}$	$\widehat{\theta}^{(2)}_{NB}$	$\widehat{\theta}^{(2)}_{NB,S}$	$\widehat{\theta}^{(1)}_{PB}$	$\widehat{\theta}^{(2)}_{PB}$	$\widehat{\theta}^{(2)}_{PB,S}$	$\widehat{\theta}_{JK}$
S1	(1, 1, 1)	0.68 (0.81)	0.19 (0.35)	0.41 (0.65)	$\bm{0.07}$ (0.82)	0.15 ( $\underline{0.33}$ )	0.41 (0.65)	$\bm{0.07}$ (0.84)	0.15 ( $\underline{0.33}$ )	0.36 (1.13)
	(1, 1, 1.2)	0.58 (0.69)	0.08 (0.34)	0.33 (0.60)	0.02 (0.84)	0.03 ( $\underline{0.33}$ )	0.32 (0.60)	$\bm{0.01}$ (0.84)	0.03 (0.34)	0.29 (1.07)
	(1, 1.1, 1.2)	0.62 (0.74)	0.12 (0.34)	0.36 (0.62)	0.05 (0.84)	0.07 ( $\underline{0.33}$ )	0.36 (0.62)	$\bm{0.04}$ (0.85)	0.07 ( $\underline{0.33}$ )	0.32 (1.10)
	(1, 1.2, 1.2)	0.63 (0.76)	0.14 (0.34)	0.37 (0.63)	0.05 (0.85)	0.09 ( $\underline{0.32}$ )	0.37 (0.63)	$\bm{0.04}$ (0.87)	0.09 ( $\underline{0.32}$ )	0.31 (1.15)
S2	(1, 1, 1)	0.71 (0.81)	0.42 (0.49)	0.50 (0.67)	0.23 (0.77)	0.34 ( $\underline{0.44}$ )	0.49 (0.67)	$\bm{0.22}$ (0.78)	0.34 ( $\underline{0.44}$ )	0.44 (1.08)
	(1, 1, 1.2)	0.59 (0.66)	0.30 (0.42)	0.40 (0.59)	0.16 (0.74)	0.22 ( $\underline{0.39}$ )	0.39 (0.59)	$\bm{0.15}$ (0.74)	0.22 (0.40)	0.37 (0.93)
	(1, 1.1, 1.2)	0.64 (0.71)	0.34 (0.44)	0.43 (0.61)	$\bm{0.19}$ (0.75)	0.26 ( $\underline{0.40}$ )	0.43 (0.61)	$\bm{0.19}$ (0.74)	0.26 (0.41)	0.42 (0.92)
	(1, 1.2, 1.2)	0.65 (0.72)	0.35 (0.45)	0.44 (0.61)	0.18 (0.75)	0.27 ( $\underline{0.40}$ )	0.43 (0.62)	$\bm{0.17}$ (0.75)	0.27 ( $\underline{0.40}$ )	0.39 (1.00)
Scenario	$w$	$\widehat{\theta}$	$\widehat{\theta}_{S}$	$\widehat{\theta}^{(1)}_{NB}$	$\widehat{\theta}^{(2)}_{NB}$	$\widehat{\theta}^{(2)}_{NB,S}$	$\widehat{\theta}^{(1)}_{PB}$	$\widehat{\theta}^{(2)}_{PB}$	$\widehat{\theta}^{(2)}_{PB,S}$	$\widehat{\theta}_{JK}$
S3	0.1	0.55 (0.61)	0.10 (0.29)	0.32 (0.52)	0.03 (0.70)	0.06 ( $\underline{0.28}$ )	0.31 (0.52)	$\bm{0.02}$ (0.72)	0.06 ( $\underline{0.28}$ )	0.28 (0.92)
	0.2	0.48 (0.48)	0.07 (0.23)	0.27 (0.41)	0.02 (0.57)	0.03 ( $\underline{0.22}$ )	0.27 (0.41)	$\bm{0.01}$ (0.56)	0.04 (0.23)	0.25 (0.72)
	0.3	0.45 (0.43)	0.08 (0.22)	0.26 (0.37)	0.02 (0.51)	0.04 ( $\underline{0.21}$ )	0.25 (0.37)	$\bm{0.01}$ (0.50)	0.04 ( $\underline{0.21}$ )	0.23 (0.62)
	0.5	0.41 (0.40)	0.11 (0.24)	0.24 (0.34)	0.04 (0.45)	0.06 ( $\underline{0.22}$ )	0.24 (0.34)	$\bm{0.03}$ (0.44)	0.06 ( $\underline{0.22}$ )	0.21 (0.51)
S4	0.1	0.55 (0.59)	0.10 (0.27)	0.31 (0.50)	0.03 (0.69)	0.06 ( $\underline{0.26}$ )	0.31 (0.50)	$\bm{0.02}$ (0.69)	0.06 ( $\underline{0.26}$ )	0.28 (0.92)
	0.2	0.49 (0.49)	0.08 ( $\underline{0.23}$ )	0.28 (0.41)	0.03 (0.57)	0.04 ( $\underline{0.23}$ )	0.28 (0.42)	$\bm{0.02}$ (0.57)	0.04 ( $\underline{0.23}$ )	0.25 (0.74)
	0.3	0.44 (0.40)	0.06 (0.19)	0.25 (0.35)	0.01 (0.48)	0.02 ( $\underline{0.18}$ )	0.25 (0.34)	$\bm{0.00}$ (0.49)	0.02 ( $\underline{0.18}$ )	0.22 (0.63)
	0.5	0.40 (0.34)	0.04 (0.17)	0.22 (0.30)	$\bm{0.00}$ (0.42)	0.01 ( $\underline{0.16}$ )	0.21 (0.30)	-0.01 (0.43)	0.01 ( $\underline{0.16}$ )	0.18 (0.56)

Table 3: Conditional bias and MSE (in parentheses) of three existing estimators and six proposed estimators. Within each row, the bias with the smallest absolute value is in bold, and the smallest MSE is underlined.

	Existing Estimators			Proposed Estimators
$\bm{\theta}$	$\widehat{\theta}$	$\widehat{\theta}_{S}$	$\widehat{\theta}^{(1)}_{NB}$	$\widehat{\theta}^{(2)}_{NB}$	$\widehat{\theta}^{(2)}_{NB,S}$	$\widehat{\theta}^{(1)}_{PB}$	$\widehat{\theta}^{(2)}_{PB}$	$\widehat{\theta}^{(2)}_{PB,S}$	$\widehat{\theta}_{JK}$
(1, 1, 1, 1)	0.67 (0.80)	0.19 (0.30)	0.48 (0.69)	0.07 (0.86)	0.13 (0.27)	0.48 (0.69)	$\bm{0.06}$ (0.87)	0.13 ( $\underline{0.26}$ )	0.41 (1.30)
(1, 1, 1, 1.2)	0.47 (0.57)	0.05 (0.27)	0.34 (0.58)	-0.07 (0.87)	$-\bm{0.02}$ ( $\underline{0.25}$ )	0.34 (0.58)	-0.08 (0.88)	$-\bm{0.02}$ (0.26)	0.26 (1.22)
(1, 1.05, 1.1, 1.2)	0.53 (0.63)	0.09 (0.28)	0.39 (0.61)	- $\bm{0.02}$ (0.85)	$\bm{0.02}$ ( $\underline{0.26}$ )	0.38 (0.61)	-0.03 (0.86)	$\bm{0.02}$ ( $\underline{0.26}$ )	0.32 (1.23)
(1, 1.1, 1.2, 1.2)	0.57 (0.68)	0.12 (0.29)	0.42 (0.64)	0.01 (0.85)	0.05 ( $\underline{0.26}$ )	0.41 (0.64)	$\bm{0.00}$ (0.86)	0.05 ( $\underline{0.26}$ )	0.36 (1.24)

Table 4: Marginal bias (MSE in parentheses) of three existing estimators and six proposed estimators when $I=4$. Within each row, the bias with the smallest absolute value is in bold, and the smallest MSE is underlined.

4.2 The Choice of $B$ and Higher-Order Bootstrap Methods

In this section, we provide some insights and guidance on how to choose the value of $B$ in Bootstrap methods, and on the feasibility of higher-order Bootstrap methods. Under 4 different values of $\bm{\theta}$ in S1, Table 5 evaluates single, double and triple Bootstrap methods for both parametric and nonparametric versions. Because of the computational burden, the triple Bootstrap methods $\widehat{\theta}^{(3)}_{PB}$ and $\widehat{\theta}^{(3)}_{NB}$ are only assessed under $B=80$ and $100$ .

For single and double Bootstrap methods, increasing $B$ from $80$ to $1000$ yields limited improvement in bias and MSE under the scenarios we considered. Therefore, we use $B=80$ in our simulation studies, and $B=1000$ for the real data example in the next section. For other problems, one can run Bootstrap methods with several values of $B$ to find one that gives stable results within a reasonable computation time. Higher-order Bootstrap methods, for example triple Bootstrap, can have bias and MSE that are even worse than those of the single Bootstrap version: in Table 5, the triple Bootstrap bias ranges from $-0.36$ to $-0.51$ with an MSE of about $2$. The high MSE of double Bootstrap estimators is carried over to triple Bootstrap estimators by an additional layer of iteration. Given their even more intensive computation, triple or higher-order Bootstrap methods are not recommended for the settings considered.
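The growth in computation with the Bootstrap order can be illustrated with a minimal Python sketch. It is not the author's R implementation, and it rests on assumptions that are not stated in this section: the naive estimator is the largest observed group mean, group means are resampled from Normal distributions with known standard errors, and the order-$k$ estimator is obtained by applying the usual Bootstrap bias correction (twice the estimate minus its Bootstrap mean) to the order-$(k-1)$ estimator. The function name `iterated_pb`, the `counter` argument, and the input values are hypothetical.

```python
import numpy as np

def iterated_pb(means, ses, order, B, rng, counter):
    """Order-k bias-corrected estimate of the largest group mean (sketch).

    ASSUMPTION: order k applies the correction 2*T - mean(T*) to the
    order-(k-1) estimator; this recursion is an illustrative reading,
    not a definition taken from the article.
    """
    means = np.asarray(means, dtype=float)
    if order == 0:
        return float(means.max())  # naive estimator: largest observed mean
    base = iterated_pb(means, ses, order - 1, B, rng, counter)
    boot = np.empty(B)
    for b in range(B):
        resampled = rng.normal(means, ses)  # one parametric resample of the group means
        counter[0] += 1                     # count every resample that is drawn
        boot[b] = iterated_pb(resampled, ses, order - 1, B, rng, counter)
    return 2.0 * base - float(boot.mean())

# Hypothetical inputs: three groups with equal observed means and SE 0.3.
rng = np.random.default_rng(2024)
for order in range(4):
    counter = [0]
    estimate = iterated_pb([1.0, 1.0, 1.0], [0.3, 0.3, 0.3], order, 10, rng, counter)
    print(order, round(estimate, 3), counter[0])
```

This naive recursion does not reuse any draws, so it generates $(B+1)^{k}-1$ resamples at order $k$; the leading term $B^{k}$ is what makes triple and higher-order Bootstrap expensive. A tiny $B$ is used here only to keep the order-3 call fast, and the printed estimates are correspondingly noisy.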

$\bm{\theta}$	B	$\widehat{\theta}^{(1)}_{PB}$	$\widehat{\theta}^{(2)}_{PB}$	$\widehat{\theta}^{(3)}_{PB}$	$\widehat{\theta}^{(1)}_{NB}$	$\widehat{\theta}^{(2)}_{NB}$	$\widehat{\theta}^{(3)}_{NB}$
(1, 1, 1)	80	0.40 (0.65)	0.06 (0.83)	-0.37 (1.97)	0.40 (0.65)	0.07 (0.83)	-0.36 (1.94)
	100	0.39 (0.63)	0.05 (0.82)	-0.39 (1.95)	0.40 (0.63)	0.06 (0.82)	-0.38 (1.92)
	500	0.41 (0.64)	0.07 (0.81)		0.41 (0.65)	0.08 (0.81)
	1000	0.38 (0.62)	0.04 (0.80)		0.39 (0.63)	0.05 (0.80)
(1, 1, 1.2)	80	0.28 (0.57)	-0.06 (0.84)	-0.49 (2.08)	0.28 (0.57)	-0.05 (0.84)	-0.47 (2.04)
	100	0.27 (0.55)	-0.07 (0.83)	-0.51 (2.07)	0.27 (0.56)	-0.06 (0.82)	-0.49 (2.03)
	500	0.28 (0.56)	-0.06 (0.82)		0.28 (0.57)	-0.05 (0.81)
	1000	0.26 (0.55)	-0.08 (0.82)		0.27 (0.55)	-0.07 (0.81)
(1, 1.1, 1.2)	80	0.31 (0.59)	-0.03 (0.84)	-0.46 (2.06)	0.31 (0.59)	-0.02 (0.84)	-0.45 (2.03)
	100	0.30 (0.57)	-0.04 (0.82)	-0.48 (2.03)	0.30 (0.57)	-0.04 (0.82)	-0.47 (2.00)
	500	0.31 (0.58)	-0.03 (0.81)		0.32 (0.58)	-0.02 (0.81)
	1000	0.29 (0.57)	-0.05 (0.82)		0.30 (0.57)	-0.04 (0.81)
(1, 1.2, 1.2)	80	0.34 (0.61)	0.00 (0.85)	-0.43 (2.04)	0.35 (0.61)	0.02 (0.84)	-0.42 (2.01)
	100	0.33 (0.59)	0.00 (0.82)	-0.44 (2.01)	0.34 (0.60)	0.00 (0.82)	-0.43 (1.97)
	500	0.35 (0.61)	0.01 (0.82)		0.35 (0.61)	0.02 (0.81)
	1000	0.33 (0.59)	-0.01 (0.82)		0.33 (0.59)	0.00 (0.81)

Table 5: Marginal bias (MSE in parentheses) of single, double and triple Bootstrap methods with varying $B$. Triple Bootstrap methods are only assessed under $B=80$ and $100$.

5 Real Data Example

AWARD-5 was an adaptive, dose-finding, seamless Phase 2/3 study of dulaglutide for the treatment of type 2 diabetes mellitus (Geiger et al., 2012). The study had a dose-finding portion (Stage 1) with Bayesian response adaptive randomization to evaluate 7 dulaglutide doses and a fixed scheme (Stage 2) to confirm findings of 2 selected doses (0.75 mg and 1.5 mg) (Skrivanek et al., 2014). The adaptive randomization at Stage 1 and the dose selection at the end of Stage 1 were informed by a clinical utility index (CUI), a single metric that reflects four prespecified safety and efficacy response measures (Geiger et al., 2012). Sample size re-estimation was also performed for Stage 2 based on the data from Stage 1 (Geiger et al., 2012; Skrivanek et al., 2014; ClinicalTrials.gov, 2015).

For illustration purposes, we consider a simplified problem that treats Stage 1 as a previous Phase 2 study and Stage 2 as the new Phase 3 study. Our goal is to accurately estimate the response mean of the selected group, dulaglutide 1.5 mg, from the Stage 1 results in order to plan its sample size for Stage 2. The dosing regimen dulaglutide 1.5 mg was selected as the most efficacious group for further testing in Stage 2 during the actual trial conduct of AWARD-5 (Skrivanek et al., 2014). Assessments are based on the primary efficacy endpoint of change from Baseline (CHG) in glycosylated hemoglobin (HbA1c) at Week 52. For notational consistency, we use the negative of CHG (decrease in HbA1c), so that a larger value denotes a better response. Table 6 summarizes the response mean (based on the Bayesian posterior mean), the sample size, and the standard deviation (based on a Normal approximation of the Bayesian $95\%$ credible intervals) for each of the 7 active treatment groups in Stage 1 of dose selection (Skrivanek et al., 2014). Since the publicly available results are only summary statistics by group, as in Table 6, we apply the single PB $\widehat{\theta}^{(1)}_{PB}$ , the double PB $\widehat{\theta}^{(2)}_{PB}$ , and the hybrid estimator $\widehat{\theta}^{(2)}_{PB,S}$ to estimate the response mean of 1.5 mg for sample size re-assessment in Stage 2. The Bootstrap sample size is $B=1000$ , and therefore $\widehat{\theta}^{(1)}_{PB}$ , $\widehat{\theta}^{(2)}_{PB}$ and $\widehat{\theta}^{(2)}_{PB,S}$ require $10^{3}$ , $10^{6}$ and $10^{6}$ resamples, respectively.

The traditional estimator $\widehat{\theta}$ in (3) is $1.33$ , the maximum of the $-$CHG values across the 7 active treatment groups. Our three PB estimators are $\widehat{\theta}^{(1)}_{PB}=1.28$ , $\widehat{\theta}^{(2)}_{PB}=1.20$ and $\widehat{\theta}^{(2)}_{PB,S}=1.16$ , with computational times on a standard laptop of $0.04$ seconds, $23.1$ seconds and $23.1$ seconds, respectively. These results are consistent with the simulation results in Section 4, where $\widehat{\theta}$ has a large positive estimation bias and $\widehat{\theta}^{(1)}_{PB}$ has a moderate positive bias, while $\widehat{\theta}^{(2)}_{PB}$ and $\widehat{\theta}^{(2)}_{PB,S}$ have biases close to zero.
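To make the use of summary statistics concrete, the following Python sketch applies a single and a double parametric Bootstrap correction to the group-level values reported in Table 6. It is an illustration under stated assumptions, not the author's R code: the standard error of each group mean is taken to be SD$/\sqrt{n}$, group means are resampled from Normal distributions with those standard errors held fixed, and the double correction is obtained by bias-correcting the single-corrected estimator. The function name `pb_single_double` and the seed are hypothetical, and the hybrid estimator is not covered because its shrinkage step is defined elsewhere in the article.

```python
import numpy as np

# Summary statistics by dose group, copied from Table 6 (-CHG of HbA1c at Week 52).
means = np.array([0.82, 0.95, 0.93, 1.00, 1.33, 1.28, 1.00])
n = np.array([13, 16, 20, 8, 18, 24, 10])
sd = np.array([0.55, 0.42, 0.59, 0.40, 0.67, 0.49, 0.42])
se = sd / np.sqrt(n)  # ASSUMPTION: standard error of each group mean is SD / sqrt(n)

def pb_single_double(means, se, B=1000, seed=0):
    """Single and double parametric Bootstrap corrections of max(means) (sketch)."""
    rng = np.random.default_rng(seed)
    naive = means.max()                                     # largest observed mean
    # Level 1: B resampled vectors of group means around the observed means.
    outer = rng.normal(means, se, size=(B, means.size))
    # Level 2: B further resamples around each level-1 vector (B * B in total).
    inner = rng.normal(outer[:, None, :], se, size=(B, B, means.size))
    m1 = outer.max(axis=1).mean()                           # Bootstrap mean of the maximum
    m2 = inner.max(axis=2).mean()                           # second-level Bootstrap mean
    single = 2.0 * naive - m1                               # bias-corrected once
    # ASSUMPTION: double = 2 * single - E*[single*], which expands to the line below.
    double = 4.0 * naive - 4.0 * m1 + m2
    return naive, single, double

naive, single, double = pb_single_double(means, se, B=1000)
print(f"naive={naive:.2f}  single PB={single:.2f}  double PB={double:.2f}")
```

Only group means, sample sizes and SDs enter the computation, which is why PB methods remain usable when patient-level data are unavailable. The second level draws $B^{2}=10^{6}$ resampled mean vectors for $B=1000$ , in line with the resample counts given above. Results vary slightly with the seed and depend on the assumptions listed.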

Group	-CHG of HbA1c at Week 52	n	SD
Dulaglutide 0.25 mg	0.82	13	0.55
Dulaglutide 0.5 mg	0.95	16	0.42
Dulaglutide 0.75 mg	0.93	20	0.59
Dulaglutide 1 mg	1.00	8	0.40
Dulaglutide 1.5 mg	1.33	18	0.67
Dulaglutide 2 mg	1.28	24	0.49
Dulaglutide 3 mg	1.00	10	0.42

Estimator	Value
$\widehat{\theta}$	$1.33$
$\widehat{\theta}^{(1)}_{PB}$	$1.28$
$\widehat{\theta}^{(2)}_{PB}$	$1.20$
$\widehat{\theta}^{(2)}_{PB,S}$	$1.16$

Table 6: Summary statistics of Stage 1 are based on Bayesian posterior means and $95\%$ credible intervals of CHG of HbA1c at Week 52 (Skrivanek et al., 2014). The standard deviations (SD) are computed by Normal approximation of the $95\%$ credible intervals. Values of 4 estimators are presented.

6 Discussion

We summarize several attributes of the computational methods in Table 7. PB methods can use either patient-level data or summary results by treatment group, while NB methods and the Jackknife need patient-level data. The Jackknife method requires the least computational resources and can give exact results. As compared with PB, NB does not necessarily require specific distribution assumptions. Based on simulation studies with outliers in Section 4, PB with the Normal assumption has a satisfactory performance when inferring response means of continuous endpoints. Bootstrap methods are capable of conducting a second-order resampling to decrease bias. However, double Bootstrap and Jackknife are observed to have larger MSEs than the traditional estimator $\widehat{\theta}$ based on results in Section 4. Generally speaking, the reduction in bias may be outweighed by an increase in variance, which results in a larger MSE (Efron and Tibshirani, 1994). The Jackknife method makes a linear approximation to the Bootstrap method, and can be inefficient for nonlinear functions (Efron and Tibshirani, 1994). These arguments may explain why both Bootstrap and Jackknife methods can correct bias but with larger MSE than $\widehat{\theta}$ , and why the Jackknife method usually has the largest MSE. Hybrid estimators based on double Bootstrap and shrinkage can reduce both bias and MSE. Therefore, our overall recommendation is to implement PB methods if only summary results are available, to choose NB methods when patient-level data are available, and, within either family, to prefer the hybrid estimators to balance reductions in both bias and MSE.
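The trade-off between bias and variance described above can be made concrete with the standard decomposition of MSE, which is a textbook identity rather than a result of this article:

$$\text{MSE}(\widehat{\theta}) = \text{Bias}(\widehat{\theta})^{2} + \text{Var}(\widehat{\theta}).$$

As a worked check, take the first row of Table 4, where all group means equal $1$ so that the estimation target is fixed, and back out the implied variance from the reported bias and MSE:

$$\begin{aligned}
\widehat{\theta}&: & 0.80 - 0.67^{2} &\approx 0.35, \\
\widehat{\theta}^{(2)}_{PB}&: & 0.87 - 0.06^{2} &\approx 0.87, \\
\widehat{\theta}^{(2)}_{PB,S}&: & 0.26 - 0.13^{2} &\approx 0.24, \\
\widehat{\theta}_{JK}&: & 1.30 - 0.41^{2} &\approx 1.13.
\end{aligned}$$

Because the table entries are rounded to two decimals, these variances are approximate. They suggest that the double PB removes almost all of the squared bias (from about $0.45$ to below $0.01$) while its variance more than doubles, whereas the hybrid estimator retains a small bias but has the smallest variance of the four.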

The last row of Table 7 summarizes the limitations of our computational methods for bias correction. Single Bootstrap methods have moderate or limited bias reduction, while double Bootstrap methods have increased MSE and require intensive computation. Hybrid estimators also require double Bootstrap with heavy computation. The Jackknife method offers limited bias reduction and comes with increased MSE.

This article is not intended to completely fill the evidence gap between previous studies and Phase 3 studies. Under a basic scenario where efficacy profiles of the selected group(s) are the same between studies, we show that our computational methods can characterize the efficacy more accurately than the common practice. Compared with some previous works, our methods have a broader scope, covering multiple groups for selection, and the flexibility to accommodate summary results. On top of this framework, additional layers can be added to accommodate more complicated problems, e.g., temporal drift and patient heterogeneity. Our framework can be integrated into MCP-Mod (Bretz et al., 2005) to accommodate the dose-ranging part of previous studies. The proposed methods can also be broadly applied to other settings, e.g., response-adaptive randomization designs with multiple active treatment groups, patient enrichment designs, and other general selection problems.

In this article, we consider continuous endpoints for illustration. The methods can be generalized to binary endpoints and time-to-event endpoints. Other directions for future work include targeting treatment differences by adjusting for the placebo effect, regression models to accommodate covariates, and improved computational methods to reduce the burden of higher-order Bootstrap methods.

	Parametric Bootstrap			Nonparametric Bootstrap			Jackknife
	Single	Double	Hybrid	Single	Double	Hybrid
Notation	$\widehat{\theta}^{(1)}_{PB}$	$\widehat{\theta}^{(2)}_{PB}$	$\widehat{\theta}^{(2)}_{PB,S}$	$\widehat{\theta}^{(1)}_{NB}$	$\widehat{\theta}^{(2)}_{NB}$	$\widehat{\theta}^{(2)}_{NB,S}$	$\widehat{\theta}_{JK}$
Data source	Subject level data or summary statistics by groups			Subject level data
Number of sampling iterations	$B$	$B^{2}$	$B^{2}$	$B$	$B^{2}$	$B^{2}$	$n$
Features	Can utilize summary statistics; Double Bootstrap or hybrid with shrinkage to increase precision			Free of distribution assumptions; Double Bootstrap or hybrid with shrinkage to increase precision			Exact results; Less computationally intensive
Limitations	Moderate bias reduction	Increased MSE; Intensive computation	Intensive computation	Moderate bias reduction	Increased MSE; Intensive computation	Intensive computation	Moderate bias reduction; Increased MSE

Table 7: Summary table of computational methods for bias correction

Acknowledgements

The author thanks the Editor, the Associate Editor and two reviewers for their insightful comments. This manuscript was supported by AbbVie Inc. AbbVie participated in the review and approval of the content. Tianyu Zhan is employed by AbbVie Inc., and may own AbbVie stock.

Supplementary materials

The R code to replicate results in Sections 4 and 5 is available on GitHub at https://github.com/tian-yu-zhan/Bias_Reduction. Data sharing is not applicable, because simulation is based on results from literature.

References

Bauer et al. (2010) Bauer, P., F. Koenig, W. Brannath, and M. Posch (2010). Selection and bias—two hostile brothers. Statistics in Medicine 29(1), 1–13.
Blumenthal and Cohen (1968) Blumenthal, S. and A. Cohen (1968). Estimation of the larger of two normal means. Journal of the American Statistical Association 63(323), 861–876.
Bretz et al. (2005) Bretz, F., J. C. Pinheiro, and M. Branson (2005). Combining multiple comparisons and modeling techniques in dose-response studies. Biometrics 61(3), 738–748.
Carreras and Brannath (2013) Carreras, M. and W. Brannath (2013). Shrinkage estimation in two-stage adaptive designs with midtrial treatment selection. Statistics in Medicine 32(10), 1677–1690.
ClinicalTrials.gov (2015) ClinicalTrials.gov (2015). A Study of LY2189265 Compared to Sitagliptin in Participants With Type 2 Diabetes Mellitus on Metformin. https://clinicaltrials.gov/ct2/show/NCT00734474.
Dahiya (1974) Dahiya, R. C. (1974). Estimation of the mean of the selected population. Journal of the American Statistical Association 69(345), 226–230.
Davison and Hinkley (1997) Davison, A. C. and D. V. Hinkley (1997). Bootstrap methods and their application. Cambridge University Press.
Efron and Tibshirani (1994) Efron, B. and R. J. Tibshirani (1994). An introduction to the bootstrap. CRC Press.
Food and Drug Administration (2017) Food and Drug Administration (2017). 22 Case Studies Where Phase 2 and Phase 3 Trials Had Divergent Results. https://www.fda.gov/media/102332/download.
Geiger et al. (2012) Geiger, M. J., Z. Skrivanek, B. Gaydos, J. Chien, S. Berry, D. Berry, and J. H. Anderson Jr (2012). An adaptive, dose-finding, seamless phase 2/3 study of a long-acting glucagon-like peptide-1 analog (dulaglutide): trial design and baseline characteristics. Journal of Diabetes Science and Technology 6(6), 1319–1327.
Hsieh (1981) Hsieh, H.-K. (1981). On estimating the mean of the selected population with unknown variance. Communications in Statistics-Theory and Methods 10(18), 1869–1878.
Hsu et al. (1986) Hsu, Y.-S., K.-N. Lau, H.-G. Fung, and E. F. Ulveling (1986). Monte Carlo studies on the effectiveness of the bootstrap bias reduction method on 2SLS estimates. Economics Letters 20(3), 233–239.
Hwang (1993) Hwang, J. T. (1993). Empirical bayes estimation for the means of the selected populations. Sankhyā: The Indian Journal of Statistics, Series A, 285–304.
ICH Guideline E8 (2022) ICH Guideline E8 (2022). ICH guideline E8 (R1) on general considerations for clinical studies. https://www.ema.europa.eu/en/documents/scientific-guideline/ich-e-8-general-considerations-clinical-trials-step-5_en.pdf.
Ishwaei D et al. (1985) Ishwaei D, B., D. Shabma, and K. Krishnamoorthy (1985). Non-existence of unbiased estimators of ordered parameters. Statistics: A Journal of Theoretical and Applied Statistics 16(1), 89–95.
Kosmidis (2014) Kosmidis, I. (2014). Bias in parametric estimation: reduction and useful side-effects. Wiley Interdisciplinary Reviews: Computational Statistics 6(3), 185–196.
Kumar and Sharma (1993) Kumar, S. and D. Sharma (1993). Unbiased inestimability of the larger of two parameters. Statistics: A Journal of Theoretical and Applied Statistics 24(2), 137–142.
Liang et al. (2019) Liang, F., Z. Wu, M. Mo, C. Zhou, J. Shen, Z. Wang, and Y. Zheng (2019). Comparison of treatment effect from randomised controlled phase II trials and subsequent phase III trials using identical regimens in the same treatment setting. European Journal of Cancer 121, 19–28.
Lindley (1962) Lindley, D.-V. (1962). Discussion of Professor Stein’s paper “Confidence sets for the mean of a multivariate normal distribution”. Journal of the Royal Statistical Society, Series B 24, 285–287.
MacKinnon and Smith Jr (1998) MacKinnon, J. G. and A. A. Smith Jr (1998). Approximate bias correction in econometrics. Journal of Econometrics 85(2), 205–230.
Miller (1974) Miller, R. G. (1974). The jackknife-a review. Biometrika 61(1), 1–15.
Ouysse (2011) Ouysse, R. (2011). Computationally efficient approximation for the double bootstrap mean bias correction. Economics Bulletin 31(3), 2388–2403.
Quenouille (1949) Quenouille, M. H. (1949). Approximate tests of correlation in time-series 3. Mathematical Proceedings of the Cambridge Philosophical Society 45, 483–484.
Robertson et al. (2023a) Robertson, D. S., B. Choodari-Oskooei, M. Dimairo, L. Flight, P. Pallmann, and T. Jaki (2023a). Point estimation for adaptive trial designs I: A methodological review. Statistics in Medicine 42(2), 122–145.
Robertson et al. (2023b) Robertson, D. S., B. Choodari-Oskooei, M. Dimairo, L. Flight, P. Pallmann, and T. Jaki (2023b). Point estimation for adaptive trial designs II: Practical considerations and guidance. Statistics in Medicine.
Rosenkranz (2014) Rosenkranz, G. K. (2014). Bootstrap corrections of treatment effect estimates following selection. Computational Statistics & Data Analysis 69, 220–227.
Saville et al. (2022) Saville, B. R., D. A. Berry, N. S. Berry, K. Viele, and S. M. Berry (2022). The Bayesian time machine: accounting for temporal drift in multi-arm platform trials. Clinical Trials 19(5), 490–501.
Skrivanek et al. (2014) Skrivanek, Z., B. Gaydos, J. Chien, M. Geiger, M. Heathman, S. Berry, J. Anderson, T. Forst, Z. Milicevic, and D. Berry (2014). Dose-finding results in an adaptive, seamless, randomized trial of once-weekly dulaglutide combined with metformin in type 2 diabetes patients (AWARD-5). Diabetes, Obesity and Metabolism 16(8), 748–756.
Stallard and Todd (2005) Stallard, N. and S. Todd (2005). Point estimates and confidence regions for sequential trials involving selection. Journal of Statistical Planning and Inference 135(2), 402–419.
Vellaisamy and Sharma (1988) Vellaisamy, P. and D. Sharma (1988). Estimation of the mean of the selected gamma population. Communications in Statistics-Theory and Methods 17(8), 2797–2817.
Whitehead (1986) Whitehead, J. (1986). On the bias of maximum likelihood estimation following a sequential test. Biometrika 73(3), 573–581.
Zia et al. (2005) Zia, M. I., L. L. Siu, G. R. Pond, and E. X. Chen (2005). Comparison of outcomes of phase II studies and subsequent randomized control studies using identical chemotherapeutic regimens. Journal of Clinical Oncology 23(28), 6982–6991.
