Causal K-Means Clustering

Acknowledgments: We thank Larry Wasserman for helpful discussions and comments. Part of this work was done while Kwangho Kim was a PhD student at Carnegie Mellon University.

Kwangho Kim, Jisu Kim, Edward H. Kennedy

Department of Statistics, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul 02841, Korea; email: [email protected].
Department of Statistics, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Korea; email: [email protected].
Department of Statistics and Data Science, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213, USA; email: [email protected].

Abstract

Causal effects are often characterized with population summaries. These might provide an incomplete picture when there are heterogeneous treatment effects across subgroups. Since the subgroup structure is typically unknown, it is more challenging to identify and evaluate subgroup effects than population effects. We propose a new solution to this problem: Causal k-Means Clustering, which harnesses the widely-used k-means clustering algorithm to uncover the unknown subgroup structure. Our problem differs significantly from the conventional clustering setup since the variables to be clustered are unknown counterfactual functions. We present a plug-in estimator which is simple and readily implementable using off-the-shelf algorithms, and study its rate of convergence. We also develop a new bias-corrected estimator based on nonparametric efficiency theory and double machine learning, and show that this estimator achieves fast root-n rates and asymptotic normality in large nonparametric models. Our proposed methods are especially useful for modern outcome-wide studies with multiple treatment levels. Further, our framework is extensible to clustering with generic pseudo-outcomes, such as partially observed outcomes or otherwise unknown functions. Finally, we explore finite sample properties via simulation, and illustrate the proposed methods in a study of treatment programs for adolescent substance abuse.


Keywords: Causal inference; Heterogeneous treatment effect; Personalization; Subgroup analysis; Observational studies

1 Introduction

1.1 Heterogeneity in Treatment Effects

Statistical causal inference is all about estimating what would happen to some response when a “cause” of interest is changed or intervened upon. In causal inference, the average treatment effect (ATE) has regularly emerged as one of the most sought-after effects to measure. For a binary treatment $A \in \{0,1\}$, the ATE is defined by

$$\mathbb{E}(Y^{1} - Y^{0}), \tag{1}$$

where $Y^{a}$ is the potential outcome that would have been observed under treatment $A = a$ (Rubin 1974). There has been extensive work on efficient and flexible estimation of the ATE and its analogs (Van der Laan et al. 2003; Chernozhukov et al. 2016; Kennedy 2022).

However, the effect of treatment often varies across subgroups, both in terms of magnitude and direction. Certain subgroups may experience larger effects than others. A treatment could even benefit certain subgroups while harming others. A potential shortcoming of the ATE is that it can mask this effect heterogeneity. Identifying treatment effect heterogeneity and corresponding subgroups plays an essential role in a variety of fields, including policy evaluation, drug development, and health care, and has sparked growing interest. For example, patients with different subtypes of cancer often react differently to the same treatment; however, our understanding of cancer subtypes at the molecular level is limited, and there is little consensus about which treatments are most effective for which patients (Kravitz et al. 2004; Hayden 2009). Typically, the functional form of the relationship between treatment effects and unit attributes is unknown a priori, so such effect heterogeneity has to be explored using data-driven methods. Despite much recent work in this area, there are still many unsolved problems, and it has not been studied as extensively as other branches of causal inference (Kennedy 2023).

To better understand treatment effect heterogeneity, investigators often aim to estimate the conditional average treatment effect (CATE):

$$\tau(X) = \mathbb{E}[Y^{1} - Y^{0} \mid X], \tag{2}$$

where $X \in \mathcal{X}$ is a vector of observed covariates. The CATE offers the potential to personalize causal effects by making them specific to each individual’s characteristics. Many methods have been proposed for CATE estimation, with a focus in recent years on leveraging the benefits of machine learning. For example, van der Laan & Luedtke (2015) developed a framework for loss-based super learning. Athey & Imbens (2016) and Zhang et al. (2017) proposed a recursive partitioning approach. Foster et al. (2011) and Wager & Athey (2018) adopted random forests, while Imai et al. (2013) adopted support vector machine classifiers. Grimmer et al. (2017) proposed a weighted ensemble approach. Shalit et al. (2017) developed a neural network architecture based on integral probability metrics. Künzel et al. (2017) presented a meta-algorithm with a particular focus on unbalanced designs. Nie & Wager (2021) gave a novel adaptation of RKHS regression methods and studied conditions for oracle efficiency. Kennedy (2023) provided generic model-free error bounds and presented an algorithm achieving the fastest possible convergence rates under smoothness assumptions.

1.2 Understanding Heterogeneity via Cluster Analysis

In contrast to earlier work, which has focused on supervised learning methodologies, we consider analyzing treatment effect heterogeneity from an unsupervised learning perspective. We develop Causal Clustering, a new technique for exploring heterogeneous treatment effects leveraging tools from cluster analysis. We aim to understand the structure of effect heterogeneity by identifying underlying subgroups as clusters. Our work is therefore more descriptive and discovery-based, and fills an important gap in the literature.

We illustrate the idea of causal clustering through the case of binary treatments in Figure LABEL:fig:causal-cluster-illustration. We generate a sample where a projection $(\mathbb{E}[Y^{0} \mid X], \mathbb{E}[Y^{1} \mid X])$ of each observation is drawn from a mixture of six Gaussian distributions with different means and covariance functions, with the overall ATE set to zero. By construction, there are six clusters, with units within each cluster being more homogeneous in terms of the CATE. When it comes to analyzing the heterogeneity of treatment effects, people often rely on the histogram of the CATE as in Figure LABEL:fig:causal-cluster-illustration-(c). However, in this case, the histogram fails to reveal the details about the true subgroup structure. By adapting the idea of cluster analysis, we aim to uncover clusters with markedly different responses to a given treatment than the rest, while maintaining a high degree of homogeneity within each cluster, as shown in Figure LABEL:fig:causal-cluster-illustration-(b). This allows for an interesting new study of subgroup structure; as far as we know, clustering methods have yet to be employed in causal inference or heterogeneous effects problems.

Our problem differs significantly from the conventional clustering setup since the variable to be clustered consists of unknown functions (i.e., potential outcome regression functions) that must be estimated. Clustering with these unknown “pseudo-outcomes” has not received as much attention as clustering on standard fully observed data, despite its importance. Some previous work considered cluster analysis using partially observed outcomes, yet still in a vector form with fixed dimensions. For example, Serafini et al. (2020) explored missing data problems in clustering, and Haviland et al. (2011) studied group-based trajectory modeling with non-random dropouts. Su et al. (2018) considered clustering with measurement errors. In a similar context, Kumar & Patel (2007) considered clustering on unknown model parameters, though without theoretical analysis. To the best of our knowledge, none of the existing methods in clustering literature have considered nonparametric approaches to clustering with unknown functions. In our analysis, we show that if the nuisance estimation error with respect to those unknown functionals is sufficiently small, then the excess clustering risk is near zero. In this sense, our work is in a similar spirit to the classification versus regression distinction in statistical learning (Devroye et al. 2013, Theorem 2.2).

In addition to the existing supervised-learning based approaches, our framework offers a complementary tool for identifying subgroups that substantially differ from each other. Our proposed methods are particularly useful in outcome-wide studies with multiple treatment levels (VanderWeele 2017; VanderWeele et al. 2016); instead of probing a high-dimensional CATE surface to assess the subgroup structure, one may attempt to uncover lower-dimensional clusters with similar responses to a given treatment set.

The remainder of the paper is structured as follows. In Section 2, we formalize the idea of causal clustering based on the k-means algorithm. In Section 3, we present a plug-in estimator, which is simple and readily implementable yet will in general not be $\sqrt{n}$-consistent. In Section 4, we develop an efficient bias-corrected estimator for k-means causal clustering under a margin condition, which attains fast $\sqrt{n}$ rates and asymptotic normality under weak nonparametric conditions. In Section 5, we illustrate our approach using simulations and real data on effects of treatment programs for substance abuse. Section 6 concludes with a discussion.

2 Setup and estimands

Consider a random sample $(Z_{1}, \ldots, Z_{n})$ of $n$ tuples $Z = (Y, A, X) \sim \mathbb{P}$, where $Y \in \mathbb{R}$ represents the outcome, $A \in \mathcal{A} = \{1, \ldots, p\}$ denotes an intervention, and $X \in \mathcal{X} \subseteq \mathbb{R}^{d}$ comprises observed covariates. For simplicity, we focus on univariate outcomes, although our methodology can be easily extended to multivariate outcomes. Throughout, we rely on the following widely-used identification assumptions (e.g., Imbens & Rubin 2015, Chapter 12):

Assumption C1 (consistency).

$Y = Y^{a}$ if $A = a$.

Assumption C2 (no unmeasured confounding).

$A \perp\!\!\!\perp Y^{a} \mid X$.

Assumption C3 (positivity).

$\mathbb{P}(A = a \mid X)$ is bounded away from 0 a.s. $[\mathbb{P}]$.

For $a \in \mathcal{A}$, let the outcome regression function be denoted by

$$\mu_{a}(X) \equiv \mathbb{E}(Y^{a} \mid X) = \mathbb{E}(Y \mid X, A = a).$$

For all $a, a' \in \mathcal{A}$, one may define the pairwise CATE by

$$\begin{aligned}
\tau_{aa'}(X) &\equiv \mathbb{E}(Y \mid X, A = a) - \mathbb{E}(Y \mid X, A = a') \\
&= \mu_{a}(X) - \mu_{a'}(X).
\end{aligned} \tag{3}$$

Then, we define the conditional counterfactual mean vector $\mu : \mathcal{X} \to \mathbb{R}^{p}$ as

$$\mu(X) = \left[\mathbb{E}(Y^{1} \mid X), \ldots, \mathbb{E}(Y^{p} \mid X)\right]^{\top}. \tag{4}$$

If all coordinates of a point $\mu(X)$ were the same, there would be no treatment effect on the conditional mean scale. Also, adjacent units in the conditional counterfactual mean vector space would have similar responses to a given set of treatments, since for two units $i, j$,

$$\mu(X_{i}) \approx \mu(X_{j}) \Rightarrow \tau_{aa'}(X_{i}) \approx \tau_{aa'}(X_{j}) \quad \text{for all } a, a' \in \mathcal{A}.$$

This provides vital motivation for uncovering subgroup structure via cluster analysis on projections of a sample onto the conditional counterfactual mean vector space. Crucially, standard clustering theory is limited here since the variable to be clustered is $\mu$, a collection of the unknown regression functions, which themselves have to be estimated.
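As a brief check of the implication above, one elementary way to quantify it (a bound we sketch here for illustration, not a result stated elsewhere in the paper) uses the triangle inequality followed by the Cauchy-Schwarz inequality. For any $a \neq a'$ and two units $i, j$:

```latex
\begin{align*}
|\tau_{aa'}(X_i) - \tau_{aa'}(X_j)|
  &= \left| \{\mu_a(X_i) - \mu_a(X_j)\} - \{\mu_{a'}(X_i) - \mu_{a'}(X_j)\} \right| \\
  &\le |\mu_a(X_i) - \mu_a(X_j)| + |\mu_{a'}(X_i) - \mu_{a'}(X_j)| \\
  &\le \sqrt{2}\, \|\mu(X_i) - \mu(X_j)\|_2 .
\end{align*}
```

Hence units that are within Euclidean distance $\varepsilon$ in the conditional counterfactual mean vector space have every pairwise CATE within $\sqrt{2}\,\varepsilon$ of each other.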

In this work, we propose a novel method, k-means causal clustering. k-means (also known as vector quantization) is one of the oldest and most popular clustering algorithms, having originated in signal processing. It works by finding $k$ representative points (or cluster centers) which define a Voronoi tessellation. There has been a substantial amount of research on $k$-means clustering (for a review, see Jain (2010) or the monograph of Graf & Luschgy (2007)). It is one of the few clustering methods whose theoretical properties are rather well understood, as the analysis is relatable to principal components analysis (Ding & He 2004).

We call a set of $k$ representative points a codebook $C = \{c_{1}, \ldots, c_{k}\}$, where each $c_{j} \in \mathbb{R}^{p}$. Let $\Pi_{C}(x)$ be the projection of $x \in \mathbb{R}^{p}$ onto $C$:

$$\Pi_{C}(x) = \underset{c \in C}{\mathrm{argmin}}\ \|c - x\|_{2}^{2}.$$

Then we define the population clustering risk $R(C)$ with respect to $\mu$ by

$$R(C) = \mathbb{E}\|\mu - \Pi_{C}(\mu)\|_{2}^{2}, \tag{5}$$

and the corresponding optimal codebook $C^{*}$ by

$$C^{*} = \underset{C \in \mathcal{C}_{k}}{\mathrm{argmin}}\ R(C), \tag{6}$$

where $\mathcal{C}_{k}$ denotes all codebooks of length $k$ in the image of $\mu$ defined in (4). When $C$ is fixed, the population clustering risk (5) can be viewed as a real-valued functional on a nonparametric model. Importantly, $R(C)$ is a non-smooth functional of the observed data distribution, so the standard semiparametric efficiency theory does not immediately apply. In Section 4, we shall propose an efficient estimator for $R(C^{*})$ under a margin condition.
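To make the projection $\Pi_{C}$ and the clustering risk concrete, the following is a minimal sketch in plain Python that evaluates a sample analog of (5) for a fixed codebook. The function names (`sq_dist`, `project`, `empirical_risk`) and the toy values of $\mu(X_i)$ are illustrative assumptions, not part of the paper; in practice the $\mu(X_i)$ are unknown and must be estimated.

```python
from typing import Sequence

Vector = Sequence[float]


def sq_dist(u: Vector, v: Vector) -> float:
    """Squared Euclidean distance ||u - v||_2^2."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def project(x: Vector, codebook: Sequence[Vector]) -> Vector:
    """Pi_C(x): the code point in C closest to x (ties go to the first one)."""
    return min(codebook, key=lambda c: sq_dist(c, x))


def empirical_risk(points: Sequence[Vector], codebook: Sequence[Vector]) -> float:
    """Sample average of ||x - Pi_C(x)||_2^2 over the given points."""
    return sum(sq_dist(x, project(x, codebook)) for x in points) / len(points)


# Hypothetical values of mu(X_i) = (mu_1(X_i), mu_2(X_i)) for four units.
mu_values = [(0.0, 0.0), (0.0, 2.0), (4.0, 4.0), (4.0, 6.0)]
codebook = [(0.0, 1.0), (4.0, 5.0)]

# Every unit is at squared distance 1 from its nearest code point.
risk = empirical_risk(mu_values, codebook)
```

The `min` over code points is what makes the risk non-smooth in $\mu$: an arbitrarily small change in a point near a Voronoi boundary can switch which code point it is projected onto.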

The conditional counterfactual mean vector in (4) can be easily tailored for a specific use through reparametrization without compromising our subsequent results. With $\mathcal{A} = \{0, 1\}$, for instance, one may consider $\mu = (\mu_{0}, \mu_{1} - \mu_{0})$ with $A = 0$ untreated and $\mu_{0}$ as a baseline risk instead of $\mu = (\mu_{0}, \mu_{1})$. This may be more useful for exploring the relationship between the baseline risk and the treatment effect as illustrated in Figure LABEL:fig:alternative-parametrization. As has been shown in the literature of heterogeneous treatment effects, the difference in regression functions may be more structured and simple than the individual components (e.g., Chernozhukov et al. 2018; Kennedy 2023). Some parametrizations might help harness this nontrivial structure (e.g., smoothness or sparsity) of each CATE function. For example, clustering on $\mu = (\mu_{1} - \mu_{0}, \mu_{2} - \mu_{0}, \cdots)$ could be easier than clustering on $\mu = (\mu_{0}, \mu_{1}, \mu_{2}, \cdots)$, when we are less concerned with the baseline risk. If we are interested in how a treatment shifts the quantiles (e.g., Chernozhukov & Hansen 2005; Zhang et al. 2012), we can redefine our conditional counterfactual mean vector by $\mu = (Q_{0}(q), Q_{1}(q))$ for some prespecified $q \in (0, 1)$ (for the median, $q = 1/2$), where $Q_{a}(q)$ is the quantile function of our potential outcome $Y^{a}$, i.e., $Q_{a}(q) = \inf\left\{y \in \mathbb{R} : q \leq F_{Y^{a}}(y)\right\}$ for $F_{Y^{a}} = \mathbb{P}(Y^{a} \leq y \mid X)$.
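As a small illustration of such a reparametrization, the sketch below maps a vector of conditional counterfactual means to the baseline-plus-contrasts form. The helper name `to_baseline_and_effect` and the numeric values are hypothetical. The example also shows that Euclidean distances between units depend on the chosen parametrization, so the choice is a modeling decision about which notion of similarity should define the clusters.

```python
from typing import Sequence, Tuple


def to_baseline_and_effect(mu: Sequence[float]) -> Tuple[float, ...]:
    """Map (mu_0, mu_1, ..., mu_m) to (mu_0, mu_1 - mu_0, ..., mu_m - mu_0)."""
    baseline = mu[0]
    return (baseline,) + tuple(m - baseline for m in mu[1:])


def sq_dist(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Two hypothetical units with zero treatment effect but different baselines.
u, v = (0.0, 0.0), (1.0, 1.0)

d_original = sq_dist(u, v)  # distance under mu = (mu_0, mu_1)
d_reparam = sq_dist(to_baseline_and_effect(u), to_baseline_and_effect(v))
```

Here the two units differ only in baseline risk, and the reparametrized vectors differ only in their first coordinate, which is why the second distance is smaller.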

In the sequel, we use the shorthand $\mu_{(i)} \equiv \mu(X_{i})$ and $\widehat{\mu}_{(i)} \equiv \widehat{\mu}(X_{i}) = \left[\widehat{\mu}_{1}(X_{i}), \ldots, \widehat{\mu}_{p}(X_{i})\right]^{\top}$. We let $\|x\|_{q}$ denote the $L_{q}$ norm for any fixed vector $x$. For a given function $f$, we use the notation $\|f\|_{\mathbb{P},q} = \left[\mathbb{P}(|f|^{q})\right]^{1/q} = \left[\int |f(z)|^{q} \, d\mathbb{P}(z)\right]^{1/q}$ for the $L_{q}(\mathbb{P})$-norm of $f$. Also, we let $\mathbb{P}$ denote the conditional expectation operator given the estimated function $\hat{f}$, as in $\mathbb{P}(\hat{f}) = \int \hat{f}(z) \, d\mathbb{P}(z)$. Notice that $\mathbb{P}(\hat{f})$ is random only if $\hat{f}$ depends on samples, in which case $\mathbb{P}(\hat{f}) \neq \mathbb{E}(\hat{f})$. Otherwise $\mathbb{P}$ and $\mathbb{E}$ can be used interchangeably. For example, if $\hat{f}$ is constructed on a separate (training) sample $\mathsf{D}^{n} = (Z_{1}, \ldots, Z_{n})$, then $\mathbb{P}\{\hat{f}(Z)\} = \mathbb{E}\{\hat{f}(Z) \mid \mathsf{D}^{n}\}$ for a new observation $Z \sim \mathbb{P}$. We let $\mathbb{P}_{n}$ denote the empirical measure as in $\mathbb{P}_{n}(f) = \mathbb{P}_{n}\{f(Z)\} = \frac{1}{n}\sum_{i=1}^{n} f(Z_{i})$. Lastly, we use the shorthand $a_{n} \lesssim b_{n}$ to denote $a_{n} \leq \mathsf{c}\, b_{n}$ for some universal constant $\mathsf{c} > 0$.


3 Plug-in Estimator

Suppose the $\{\mu_{(i)}\}$ are all known. In this case, the optimal codebook $C^{*}$ can be estimated by computing a minimizer of the empirical clustering risk, just as in the standard k-means clustering:

$$\begin{gathered}
\widehat{C}^{*} = \underset{C \in \mathcal{C}_{k}}{\mathrm{argmin}}\ R_{n}(C), \\
\text{where } \quad R_{n}(C) = \frac{1}{n}\sum_{i=1}^{n} \|\mu_{(i)} - \Pi_{C}(\mu_{(i)})\|_{2}^{2}.
\end{gathered} \tag{7}$$

The common method used to find $\widehat{C}^{*}$ is known as Lloyd’s algorithm (Lloyd 1982; Kanungo et al. 2002), yet there are other recent developments as well (Leskovec et al. 2020). A solution of such algorithms normally depends on the starting values. Some popular methods for choosing good starting values are discussed in, for example, Tseng & Wong (2005); Arthur & Vassilvitskii (2007).
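For readers unfamiliar with Lloyd’s algorithm, a bare-bones sketch in plain Python is given below. It alternates between assigning each point to its nearest center and replacing each center by the mean of its assigned points. The function name `lloyd`, the stopping rule, and the handling of empty clusters (the old center is kept) are simplifying assumptions made for this illustration; production implementations add careful initialization and multiple restarts.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def lloyd(points: Sequence[Point], init: Sequence[Point],
          max_iter: int = 100) -> List[Point]:
    """Run Lloyd's iterations from the starting centers `init`."""
    centers = [tuple(c) for c in init]
    for _ in range(max_iter):
        # Assignment step: group the points by their nearest center.
        groups: List[List[Point]] = [[] for _ in centers]
        for x in points:
            nearest = min(range(len(centers)),
                          key=lambda j: sq_dist(x, centers[j]))
            groups[nearest].append(x)
        # Update step: each center becomes the mean of its group;
        # a center with no assigned points is left where it is.
        new_centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:  # no change, so the iteration has converged
            break
        centers = new_centers
    return centers


points = [(0.0, 0.0), (0.0, 2.0), (4.0, 4.0), (4.0, 6.0)]
centers = lloyd(points, init=[(0.0, 0.0), (4.0, 6.0)])
```

Because the result depends on `init`, different starting values can lead to different local minimizers of the empirical risk, which is the sensitivity to starting values noted above.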

The problem of evaluating how good $\widehat{C}^{*}$ is, compared to the true $C^{*}$, has been extensively studied. Pollard (1981) proved strong consistency of k-means clustering in the sense that $\widehat{C}^{*} \xrightarrow{a.s.} C^{*}$ as well as $R(\widehat{C}^{*}) - R(C^{*}) \xrightarrow{a.s.} 0$. Borrowing techniques from statistical learning theory, Linder et al. (1994) and Biau et al. (2008) showed that when an input vector is almost surely bounded, the expected excess risk may decay at $O(\sqrt{\log n / n})$ and $O(1/\sqrt{n})$ rates, respectively. More recently, it has been shown that faster $O(\log n / n)$ or $O(1/n)$ rates can be attained under a margin condition on the source distribution (Levrard 2015, 2018); we shall go over this margin condition in detail shortly.

However, in our setting we cannot estimate $C^{*}$ using $\widehat{C}^{*}$ as in (7) since we do not know each $\mu_{(i)}$. Instead, we propose the following plug-in estimator

$$\begin{gathered}
\widehat{C} = \underset{C \in \mathcal{C}_{k}}{\mathrm{argmin}}\ \widehat{R}_{n}(C), \\
\text{where } \quad \widehat{R}_{n}(C) = \frac{1}{n}\sum_{i=1}^{n} \|\widehat{\mu}_{(i)} - \Pi_{C}(\widehat{\mu}_{(i)})\|_{2}^{2},
\end{gathered} \tag{8}$$

where $\widehat{\mu}$ is some initial estimator of the outcome regression functions. We will use sample splitting to avoid imposing empirical process conditions on the function class of $\mu$ (Kennedy 2016, 2022). For now, we suppose that $\widehat{\mu}$ is constructed on a separate, independent sample; this will be discussed in more detail in the following section.
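To make the objective in (8) concrete, the following minimal sketch evaluates the projection $\Pi_{C}$ and the empirical plug-in risk for a given codebook. It assumes the fitted values $\widehat{\mu}_{(i)}$ are already stored row-wise in an array; the function names `project` and `plugin_risk` and the toy numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

def project(mu_hat, C):
    """Nearest-center projection Pi_C applied row-wise.

    mu_hat: (n, p) array, row i holds the estimated vector mu_hat_(i).
    C:      (k, p) array, row j holds the center c_j.
    Returns (labels, projected) with shapes (n,) and (n, p).
    """
    # Squared Euclidean distance from every row to every center: (n, k).
    d2 = ((mu_hat[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)  # ties are broken in favor of the lowest index
    return labels, C[labels]

def plugin_risk(mu_hat, C):
    """Empirical plug-in risk: mean squared distance to the projection."""
    _, proj = project(mu_hat, C)
    return float(((mu_hat - proj) ** 2).sum(axis=1).mean())

# Toy input standing in for fitted regressions (hypothetical numbers).
mu_hat = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
C = np.array([[0.0, 1.0], [4.0, 1.0]])
risk = plugin_risk(mu_hat, C)
```

In this toy input every row lies at distance one from its nearest center, so the risk is the average of four unit squared distances.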

Due to the non-smoothness of the projection function $\Pi_{C}(\cdot)$, in general we would not expect the proposed plug-in estimator (8) to inherit the rate of convergence of $\widehat{\mu}$. To resolve this, we shall assume that the source distribution $\mathbb{P}$ is concentrated around $C^{*}$ in a similar spirit to Levrard (2015, 2018).

In the sequel, the set of minimizers of the clustering risk will be denoted by $\mathcal{C}^{*}_{k}$, i.e., $\mathcal{C}^{*}_{k}=\{C^{*}\in\mathcal{C}_{k}:R(C^{*})=\underset{C\in\mathcal{C}_{k}}{\min}R(C)\}$. For $C^{*}\in\mathcal{C}^{*}_{k}$, we define the Voronoi cell associated with a cluster center $c^{*}_{i}$ as the closed set

\[
V_{i}(C^{*})=\left\{\mu\mid\|\mu-c^{*}_{i}\|_{2}\leq\|\mu-c^{*}_{j}\|_{2},\ \forall j\neq i\right\},
\]

and its boundary by

\[
\partial V_{i}(C^{*})=\left\{\mu\mid\|\mu-c^{*}_{i}\|_{2}=\|\mu-c^{*}_{j}\|_{2},\ \forall j\neq i\right\}.
\]

We write the union of all boundaries induced by $C^{*}$ as

\[
\partial C^{*}=\underset{i}{\bigcup}\,\partial V_{i}(C^{*}).
\]

Next, for any $C^{*}$ and some $t>0$, we define a set $N_{C^{*}}(t)$ by

\[
N_{C^{*}}(t)=\underset{j}{\bigcup}\left\{\mu\in V_{j}(C^{*})\Bigm|\left|\|\mu-c^{*}_{j}\|_{2}-\underset{j\neq i}{\min}\|\mu-c^{*}_{i}\|_{2}\right|\leq t\right\}.
\]
Figure 3: Illustration of the margin condition in Definition 3.1, where we control the probability mass in the shaded area within the red-dashed lines specified by $\kappa$.
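The set $N_{C^{*}}(t)$ can also be explored numerically. The sketch below flags which points fall in the neighborhood for a given codebook and radius, and reports the fraction of points inside it, an empirical analogue of the probability mass discussed here. The function names `in_margin_neighborhood` and `margin_mass` and the toy points are hypothetical; in practice the true $\mu$ and $C^{*}$ are unknown, so this is only an illustration of the definition.

```python
import numpy as np

def in_margin_neighborhood(mu, C, t):
    """Boolean mask: is each row of mu inside N_C(t)?

    mu: (n, p) points, C: (k, p) centers with k >= 2, t: nonnegative radius.
    """
    # Euclidean (not squared) distances to all centers, sorted within each row.
    d = np.sqrt(((mu[:, None, :] - C[None, :, :]) ** 2).sum(axis=2))
    d.sort(axis=1)
    # A point is in N_C(t) when the gap between its closest and
    # second-closest centers is at most t.
    return (d[:, 1] - d[:, 0]) <= t

def margin_mass(mu, C, t):
    """Fraction of points inside N_C(t)."""
    return float(in_margin_neighborhood(mu, C, t).mean())

# Two centers on the horizontal axis; the boundary is the vertical axis.
C = np.array([[-1.0, 0.0], [1.0, 0.0]])
mu = np.array([[0.0, 5.0],    # on the boundary: gap 0
               [-3.0, 0.0],   # far from the boundary: gap 2
               [0.1, 0.0]])   # close to the boundary: gap 0.2
mask = in_margin_neighborhood(mu, C, t=0.5)
mass = margin_mass(mu, C, t=0.5)
```

Plotting `margin_mass` against `t` on a log-log scale gives a rough visual impression of the exponent $\alpha$ in a simulated setting.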

$N_{C^{*}}(t)$ can be viewed as a neighborhood of $\partial C^{*}$ in which the distances from a point $\mu$ to its two nearest cluster centers differ by at most $t$. For example, in 2-dimensional Euclidean space (i.e., when $p=2$), $N_{C^{*}}(t)$ forms a region bounded by hyperbolas that are symmetric around each segment in $\{\partial V_{i}(C^{*})=\partial V_{j}(C^{*})\mid i,j\in\{1,\ldots,k\}\}$, as shown in Figure 3. Now we introduce the following margin condition.

Definition 3.1 (Margin condition).

A distribution $\mathbb{P}$ satisfies a margin condition with radius $\kappa>0$ and rate $\alpha>0$ if and only if for all $0\leq t\leq\kappa$,

\[
\underset{C^{*}\in\mathcal{C}^{*}_{k}}{\sup}\,\mathbb{P}(\mu\in N_{C^{*}}(t))\lesssim t^{\alpha}.
\]

The above margin condition requires local control of the probability mass around $\partial C^{*}$ for $C^{*}\in\mathcal{C}_{k}^{*}$, and hence implies that every optimal codebook forms a "natural classification". A larger $\alpha$ indicates that $\mathbb{P}$ is "more structured", facilitating the formation of such a natural classifier, whereas a smaller $\alpha$ suggests that a natural classifier is less likely to exist; when $\alpha<1$, the density is unbounded near $\partial C^{*}$. Levrard (2015, 2018) used the same condition with $\alpha=1$ to achieve fast $O(1/n)$ rates of convergence for the excess risk, and provided some instances of the corresponding natural classifiers. This type of margin condition, where the weight of the neighborhood of the critical region is controlled, has often been adopted for a wide range of problems in causal inference involving estimation of non-smooth target parameters (e.g., van der Laan & Luedtke 2015; Luedtke & Van Der Laan 2016; Kennedy et al. 2018; Levis et al. 2023; Kim & Zubizarreta 2023). We also introduce the following mild boundedness and consistency assumptions.

Assumption A1.

$\|\mu_{a}\|_{\infty},\|\widehat{\mu}_{a}\|_{\infty}\leq B<\infty$ a.s.

Assumption A2.

$\max_{a}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\infty}=o_{\mathbb{P}}(1)$.

In the next theorem, we give upper bounds of the excess risk, showing that the proposed plug-in estimator (8) is risk consistent.

Theorem 3.1.

Suppose $\mathbb{P}$ satisfies the margin condition with some $\kappa>0$, $\alpha>0$, and let

\[
R_{1,n}=\max_{a}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\mathbb{P},1}+\max_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}^{\alpha+1}+\frac{1}{\kappa}\max_{a}\left(\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\mathbb{P},1}\right).
\]

Then under Assumptions A1, A2, we have

\[
\mathbb{P}\left\{R(\widehat{C})-R(C^{*})\right\}=O\left(\frac{1}{\sqrt{n}}+R_{1,n}\right)\quad\text{and}\quad R(\widehat{C})-R(C^{*})=O_{\mathbb{P}}\left(\sqrt{\frac{\log n}{n}}+R_{1,n}\right),
\]

whenever $\widehat{\mu}$ is constructed from a separate independent sample.

A proof of the above theorem and all subsequent proofs can be found in Web Appendix B. The term $\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}^{\alpha+1}$ in $R_{1,n}$ commonly appears in the literature involving efficient estimation of non-smooth functionals based on the margin condition, including those listed above. The term $\frac{1}{\kappa}\left(\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\mathbb{P},1}\right)$ is due to the fact that the margin condition in Definition 3.1 only requires a local control in the neighborhood $N_{C^{*}}(\kappa)$; if $\kappa\rightarrow\infty$, this term vanishes. Theorem 3.1 essentially states that the extra price we pay for excess risk is the estimation error of the outcome regression functions.
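As a worked illustration of how the three terms of $R_{1,n}$ compare, suppose (purely as a hypothetical assumption, not a condition of Theorem 3.1) that $\max_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}=O_{\mathbb{P}}(n^{-r})$ for some $r>0$. Since $\|f\|_{\mathbb{P},1}\leq\|f\|_{\infty}$ for any $f$, each term can be bounded in probability as

\[
R_{1,n}\lesssim n^{-r}+n^{-r(\alpha+1)}+\frac{1}{\kappa}n^{-2r}=O(n^{-r}),
\]

because $\alpha>0$ gives $r(\alpha+1)>r$ and $2r>r$. Under this assumption the first-order term dominates, and the bound on the excess risk reads $O_{\mathbb{P}}\left(\sqrt{\log n/n}+n^{-r}\right)$.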

The fact that $\widehat{C}$ is risk consistent does not imply that $\widehat{C}$ is actually close to the true codebook $C^{*}$. To assure consistency of $\widehat{C}$, an additional condition is required as follows.

Assumption A3.

$C^{*}$ is unique up to relabeling of its coordinates: i.e., $\mathcal{C}_{k}^{*}$ is a singleton.

The uniqueness specified in Assumption A3 is also used in earlier work by Pollard (1981, 1982). The next theorem states that the proposed plug-in estimator is consistent.

Theorem 3.2.

Under Assumptions A1 - A3, $\widehat{C}$ computed by the plug-in estimator (8) converges in probability to $C^{*}$.

The map $C\mapsto R(C)$ from $\mathbb{R}^{kp}$ into $\mathbb{R}$ is differentiable if $\|\mu\|_{\mathbb{P},2}<\infty$ (Pollard 1982). Based on Theorems 3.1 and 3.2, one may thus characterize the rate of convergence of $\widehat{C}$ as stated in the next corollary.

Corollary 3.3.

Suppose that $\mathbb{P}$ satisfies the margin condition with some $\kappa>0$, $\alpha>0$, and that Assumptions A1 - A3 hold. Also assume that $\widehat{\mu}$ is constructed from a separate independent sample. Then

\[
\|\widehat{C}-C^{*}\|_{1}=\sum_{j=1}^{k}\|\widehat{c}_{j}-c^{*}_{j}\|_{1}=O_{\mathbb{P}}\left(\sqrt{\frac{\log n}{n}}+R_{1,n}\right).
\]

The plug-in estimator is simple and intuitive. When an initial estimator is available or $\widehat{\mu}$ is fitted in a separate independent sample, (8) is readily implementable using the standard, off-the-shelf algorithms including Lloyd's algorithm. Otherwise, we can estimate the risk via cross-fitting, where we swap the samples, repeat the procedure, and average the results to regain full sample size efficiency. Then we compute the optimal codebook that minimizes the estimated risk. We shall address this in further detail shortly.
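The sample-splitting version of the plug-in procedure can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the simulated data, the working regression model in `features`, the per-arm least-squares fit, and all function names are hypothetical, and any regression method could take the place of the least-squares step. Lloyd's algorithm is written out directly and returns a local optimum only.

```python
import numpy as np

def features(x):
    # Hypothetical working model: intercept, slope, and a jump at zero.
    return np.column_stack([np.ones_like(x), x, (x > 0).astype(float)])

def fit_outcome_regressions(x, a, y, levels):
    """Least-squares fit of mu_a within each arm; one coefficient vector per level."""
    return [np.linalg.lstsq(features(x[a == lv]), y[a == lv], rcond=None)[0]
            for lv in levels]

def predict_mu(coefs, x):
    """Stack the fitted mu_a(x) into an (n, p) matrix."""
    F = features(x)
    return np.column_stack([F @ beta for beta in coefs])

def lloyd(points, k, n_iter=100, seed=0):
    """Plain Lloyd iterations; returns a (k, p) codebook (a local optimum)."""
    rng = np.random.default_rng(seed)
    # Initialize the centers at k distinct data points.
    C = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_C = C.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                new_C[j] = members.mean(axis=0)
        if np.allclose(new_C, C):
            break
        C = new_C
    return C

# Simulated data (illustrative only): two treatment levels, and a
# treatment effect that jumps from 0 to 3 when x crosses zero.
rng = np.random.default_rng(1)
n = 4000
x = rng.uniform(-1, 1, n)
a = rng.integers(0, 2, n)
mu_true = np.column_stack([0.2 * x, 0.2 * x + 3.0 * (x > 0)])
y = mu_true[np.arange(n), a] + rng.normal(0, 1, n)

# Sample splitting: fit mu_hat on one half, cluster fitted values on the other.
train = np.arange(n) < n // 2
coefs = fit_outcome_regressions(x[train], a[train], y[train], levels=[0, 1])
mu_hat = predict_mu(coefs, x[~train])
C_hat = lloyd(mu_hat, k=2)
C_hat = C_hat[np.argsort(C_hat[:, 1])]  # order centers for readability
```

In this simulated design the two population cluster centers are approximately $(-0.1,-0.1)$ and $(0.1,3.1)$, so `C_hat` should land near those values up to estimation error.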

Note that the convergence rate in Theorem 3.1 is essentially inherited from $\widehat{\mu}$. Hence, for either the risk or the codebook, rates of convergence would be expected to be slower than $\sqrt{n}$, with non-normal limiting distributions not centered at the true parameter, unless careful undersmoothing of particular estimators (e.g., splines) is used. Consequently, valid confidence intervals (even via the bootstrap) may not be available. In the following section, we will develop an estimator that can be $\sqrt{n}$-consistent and asymptotically normal even if the nuisance functions are estimated flexibly at slower than $\sqrt{n}$ rates, in a wide variety of settings.

4 Semiparametric Estimator

In this section, we describe estimators that can achieve faster rates than the plug-in estimator from Section 3 based upon semiparametric efficiency theory.

4.1 Proposed estimator

For convenience, we introduce the following additional notation:

\[
\begin{aligned}
\pi_{a}(X)&=\mathbb{P}\left(A=a\mid X\right),\\
\varphi_{1,a}(Z;\eta_{a})&=\frac{\mathbbm{1}(A=a)}{\pi_{a}(X)}\left\{Y-\mu_{A}(X)\right\}+\mu_{a}(X),\\
\varphi_{2,a}(Z;\eta_{a})&=2\mu_{a}(X)\frac{\mathbbm{1}(A=a)}{\pi_{a}(X)}\left\{Y-\mu_{A}(X)\right\}+\mu_{a}^{2}(X),
\end{aligned}
\tag{9}
\]

where $\eta_{a}=\{\pi_{a},\mu_{a}\}$ denotes the set of relevant nuisance functions for each $a$. Here $\pi_{a}$ is the conditional probability of receiving treatment $a\in\mathcal{A}$; when $p=2$, $\pi_{1}$ denotes the propensity score. Notice that $\varphi_{1,a}$ and $\varphi_{2,a}$ are the uncentered efficient influence functions for the parameters $\mathbb{E}\left\{\mu_{a}(X)\right\}$ and $\mathbb{E}\left\{\mu_{a}^{2}(X)\right\}$, respectively. The efficient influence function is important for constructing optimal estimators since its variance equals the efficiency bound (in the asymptotic minimax sense). Shortly, we shall see that exploiting the efficient influence function endows our estimators with desirable properties such as double robustness or general second-order bias, allowing us to relax nonparametric conditions on nuisance function estimation. We refer the interested reader to, for example, van der Vaart (2002); Tsiatis (2007); Kennedy (2016, 2022) for more details about influence functions and semiparametric efficiency theory.
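The two uncentered influence functions translate directly into array operations. The sketch below evaluates $\varphi_{1,a}$ and $\varphi_{2,a}$ for every unit and every treatment level at once. The coding of treatments as $0,\ldots,p-1$, the array layout, the function name `phi_1_and_2`, and the toy numbers are assumptions made for illustration only.

```python
import numpy as np

def phi_1_and_2(a_obs, y, mu, pi):
    """Uncentered influence-function values for every unit and treatment level.

    a_obs: (n,) observed treatment, coded 0, ..., p-1 (an assumed coding).
    y:     (n,) observed outcome.
    mu:    (n, p) values mu_a(X_i); pi: (n, p) values pi_a(X_i).
    Returns phi1, phi2, each of shape (n, p).
    """
    n, p = mu.shape
    # indicator[i, a] = 1 when unit i received level a.
    indicator = (a_obs[:, None] == np.arange(p)[None, :]).astype(float)
    # The residual Y - mu_A(X) uses the regression at the observed level.
    resid = y - mu[np.arange(n), a_obs]
    weighted = indicator / pi * resid[:, None]
    phi1 = weighted + mu
    phi2 = 2.0 * mu * weighted + mu ** 2
    return phi1, phi2

# Two units, two treatment levels (hypothetical numbers).
a_obs = np.array([0, 1])
y = np.array([1.0, 5.0])
mu = np.array([[0.5, 2.0], [1.0, 4.0]])
pi = np.array([[0.5, 0.5], [0.25, 0.75]])
phi1, phi2 = phi_1_and_2(a_obs, y, mu, pi)
```

For the first unit, the residual is $1-0.5=0.5$ and the inverse-probability weight is $1/0.5$, so $\varphi_{1,0}=1+0.5=1.5$ and $\varphi_{2,0}=2(0.5)(1)+0.25=1.25$; for levels that were not received, the correction term is zero and the values reduce to $\mu_{a}$ and $\mu_{a}^{2}$.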

Next, for any fixed $C$, we define

\[
\varphi_{C}(Z;\eta)=\sum_{a\in\mathcal{A}}\left\{\varphi_{2,a}(Z;\eta_{a})-2\varphi_{1,a}(Z;\eta_{a})\left[\Pi_{C}\left(\mu\right)\right]_{a}+\left[\Pi_{C}\left(\mu\right)\right]^{2}_{a}\right\},
\tag{10}
\]

where we let $\eta=\left\{\eta_{a}\right\}_{a\in\mathcal{A}}$ denote the set of all nuisance functions collectively, and $\left[\Pi_{C}\left(\mu\right)\right]_{a}$ be the $a$-th element of the projection $\Pi_{C}\left(\mu\right)$. $\varphi_{C^{*}}(Z;\eta)$ is the uncentered efficient influence function for $R(C^{*})$ whenever $\mathbb{P}$ satisfies the margin condition, as formally stated below.

Lemma 4.1.

Suppose that Assumptions A1, A2 hold, and that $\mathbb{P}$ satisfies the margin condition with some $\kappa>0$ and $\alpha>0$. If, for every optimal codebook $C^{*}\in\mathcal{C}_{k}^{*}$, we let $\phi_{C^{*}}(z;\mathbb{P})=\varphi_{C^{*}}(z;\mathbb{P})-\int\varphi_{C^{*}}(z;\mathbb{P})d\mathbb{P}$, then $\phi_{C^{*}}$ is the efficient influence function for $R(C^{*})$.

We now describe how to construct the proposed estimator for $R(C)$. Following Robins et al. (2008), Zheng & Van Der Laan (2010), Chernozhukov et al. (2017), Newey & Robins (2018), Kennedy (2023), and many others, we use sample splitting (or cross-fitting) to allow for arbitrarily complex nuisance estimators $\widehat{\eta}$. Specifically, with fixed $K$, we split the data into $K$ disjoint groups, each of size approximately $n/K$, by drawing variables $(B_{1},\ldots,B_{n})$ independent of the data; $B_{i}=b$ indicates that subject $i$ was split into group $b\in\{1,\ldots,K\}$. This could be done, for example, by drawing each $B_{i}$ uniformly from $\{1,\ldots,K\}$. We propose our estimator for $R(C)$ as

\[
\begin{aligned}
\widehat{R}(C)&=\sum_{b=1}^{K}\left\{\frac{1}{n}\sum_{i=1}^{n}\mathbbm{1}(B_{i}=b)\right\}\mathbb{P}_{n}^{b}\left\{\varphi_{C}(Z;\widehat{\eta}_{-b})\right\}\\
&\equiv\mathbb{P}_{n}\left\{\varphi_{C}(Z;\widehat{\eta}_{-K})\right\},
\end{aligned}
\tag{11}
\]

where we let $\mathbb{P}_{n}^{b}$ denote empirical averages only over the set of units $\{i:B_{i}=b\}$ in group $b$ and let $\widehat{\eta}_{-b}$ denote the nuisance estimator constructed only using those units $\{i:B_{i}\neq b\}$. In the following section, we will show that the above estimator $\widehat{R}(C^{*})$ is asymptotically efficient under weak conditions for any $C^{*}\in\mathcal{C}^{*}_{k}$.
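The cross-fitted risk estimator in (11) can be sketched in a few lines. Because the fold weights multiply fold-specific averages, the estimator equals the plain average over all units of $\varphi_{C}$ evaluated with out-of-fold nuisance estimates, which is what the code computes. Everything specific here is an assumption for illustration: the treatment coding $0,\ldots,p-1$, the simulated data, the hypothetical learner `fit_nuisance` (per-arm least squares with constant propensities, adequate only because treatment is randomized in this simulation), and the function names.

```python
import numpy as np

def phi_C(C, a_obs, y, mu, pi):
    """Per-unit values of phi_C in (10), with nuisances evaluated at (mu, pi)."""
    n, p = mu.shape
    ind = (a_obs[:, None] == np.arange(p)[None, :]).astype(float)
    w = ind / pi * (y - mu[np.arange(n), a_obs])[:, None]
    phi1 = w + mu
    phi2 = 2.0 * mu * w + mu ** 2
    # Projection of each row of mu onto its nearest center in C.
    d2 = ((mu[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    proj = C[d2.argmin(axis=1)]
    return (phi2 - 2.0 * phi1 * proj + proj ** 2).sum(axis=1)

def crossfit_risk(C, x, a_obs, y, fit_nuisance, K=2, seed=0):
    """Cross-fitted risk; fit_nuisance(x, a, y) returns a map x -> (mu, pi)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = rng.integers(0, K, size=n)  # fold labels drawn uniformly
    vals = np.empty(n)
    for b in range(K):
        held_out = folds == b
        # Nuisances are fitted on all folds except b, then evaluated on fold b.
        predict = fit_nuisance(x[~held_out], a_obs[~held_out], y[~held_out])
        mu, pi = predict(x[held_out])
        vals[held_out] = phi_C(C, a_obs[held_out], y[held_out], mu, pi)
    return float(vals.mean())

def fit_nuisance(x, a_obs, y, p=2):
    """Hypothetical learner: per-arm least squares and constant propensities."""
    feats = lambda u: np.column_stack([np.ones_like(u), u, (u > 0).astype(float)])
    betas = [np.linalg.lstsq(feats(x[a_obs == lv]), y[a_obs == lv], rcond=None)[0]
             for lv in range(p)]
    props = np.array([(a_obs == lv).mean() for lv in range(p)])
    def predict(u):
        mu = np.column_stack([feats(u) @ b for b in betas])
        pi = np.tile(props, (len(u), 1))
        return mu, pi
    return predict

# Simulated randomized study (illustrative only).
rng = np.random.default_rng(1)
n = 4000
x = rng.uniform(-1, 1, n)
a_obs = rng.integers(0, 2, n)
mu_true = np.column_stack([0.2 * x, 0.2 * x + 3.0 * (x > 0)])
y = mu_true[np.arange(n), a_obs] + rng.normal(0, 1, n)

C_good = np.array([[-0.1, -0.1], [0.1, 3.1]])  # near the population optimum
C_bad = np.array([[0.0, 0.0], [0.0, 1.0]])     # a deliberately poor codebook
risk_good = crossfit_risk(C_good, x, a_obs, y, fit_nuisance)
risk_bad = crossfit_risk(C_bad, x, a_obs, y, fit_nuisance)
```

Note that, unlike the plug-in risk, this estimate is not constrained to be nonnegative in finite samples, since the correction terms are mean-zero noise around the target; comparing `risk_good` with `risk_bad` shows that the estimator still separates a good codebook from a poor one.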

Then we propose estimating the optimal cluster codebook $C^{*}$ as a minimizer of $\widehat{R}(C)$:

\[
\widehat{C}=\underset{C\in\mathcal{C}_{k}}{\mathrm{argmin}}\;\widehat{R}(C).
\tag{12}
\]

After finding the function $\widehat{R}(\cdot)$, $\widehat{C}$ can be computed on the full sample. Note that the cross-fitting procedure described above is equally applicable to the plug-in estimator (8). The minimizer in (12) can be computed using first-order (e.g., gradient descent) or second-order (e.g., Newton-Raphson) methods based on the derivative formulas (13) and (14) specified in the following section.

4.2 Asymptotic Properties

In this subsection, we analyze asymptotic properties of the proposed estimator. For notational simplicity, we define the remainder term that appears in our results as follows:

\[
\begin{aligned}
R_{2,n}&=\max_{a}\left\{\|\widehat{\mu}_{a}-\mu_{a}\|_{\mathbb{P},2}\left(\|\widehat{\mu}_{a}-\mu_{a}\|_{\mathbb{P},2}+\|\widehat{\pi}_{a}-\pi_{a}\|_{\mathbb{P},2}\right)\right\}+\max_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}^{\alpha+1}\\
&\quad+\frac{1}{\kappa}\max_{a}\left(\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\mathbb{P},1}\right).
\end{aligned}
\]

Note that the terms in $R_{2,n}$ are all second-order, as opposed to $R_{1,n}$, the analogous bias term for the plug-in estimator in the previous section. We introduce the following additional assumptions pertaining to our nuisance estimation.

Assumption A4.

$\mathbb{P}\left\{\epsilon\leq\widehat{\pi}_{a}(X)\leq 1-\epsilon\right\}=1$ for some $\epsilon>0$.

Assumption A5.

$\underset{a}{\max}\left\{\|\widehat{\pi}_{a}-\pi_{a}\|_{\mathbb{P},2}+\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}\right\}=o_{\mathbb{P}}(1)$.

Assumption A6.

$R_{2,n}=o_{\mathbb{P}}(n^{-1/2})$.

Assumption A5 is a mild consistency assumption, with no requirement on rates of convergence. Assumption A6 may hold, for example, under standard $n^{-1/4}$-type rate conditions on $\widehat{\eta}$, which can be attained under smoothness, sparsity, or other structural constraints (e.g., Kennedy 2016).
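To illustrate how Assumption A6 can follow from $n^{-1/4}$-type rates, the following back-of-the-envelope calculation bounds each term of $R_{2,n}$. It is only a sketch under two extra assumptions that are ours and not part of the stated results: the sup-norm error of $\widehat{\mu}_a$ is also $o_{\mathbb{P}}(n^{-1/4})$, and the margin exponent satisfies $\alpha\geq 1$.

```latex
% Illustrative assumptions (not from the text): \alpha \geq 1 and
%   \max_a \|\widehat{\mu}_a - \mu_a\|_\infty = o_{\mathbb{P}}(n^{-1/4}),
%   \max_a \|\widehat{\pi}_a - \pi_a\|_{\mathbb{P},2} = o_{\mathbb{P}}(n^{-1/4}).
% Because \mathbb{P} is a probability measure,
%   \|f\|_{\mathbb{P},1} \leq \|f\|_{\mathbb{P},2} \leq \|f\|_\infty .
\begin{align*}
\max_a\left\{\|\widehat{\mu}_a-\mu_a\|_{\mathbb{P},2}
  \left(\|\widehat{\mu}_a-\mu_a\|_{\mathbb{P},2}
       +\|\widehat{\pi}_a-\pi_a\|_{\mathbb{P},2}\right)\right\}
  &= o_{\mathbb{P}}(n^{-1/4})\cdot o_{\mathbb{P}}(n^{-1/4})
   = o_{\mathbb{P}}(n^{-1/2}), \\
\max_a\|\widehat{\mu}_a-\mu_a\|_\infty^{\alpha+1}
  &= o_{\mathbb{P}}\left(n^{-(\alpha+1)/4}\right)
   = o_{\mathbb{P}}(n^{-1/2}) \quad (\alpha\geq 1), \\
\frac{1}{\kappa}\max_a\left(\|\widehat{\mu}_a-\mu_a\|_\infty
  \|\widehat{\mu}_a-\mu_a\|_{\mathbb{P},1}\right)
  &= \frac{1}{\kappa}\,o_{\mathbb{P}}(n^{-1/2})
   = o_{\mathbb{P}}(n^{-1/2}) \quad (\kappa \text{ fixed}).
\end{align*}
% Summing the three terms gives R_{2,n} = o_{\mathbb{P}}(n^{-1/2}), i.e., A6.
```

When $0<\alpha<1$, the same argument goes through only if the sup-norm error is faster, namely $o_{\mathbb{P}}(n^{-1/\{2(\alpha+1)\}})$, since the second term is then the binding one.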

Lemma 4.1 allows us to specify conditions under which $\widehat{R}(C^{*})$ is an asymptotically normal and efficient estimator for $R(C^{*})$, for any $C^{*}$ satisfying the margin condition, as stated in the following lemma.

Lemma 4.2.

Suppose that the margin condition in Definition 3.1 is satisfied with some $\alpha>0$, $\kappa>0$, and that Assumptions A1, A4, A5, and A6 hold. Then for every optimal codebook $C^{*}\in\mathcal{C}_{k}^{*}$,

\[
\sqrt{n}\left\{\widehat{R}(C^{*})-R(C^{*})\right\}\leadsto N\left(0,\text{var}\left(\varphi_{C^{*}}\right)\right),
\]

where $\varphi_{C}$ is specified in (10).
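Lemma 4.2 suggests a Wald-type interval for the risk at a fixed codebook. The sketch below is illustrative only: it assumes that $\widehat{R}(C)$ is the sample average of the values $\varphi_{C}(Z_i;\widehat{\eta})$ from (10), that these values have already been computed and stored in a hypothetical array `phi_values`, and the function name is ours.

```python
import numpy as np


def risk_wald_interval(phi_values, z=1.96):
    """Sample-average risk estimate and a Wald interval.

    phi_values: 1-d array holding phi_C(Z_i; eta_hat) for a fixed codebook C
                (hypothetical input; computing it requires (10)).
    z:          normal quantile, 1.96 for a nominal 95% interval.
    """
    phi_values = np.asarray(phi_values, dtype=float)
    n = phi_values.shape[0]
    estimate = phi_values.mean()
    # Plug-in estimate of sqrt(var(phi_C) / n).
    std_error = phi_values.std(ddof=1) / np.sqrt(n)
    return estimate, (estimate - z * std_error, estimate + z * std_error)
```

In practice $C^{*}$ is unknown, so such an interval would typically be evaluated at an estimated codebook; the lemma itself only covers fixed optimal codebooks.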

Under conditions similar to those of Theorem 3.2, we can show that the proposed codebook estimator (12) is consistent, as stated in the following corollary.

Corollary 4.3.

If Assumptions A1, A3, A4, and A5 hold, then $\widehat{C}$ computed by the semiparametric estimator (12) converges in probability to $C^{*}$.

We now focus on the asymptotic properties of $\widehat{C}$, particularly on identifying conditions that assure $\sqrt{n}$ consistency and asymptotic normality in large nonparametric models. In the next theorem, our first main result of this section, we compute an asymptotic bound for the excess risk, as well as the rate of convergence for $\widehat{C}$.

Theorem 4.4.

Suppose that $\mathbb{P}$ satisfies the margin condition with some $\kappa>0$, $\alpha>0$, and that Assumptions A1, A3, A4, and A5 hold. Also, assume that $p_{j}(\eta,C^{*})>0$, $\forall j$. Then, if $\left\|\mathbb{E}(Y^{2}\mid X)\right\|_{\infty}<\infty$, we have

\[
\|\widehat{C}-C^{*}\|_{1}=O_{\mathbb{P}}\left(\frac{1}{\sqrt{n}}+R_{2,n}\right)\quad\text{and}\quad R(\widehat{C})-R(C^{*})=o_{\mathbb{P}}\left(\frac{1}{\sqrt{n}}+R_{2,n}\right).
\]

Note that the condition $p_{j}(\eta,C^{*})>0$ is equivalent to $\mathbb{P}\{V_{j}(C^{*})\}>0$, i.e., there are no vacant Voronoi cells, and guarantees that the derivative matrix $M(C^{*},\eta)$ is nonsingular. Theorem 4.4 shows that the proposed codebook estimator $\widehat{C}$ and the associated excess risk may attain substantially faster rates than the nuisance estimators $\widehat{\eta}$. Specifically, if $R_{2,n}=O_{\mathbb{P}}\left(n^{-1/2}\right)$ (a weaker assumption than A6), we can attain $\sqrt{n}$ rates for $\widehat{C}$ and faster-than-$\sqrt{n}$ rates for the excess risk, by virtue of the fact that $R_{2,n}$ involves products of nuisance estimation errors.

Asymptotic normality of estimated codebooks in standard k-means clustering was first studied by Pollard (1982). However, extending the classic result of Pollard (1982) to causal clustering poses some difficulties due to the complexity of our new risk function $\mathbb{E}(\varphi_{C})$, which relies on multiple nuisance components in an infinite-dimensional function space. To achieve asymptotic normality for our estimated codebook $\widehat{C}$, we shall adopt the logic employed in Kennedy et al. (2023).

Let $\upvarphi_{1}(z;\eta)=[\varphi_{1,1}(z;\eta_{1}),\ldots,\varphi_{1,p}(z;\eta_{p})]^{\top}$, where each $\varphi_{1,a}$ is defined in (9). With a slight abuse of notation, as was done in Bottou & Bengio (1994), we compute the derivative of $\varphi_{C}$ at any $C^{\prime}\in\mathcal{C}_{k}$ for some fixed $\bar{\eta}$ by

\begin{align*}
\frac{\partial}{\partial C}\varphi_{C}(Z;\bar{\eta})\Big|_{C=C^{\prime}} &\equiv\upvarphi_{C^{\prime}}(Z;\bar{\eta}) \tag{13} \\
&=2\left[(c^{\prime}_{1}-\upvarphi_{1}(Z;\bar{\eta}))\mathbbm{1}\{1=d(\bar{\mu},C^{\prime})\},\ldots,(c^{\prime}_{k}-\upvarphi_{1}(Z;\bar{\eta}))\mathbbm{1}\{k=d(\bar{\mu},C^{\prime})\}\right]^{\top}
\end{align*}

where we let $d(\bar{\mu},C^{\prime})=\underset{j\in\{1,\ldots,k\}}{\mathop{\mathrm{argmin}}}\|c^{\prime}_{j}-\bar{\mu}\|_{2}^{2}$, i.e., the subscript for the nearest center to a given $\bar{\mu}$. Similarly, one may compute the derivative matrix of $\mathbb{P}\left\{\upvarphi_{C}(Z;\bar{\eta})\right\}$ at $C^{\prime}$:

\[
\frac{\partial}{\partial C}\mathbb{P}\left\{\upvarphi_{C}(Z;\bar{\eta})\right\}\Big|_{C=C^{\prime}}\equiv M(C^{\prime},\bar{\eta})=2\,\text{diag}\left(\bm{1}_{(p)}p_{1}(\bar{\eta},C^{\prime}),\ldots,\bm{1}_{(p)}p_{k}(\bar{\eta},C^{\prime})\right), \tag{14}
\]

where $p_{j}(\bar{\eta},C^{\prime})=\mathbb{P}\{j=d(\bar{\mu},C^{\prime})\}$ and $\bm{1}_{(p)}$ is a $p$-dimensional vector of all ones.
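The two derivative formulas translate directly into array operations. The following is a minimal NumPy sketch of their empirical counterparts, not the authors' implementation: the inputs `phi1` (an $n\times p$ array of pseudo-outcomes $\upvarphi_{1}(Z_i;\bar{\eta})$) and `mu_bar` (an $n\times p$ array of $\bar{\mu}(X_i)$) are assumed to be precomputed, all function names are hypothetical, and the population probabilities $p_j$ in (14) are replaced by empirical cell frequencies.

```python
import numpy as np


def nearest_center(mu_bar, codebook):
    """d(mu_bar, C'): index (0-based) of the nearest center for each row."""
    # Squared Euclidean distances between units and centers, shape (n, k).
    sq_dist = ((mu_bar[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return sq_dist.argmin(axis=1)


def empirical_gradient(phi1, mu_bar, codebook):
    """Sample average of (13), returned as a (k, p) array.

    Row j equals 2 * mean_i[(c_j - phi1_i) * 1{d(mu_bar_i, C') = j}].
    """
    n, k = phi1.shape[0], codebook.shape[0]
    onehot = np.eye(k)[nearest_center(mu_bar, codebook)]  # (n, k) indicators
    counts = onehot.sum(axis=0)                           # units per cell
    sums = onehot.T @ phi1                                # per-cell sums of phi1
    return 2.0 * (counts[:, None] * codebook - sums) / n


def derivative_matrix(mu_bar, codebook):
    """Plug-in version of (14), shape (k*p, k*p), using cell frequencies."""
    k, p = codebook.shape
    labels = nearest_center(mu_bar, codebook)
    p_hat = np.bincount(labels, minlength=k) / mu_bar.shape[0]
    return 2.0 * np.diag(np.repeat(p_hat, p))
```

Note that the cell indicator depends on `mu_bar` rather than on `phi1`, mirroring the indicator $\mathbbm{1}\{j=d(\bar{\mu},C^{\prime})\}$ in (13); a vacant cell shows up as a zero block on the diagonal of the returned matrix.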

Notice that the solutions of the minimization problem (12) can be equivalently expressed by solutions to the following empirical moment condition (up to $o_{\mathbb{P}}(1/\sqrt{n})$ error):

\[
\mathbb{P}_{n}\left\{\upvarphi_{\widehat{C}}(Z;\widehat{\eta}_{-K})\right\}=o_{\mathbb{P}}\left(\frac{1}{\sqrt{n}}\right).
\]
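One simple way to approximately solve this moment condition is plain gradient descent on the codebook. The sketch below is a hedged illustration under our own assumptions: the arrays `phi1` and `mu_hat` (both $n\times p$, holding cross-fitted pseudo-outcomes and regression estimates) are taken as given, and the fixed step size, tolerance, and function name are arbitrary choices rather than anything prescribed in the text.

```python
import numpy as np


def solve_moment_condition(phi1, mu_hat, init_codebook,
                           step=0.5, tol=1e-8, max_iter=10_000):
    """Gradient descent driving the sample average of (13) toward zero.

    phi1:          (n, p) pseudo-outcomes (hypothetical precomputed input).
    mu_hat:        (n, p) estimated mean vectors used for cell assignment.
    init_codebook: (k, p) starting centers.
    """
    codebook = np.array(init_codebook, dtype=float)
    n, k = phi1.shape[0], codebook.shape[0]
    for _ in range(max_iter):
        # Assign each unit to its nearest center based on mu_hat.
        sq_dist = ((mu_hat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        onehot = np.eye(k)[sq_dist.argmin(axis=1)]
        # Empirical moment: 2 * mean[(c_j - phi1_i) * 1{unit i in cell j}].
        grad = 2.0 * (onehot.sum(axis=0)[:, None] * codebook - onehot.T @ phi1) / n
        if np.abs(grad).max() < tol:
            break
        codebook -= step * grad
    return codebook
```

At a stationary point each center equals the average of `phi1` over the units assigned to its cell, which is the Lloyd-type fixed point implied by the moment condition. As with ordinary k-means, the result can depend on the initial codebook.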

In the next theorem, we give the second main result of this section, which presents conditions allowing for $\sqrt{n}$ consistency and asymptotic normality of $\widehat{C}$.

Theorem 4.5.

Suppose that $\mathbb{P}$ satisfies the margin condition with some $\kappa>0$ and $\alpha=\infty$, and that Assumptions A1, A3, A4, and A5 hold. Also, assume that $p_{j}(\eta,C^{*})>0$, $\forall j$. Then, if $\left\|\mathbb{E}(Y^{2}\mid X)\right\|_{\infty}<\infty$,

\[
\widehat{C}-C^{*}=-M(C^{*},\eta)^{-1}(\mathbb{P}_{n}-\mathbb{P})\left\{\upvarphi_{C^{*}}(Z;\eta)\right\}+O_{\mathbb{P}}\left(R_{2,n}+o_{\mathbb{P}}\left(\frac{1}{\sqrt{n}}\right)\right).
\]

Theorem 4.5 requires a stronger version of the margin condition where $N_{C^{*}}(\kappa)$ is completely empty. Note that we still do not restrict the radius $\kappa$. Importantly, Theorem 4.5 implies that $\widehat{C}$ can be not only $\sqrt{n}$ consistent but also asymptotically normal under the rate condition in Assumption A6, which may hold even when the nuisance estimators are generic and flexibly fit. In this case, asymptotically valid confidence intervals can be readily constructed via bootstrap methods.
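Besides the bootstrap, the linear expansion in Theorem 4.5 also suggests a plug-in sandwich variance of the form $M^{-1}\,\text{cov}(\upvarphi_{C^{*}})\,M^{-1}/n$. The sketch below illustrates that alternative and is not a procedure described in the text: `phi1` and `mu_hat` are hypothetical precomputed $n\times p$ arrays, the function name is ours, and population quantities are replaced by sample analogues evaluated at the estimated codebook.

```python
import numpy as np


def codebook_sandwich_se(phi1, mu_hat, codebook):
    """Plug-in standard errors for each coordinate of each center, shape (k, p)."""
    phi1 = np.asarray(phi1, dtype=float)
    mu_hat = np.asarray(mu_hat, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    n = phi1.shape[0]
    k, p = codebook.shape
    sq_dist = ((mu_hat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    onehot = np.eye(k)[sq_dist.argmin(axis=1)]            # (n, k)
    p_hat = onehot.mean(axis=0)
    if (p_hat == 0).any():
        raise ValueError("vacant Voronoi cell: derivative matrix is singular")
    # Per-unit estimating function from (13), stacked center by center.
    scores = 2.0 * onehot[:, :, None] * (codebook[None, :, :] - phi1[:, None, :])
    scores = scores.reshape(n, k * p)
    # Inverse of the diagonal derivative matrix (14) with estimated p_j.
    m_inv = np.diag(1.0 / (2.0 * np.repeat(p_hat, p)))
    score_cov = np.atleast_2d(np.cov(scores, rowvar=False))
    cov = m_inv @ score_cov @ m_inv / n
    return np.sqrt(np.diag(cov)).reshape(k, p)
```

With a single cluster this reduces to the usual standard error of a sample mean of `phi1`, which gives a quick sanity check. Wald intervals built from these standard errors rely on the same conditions as the theorem, including Assumption A6.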

5 Illustration

5.1 Simulation Study

In order to assess the performance of the proposed estimators, we conduct a small simulation study. We consider a simplified scenario where a generated codebook forms a natural classifier satisfying the margin condition. As briefly shown in Figure LABEL:fig:experiments-(a), we demonstrate that, as anticipated by our theoretical results, the proposed semiparametric estimator from Section 4 generally has smaller error than the plug-in estimator from Section 3, and achieves parametric $\sqrt{n}$ rates even with $n^{1/4}$ rates on nuisance estimation. Details and full results are included in Web Appendix A.

5.2 Case Study

Here we apply our method to a real-world dataset that was collected to study the relative effects of three treatment programs for adolescent substance abuse, i.e., community ($A=1$), MET&CBT-5 ($A=2$), and SCY ($A=3$) (McCaffrey et al. 2013; Burgette et al. 2017). For illustration purposes, we use a subset of the data that is publicly available via the twang R package. The dataset consists of $600$ samples, $200$ youths for each treatment, and $5$ covariates including age, ethnicity, and criminal history. Our outcome is the program effectiveness score, where higher scores indicate reduced frequency of substance use.

We use the proposed semiparametric estimator with $K=2$ splits, using the gradient descent algorithm for optimization. For nonparametric estimation we used the cross-validation-based Super Learner ensemble (Van der Laan et al. 2007) to combine regression splines, support vector machine regression, and random forests. The Elbow method indicates that $k=4$ can be a reasonable choice. Figure LABEL:fig:experiments-(b) displays the four clusters in the counterfactual mean vector space, revealing a substantial degree of heterogeneity. In Figure LABEL:fig:experiments-(c), we also present the density plots for the pairwise CATE estimates $\widehat{\tau}_{2,1}$ and $\widehat{\tau}_{3,1}$ across different clusters. This helps to understand how units in each cluster respond differently to a specific treatment. For instance, for Cluster 2, the traditional community program is more effective than the MET&CBT-5, while there is no significant difference between the community and SCY programs. On the other hand, for units in Cluster 4, the MET&CBT-5 is moderately more successful than the community program, whereas the SCY is significantly less effective.
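A cluster-wise summary of this kind can be sketched as follows. This is an illustration under our own assumptions rather than the analysis code: it takes a hypothetical $n\times 3$ array `mu_hat` of estimated counterfactual means (columns ordered as $A=1,2,3$) and an estimated codebook, and it forms the pairwise contrasts as simple differences of the columns of `mu_hat`, which may differ from how $\widehat{\tau}_{2,1}$ and $\widehat{\tau}_{3,1}$ were actually estimated.

```python
import numpy as np


def cate_by_cluster(mu_hat, codebook):
    """Assign units to clusters and average pairwise contrasts within each."""
    mu_hat = np.asarray(mu_hat, dtype=float)      # (n, 3): columns A = 1, 2, 3
    codebook = np.asarray(codebook, dtype=float)  # (k, 3) estimated centers
    sq_dist = ((mu_hat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dist.argmin(axis=1)               # 0-based cluster index
    tau_21 = mu_hat[:, 1] - mu_hat[:, 0]          # A = 2 versus A = 1
    tau_31 = mu_hat[:, 2] - mu_hat[:, 0]          # A = 3 versus A = 1
    summary = {}
    for j in range(codebook.shape[0]):
        members = labels == j
        if members.any():                         # skip vacant cells
            summary[j] = {
                "size": int(members.sum()),
                "tau_21": float(tau_21[members].mean()),
                "tau_31": float(tau_31[members].mean()),
            }
    return labels, summary
```

The returned `labels` can be used to group the unit-level contrasts for density plots; cluster indices here start at 0, so index `j` corresponds to Cluster `j + 1` in the text's numbering only if the centers are stored in that order.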

6 Discussion

In this paper, we propose a new framework for analyzing treatment effect heterogeneity by leveraging tools in cluster analysis. We provide flexible nonparametric estimators for a wide class of models. The proposed methods allow for the discovery of subgroup structure in studies with multiple treatments or outcomes. Our framework is extensible to clustering with generic pseudo-outcomes, such as partially observed outcomes or unknown functionals.

Our findings open up many opportunities for future work. In an upcoming companion paper, we consider kernel-based undersmoothing approaches for causal k-means clustering, which do not require the margin condition. Much more work is required to extend causal clustering to other widely-used clustering algorithms, such as density-based clustering and hierarchical clustering. Different algorithms rely on different assumptions about the data, necessitating distinct analyses. Connecting to prescriptive methods, such as optimal treatment regimes, and to other settings involving, for example, time-varying treatments, instrumental variables, or mediation would also be promising directions for future research.

References

  • Arthur & Vassilvitskii (2007) Arthur, D. & Vassilvitskii, S. (2007), k-means++: The advantages of careful seeding, in ‘Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms’, Society for Industrial and Applied Mathematics, pp. 1027–1035.
  • Athey & Imbens (2016) Athey, S. & Imbens, G. (2016), ‘Recursive partitioning for heterogeneous causal effects’, Proceedings of the National Academy of Sciences 113(27), 7353–7360.
  • Biau et al. (2008) Biau, G., Devroye, L. & Lugosi, G. (2008), ‘On the performance of clustering in hilbert spaces’, IEEE Transactions on Information Theory 54(2), 781–790.
  • Bottou & Bengio (1994) Bottou, L. & Bengio, Y. (1994), ‘Convergence properties of the k-means algorithms’, Advances in neural information processing systems 7.
  • Burgette et al. (2017) Burgette, L., Griffin, B. A. & McCaffrey, D. (2017), ‘Propensity scores for multiple treatments: A tutorial for the mnps function in the twang package’, R package. Rand Corporation .
  • Chernozhukov et al. (2017) Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C. & Newey, W. (2017), ‘Double/debiased/neyman machine learning of treatment effects’, American Economic Review 107(5), 261–65.
  • Chernozhukov et al. (2016) Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C. & Newey, W. K. (2016), Double machine learning for treatment and causal parameters, Technical report, cemmap working paper.
  • Chernozhukov et al. (2018) Chernozhukov, V., Demirer, M., Duflo, E. & Fernandez-Val, I. (2018), Generic machine learning inference on heterogeneous treatment effects in randomized experiments, with an application to immunization in india, Technical report, National Bureau of Economic Research.
  • Chernozhukov & Hansen (2005) Chernozhukov, V. & Hansen, C. (2005), ‘An iv model of quantile treatment effects’, Econometrica 73(1), 245–261.
  • Devroye et al. (2013) Devroye, L., Györfi, L. & Lugosi, G. (2013), A probabilistic theory of pattern recognition, Vol. 31, Springer Science & Business Media.
  • Ding & He (2004) Ding, C. & He, X. (2004), K-means clustering via principal component analysis, in ‘Proceedings of the twenty-first international conference on Machine learning’, ACM, p. 29.
  • Foster et al. (2011) Foster, J. C., Taylor, J. M. & Ruberg, S. J. (2011), ‘Subgroup identification from randomized clinical trial data’, Statistics in medicine 30(24), 2867–2880.
  • Giné & Nickl (2021) Giné, E. & Nickl, R. (2021), Mathematical foundations of infinite-dimensional statistical models, Cambridge university press.
  • Graf & Luschgy (2007) Graf, S. & Luschgy, H. (2007), Foundations of quantization for probability distributions, Springer.
  • Grimmer et al. (2017) Grimmer, J., Messing, S. & Westwood, S. J. (2017), ‘Estimating heterogeneous treatment effects and the effects of heterogeneous treatments with ensemble methods’, Political Analysis 25(4), 413–434.
  • Haviland et al. (2011) Haviland, A. M., Jones, B. L. & Nagin, D. S. (2011), ‘Group-based trajectory modeling extended to account for nonrandom participant attrition’, Sociological Methods & Research 40(2), 367–390.
  • Hayden (2009) Hayden, E. C. (2009), ‘Personalized cancer therapy gets closer’.
  • Imai et al. (2013) Imai, K., Ratkovic, M. et al. (2013), ‘Estimating treatment effect heterogeneity in randomized program evaluation’, The Annals of Applied Statistics 7(1), 443–470.
  • Imbens & Rubin (2015) Imbens, G. W. & Rubin, D. B. (2015), Causal inference in statistics, social, and biomedical sciences, Cambridge University Press.
  • Jain (2010) Jain, A. K. (2010), ‘Data clustering: 50 years beyond k-means’, Pattern recognition letters 31(8), 651–666.
  • Kanungo et al. (2002) Kanungo, T., Mount, D. M., Netanyahu, N. S., Piatko, C. D., Silverman, R. & Wu, A. Y. (2002), ‘An efficient k-means clustering algorithm: Analysis and implementation’, IEEE Transactions on Pattern Analysis & Machine Intelligence (7), 881–892.
  • Kennedy et al. (2023) Kennedy, E., Balakrishnan, S. & Wasserman, L. (2023), ‘Semiparametric counterfactual density estimation’, Biometrika p. asad017.
  • Kennedy (2016) Kennedy, E. H. (2016), Semiparametric theory and empirical processes in causal inference, in ‘Statistical causal inferences and their applications in public health research’, Springer, pp. 141–167.
  • Kennedy (2022) Kennedy, E. H. (2022), ‘Semiparametric doubly robust targeted double machine learning: a review’, arXiv preprint arXiv:2203.06469 .
  • Kennedy (2023) Kennedy, E. H. (2023), ‘Towards optimal doubly robust estimation of heterogeneous causal effects’, Electronic Journal of Statistics 17(2), 3008–3049.
  • Kennedy et al. (2018) Kennedy, E. H., Balakrishnan, S. & G’Sell, M. (2018), ‘Sharp instruments for classifying compliers and generalizing causal effects’, arXiv preprint arXiv:1801.03635 .
  • Kim & Zubizarreta (2023) Kim, K. & Zubizarreta, J. R. (2023), Fair and robust estimation of heterogeneous treatment effects for policy learning, in ‘Proceedings of the 40th International Conference on Machine Learning’, Vol. 202 of Proceedings of Machine Learning Research, PMLR, pp. 16997–17014.
  • Kravitz et al. (2004) Kravitz, R. L., Duan, N. & Braslow, J. (2004), ‘Evidence-based medicine, heterogeneity of treatment effects, and the trouble with averages’, The Milbank Quarterly 82(4), 661–687.
  • Kumar & Patel (2007) Kumar, M. & Patel, N. R. (2007), ‘Clustering data with measurement errors’, Computational Statistics & Data Analysis 51(12), 6084–6101.
  • Künzel et al. (2017) Künzel, S. R., Sekhon, J. S., Bickel, P. J. & Yu, B. (2017), ‘Meta-learners for estimating heterogeneous treatment effects using machine learning’, arXiv preprint arXiv:1706.03461 .
  • Leskovec et al. (2020) Leskovec, J., Rajaraman, A. & Ullman, J. D. (2020), Mining of massive data sets, Cambridge university press.
  • Levis et al. (2023) Levis, A. W., Bonvini, M., Zeng, Z., Keele, L. & Kennedy, E. H. (2023), ‘Covariate-assisted bounds on causal effects with instrumental variables’, arXiv preprint arXiv:2301.12106 .
  • Levrard (2015) Levrard, C. (2015), ‘Nonasymptotic bounds for vector quantization in hilbert spaces’, The Annals of Statistics pp. 592–619.
  • Levrard (2018) Levrard, C. (2018), ‘Quantization/clustering: when and why does $k$-means work?’, Journal de la société française de statistique 159(1), 1–26.
  • Linder et al. (1994) Linder, T., Lugosi, G. & Zeger, K. (1994), ‘Rates of convergence in the source coding theorem, in empirical quantizer design, and in universal lossy source coding’, IEEE Transactions on Information Theory 40(6), 1728–1740.
  • Lloyd (1982) Lloyd, S. (1982), ‘Least squares quantization in pcm’, IEEE transactions on information theory 28(2), 129–137.
  • Luedtke & Van Der Laan (2016) Luedtke, A. R. & Van Der Laan, M. J. (2016), ‘Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy’, Annals of statistics 44(2), 713.
  • McCaffrey et al. (2013) McCaffrey, D. F., Griffin, B. A., Almirall, D., Slaughter, M. E., Ramchand, R. & Burgette, L. F. (2013), ‘A tutorial on propensity score estimation for multiple treatments using generalized boosted models’, Statistics in medicine 32(19), 3388–3414.
  • Newey & Robins (2018) Newey, W. K. & Robins, J. R. (2018), ‘Cross-fitting and fast remainder rates for semiparametric estimation’, arXiv preprint arXiv:1801.09138 .
  • Nie & Wager (2021) Nie, X. & Wager, S. (2021), ‘Quasi-oracle estimation of heterogeneous treatment effects’, Biometrika 108(2), 299–319.
  • Pollard (1981) Pollard, D. (1981), ‘Strong consistency of k-means clustering’, The Annals of Statistics pp. 135–140.
  • Pollard (1982) Pollard, D. (1982), ‘A central limit theorem for $k$-means clustering’, The Annals of Probability 10(4), 919–926.
  • Robins et al. (2008) Robins, J., Li, L., Tchetgen, E., van der Vaart, A. et al. (2008), Higher order influence functions and minimax estimation of nonlinear functionals, in ‘Probability and statistics: essays in honor of David A. Freedman’, Institute of Mathematical Statistics, pp. 335–421.
  • Rubin (1974) Rubin, D. B. (1974), ‘Estimating causal effects of treatments in randomized and nonrandomized studies.’, Journal of Educational Psychology 66(5), 688.
  • Serafini et al. (2020) Serafini, A., Murphy, T. B. & Scrucca, L. (2020), ‘Handling missing data in model-based clustering’, arXiv preprint arXiv:2006.02954 .
  • Shalit et al. (2017) Shalit, U., Johansson, F. D. & Sontag, D. (2017), Estimating individual treatment effect: generalization bounds and algorithms, in ‘International conference on machine learning’, PMLR, pp. 3076–3085.
  • Su et al. (2018) Su, Y., Reedy, J. & Carroll, R. J. (2018), ‘Clustering in general measurement error models’, Statistica Sinica 28(4), 2337.
  • Tseng & Wong (2005) Tseng, G. C. & Wong, W. H. (2005), ‘Tight clustering: a resampling-based approach for identifying stable and tight patterns in data’, Biometrics 61(1), 10–16.
  • Tsiatis (2007) Tsiatis, A. (2007), Semiparametric theory and missing data, Springer Science & Business Media.
  • Van der Laan et al. (2003) Van der Laan, M. J., Laan, M. & Robins, J. M. (2003), Unified methods for censored longitudinal data and causality, Springer Science & Business Media.
  • van der Laan & Luedtke (2015) van der Laan, M. J. & Luedtke, A. R. (2015), ‘Targeted learning of the mean outcome under an optimal dynamic treatment rule’, Journal of causal inference 3(1), 61–95.
  • Van der Laan et al. (2007) Van der Laan, M. J., Polley, E. C. & Hubbard, A. E. (2007), ‘Super learner’, Statistical applications in genetics and molecular biology 6(1).
  • van der Vaart (2002) van der Vaart, A. (2002), Semiparametric statistics, number 1781 in ‘Lecture Notes in Math.’, Springer, pp. 331–457. MR1915446.
  • Van der Vaart (2000) Van der Vaart, A. W. (2000), Asymptotic statistics, Vol. 3, Cambridge university press.
  • Van Der Vaart & Wellner (1996) Van Der Vaart, A. W. & Wellner, J. A. (1996), Weak convergence, in ‘Weak convergence and empirical processes’, Springer, pp. 16–28.
  • VanderWeele (2017) VanderWeele, T. J. (2017), ‘Outcome-wide epidemiology’, Epidemiology (Cambridge, Mass.) 28(3), 399.
  • VanderWeele et al. (2016) VanderWeele, T. J., Li, S., Tsai, A. C. & Kawachi, I. (2016), ‘Association between religious service attendance and lower suicide rates among US women’, JAMA Psychiatry 73(8), 845–851.
  • Wager & Athey (2018) Wager, S. & Athey, S. (2018), ‘Estimation and inference of heterogeneous treatment effects using random forests’, Journal of the American Statistical Association 113(523), 1228–1242.
  • Zhang et al. (2017) Zhang, W., Le, T. D., Liu, L., Zhou, Z.-H. & Li, J. (2017), ‘Mining heterogeneous causal effects for personalized cancer treatment’, Bioinformatics 33(15), 2372–2378.
  • Zhang et al. (2012) Zhang, Z., Chen, Z., Troendle, J. F. & Zhang, J. (2012), ‘Causal inference on quantiles with an obstetric application’, Biometrics 68(3), 697–706.
  • Zheng & Van Der Laan (2010) Zheng, W. & Van Der Laan, M. J. (2010), ‘Asymptotic theory for cross-validated targeted maximum likelihood estimation’, Working Paper 273 .

Web Appendix

Appendix A Simulation Study Details

We consider a simple data generating process as follows. First, we fix $k, p$, each of which is randomly drawn from the set $\{2,\ldots,10\}$. Then we randomly pick $k$ points in the bounded hypercube $[0,1]^{p}$ under the constraint that every pairwise Euclidean distance between two cluster centers is greater than $0.2$. The set of these $k$ points is taken as our true codebook $C^{*}=\{c^{*}_{1},\ldots,c^{*}_{k}\}$; consequently $C^{*}$ defines the associated Voronoi cells. To assign roughly equal numbers of units to each $c^{*}_{j}$, for each unit $i=1,\ldots,n$, we draw a label $I\in\{1,\ldots,k\}$ from a multinomial distribution: $I\sim\text{mult}(p_{1},\ldots,p_{k})$ with $p_{1}=\cdots=p_{k}=1/k$.
Given this label information, we set $\mu=c^{*}_{I}+\epsilon^{\text{truc}}$, where $\epsilon^{\text{truc}}$ follows a truncated normal distribution of $N(0,1/2)$ with the threshold of $\underset{j}{\min}\,d(c^{*}_{I},c^{*}_{j})/2-0.01$. This guarantees that the nearest center for units with label $j$ in the counterfactual mean vector space is always $c^{*}_{j}$, and that the margin condition holds. Next, we model our observed data generating process by $A\sim\text{mult}(\pi_{1},\ldots,\pi_{p})$ and $Y=\mu_{A}+Z$, where $\pi_{1}=\cdots=\pi_{p}=1/p$ and $Z\sim N(0,1)$.
Finally, we let $\widehat{\mu}_{a}=\mu_{a}+\xi$ and $\widehat{\pi}_{a}=\pi_{a}+\zeta$, where $\xi\sim N(0,n^{-(r+0.01)})$ and $\zeta\sim N(0,n^{-(r+0.01)})$, respectively, which ensures that $\|\widehat{\mu}_{a}-\mu_{a}\|=o_{\mathbb{P}}(n^{-r})$ and $\|\widehat{\pi}_{a}-\pi_{a}\|=o_{\mathbb{P}}(n^{-r})$.
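As a rough sketch of the data generating process described above, the following Python (standard library only) draws the codebook, the labels, the counterfactual mean vectors, the observed $(A, Y)$, and the perturbed nuisance values. Several choices are assumptions rather than details taken from the text: centers are accepted one at a time by rejection; the minimum in the threshold is taken over $j \neq I$; the truncation is applied coordinate-wise with bound equal to the threshold divided by $\sqrt{p}$, so that the Euclidean norm of the noise stays within the threshold; the second argument of $N(\cdot,\cdot)$ is read as a standard deviation; and the names `draw_codebook`, `truncated_normal`, and `simulate` are hypothetical.

```python
import math
import random
from statistics import NormalDist


def draw_codebook(k, p, rng, min_dist=0.2):
    """Draw k centers in [0,1]^p one at a time, rejecting any candidate
    that lies within min_dist of an already accepted center."""
    centers = []
    while len(centers) < k:
        cand = [rng.random() for _ in range(p)]
        if all(math.dist(cand, c) > min_dist for c in centers):
            centers.append(cand)
    return centers


def truncated_normal(sd, bound, rng):
    """Draw from N(0, sd^2) restricted to [-bound, bound] by inverting the CDF."""
    dist = NormalDist(0.0, sd)
    u = rng.uniform(dist.cdf(-bound), dist.cdf(bound))
    # Clamp to guard against floating-point overshoot at the endpoints.
    return max(-bound, min(bound, dist.inv_cdf(u)))


def simulate(n, k, p, r, seed=0):
    """Return the codebook and one record per unit:
    (label, mu, A, Y, mu_hat, pi_hat)."""
    rng = random.Random(seed)
    centers = draw_codebook(k, p, rng)
    noise_sd = n ** -(r + 0.01)  # assumption: read as a standard deviation
    units = []
    for _ in range(n):
        label = rng.randrange(k)  # I ~ mult(1/k, ..., 1/k), coded 0..k-1
        gap = min(math.dist(centers[label], c)
                  for j, c in enumerate(centers) if j != label)
        radius = gap / 2 - 0.01  # truncation threshold from the text
        bound = radius / math.sqrt(p)  # assumption: coordinate-wise truncation
        mu = [c + truncated_normal(0.5, bound, rng) for c in centers[label]]
        a = rng.randrange(p)  # A ~ mult(1/p, ..., 1/p), coded 0..p-1
        y = mu[a] + rng.gauss(0.0, 1.0)  # Y = mu_A + Z
        mu_hat = [m + rng.gauss(0.0, noise_sd) for m in mu]
        pi_hat = [1.0 / p + rng.gauss(0.0, noise_sd) for _ in range(p)]
        units.append((label, mu, a, y, mu_hat, pi_hat))
    return centers, units
```

Under the coordinate-wise bound, each $\mu$ stays strictly closer to its own center than to any other, which is the property the truncation is meant to guarantee.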

We randomly pick $50$ different pairs of $(k,p)$ and vary the sample size $n$ from $250$ to $10{,}000$ for each $(k,p)$. For each $(k,p,n)$ tuple, we generate data according to the process specified above, and then compute $\widehat{C}$ and the corresponding risk using the plug-in estimator from Section 3, as well as the semiparametric estimator from Section 4. We use $K=2$ splits and the gradient descent algorithm for optimization. We run the simulation $500$ times for each $(k,p,n)$ at two different nuisance rates, $r=1/2$ and $r=1/4$. Results are presented in Figure LABEL:fig:app-sim.
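The two computational ingredients named in this paragraph, sample splitting and gradient descent on the k-means risk, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the reported results: the function names, the fixed step size `lr`, the iteration count, and the use of user-supplied initial centers are all hypothetical choices.

```python
import random


def split_folds(n, n_folds=2, seed=0):
    """Randomly partition the indices 0..n-1 into n_folds disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]


def nearest(x, centers):
    """Index of the center closest to x in squared Euclidean distance."""
    return min(
        range(len(centers)),
        key=lambda j: sum((xi - ci) ** 2 for xi, ci in zip(x, centers[j])),
    )


def kmeans_gradient_descent(points, init_centers, lr=0.5, n_iter=200):
    """Minimize the empirical risk (1/n) sum_i min_j ||x_i - c_j||^2
    by plain gradient descent on the centers."""
    centers = [list(c) for c in init_centers]
    n, dim = len(points), len(points[0])
    for _ in range(n_iter):
        grad = [[0.0] * dim for _ in centers]
        for x in points:
            j = nearest(x, centers)  # only the nearest center gets a gradient
            for d in range(dim):
                grad[j][d] += 2.0 * (centers[j][d] - x[d]) / n
        for j in range(len(centers)):
            for d in range(dim):
                centers[j][d] -= lr * grad[j][d]
    return centers
```

For a plug-in style estimator, `points` would hold the estimated vectors $\widehat{\mu}$ evaluated on one fold, with $\widehat{\mu}$ itself fitted on the other fold returned by `split_folds`.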

For both fast ($r=1/2$) and slow ($r=1/4$) rates at which the nuisance functions are estimated, the performance of the proposed semiparametric estimator improves as $n$ grows, nearly at $n^{1/2}$ rates. On the other hand, the plug-in estimator performs far worse at the slow nuisance estimation rate, as it is no longer expected to converge at $n^{1/2}$ rates. Hence, the simulation results validate our theoretical findings in Sections 3 and 4, and support our recommendation to use the proposed semiparametric estimator described in Section 4 in practice.
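One common way to read an empirical rate off simulation output is to regress the log of the average error on the log of the sample size; a slope near $-1/2$ is what root-$n$ convergence would look like. The helper below is a hypothetical illustration of that check and is not described in the text; `sample_sizes` and `errors` stand for whatever summary of the simulation one chooses to record.

```python
import math


def loglog_slope(sample_sizes, errors):
    """Least-squares slope of log(error) regressed on log(n)."""
    xs = [math.log(n) for n in sample_sizes]
    ys = [math.log(e) for e in errors]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den
```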

Appendix B Proofs

Notation Guide. Hereafter, we let $\|f\|$ denote the $L_{2}(\mathbb{P})$-norm in order to simplify notation and avoid any confusion with the Euclidean norm $\|\cdot\|_{2}$, as the $L_{2}(\mathbb{P})$-norm is used most frequently in the proofs. For simplicity, we drop the dependence on $Z$ if the context is clear. Also, for any fixed $C$, we let $f_{C}(x)=\|x-\Pi_{C}(x)\|_{2}^{2}$ for $x\in\mathbb{R}^{p}$ so that $R(C)=\mathbb{E}\{f_{C}(\mu)\}$, and let

\begin{align*}
f_{c_{j}}(\mu) &= \|\mu-c_{j}\|_{2}^{2}, \\
\varphi_{c_{j}}(\eta) &= \sum_{a}\left\{\varphi_{2,a}(\eta)-2\varphi_{1,a}(\eta)c_{ja}+c_{ja}^{2}\right\},\quad j=1,\ldots,k.
\end{align*}

Further, we let $\zeta_{j}(\mu;C)=\underset{j\neq i}{\min}\|\mu-c_{i}\|_{2}-\|\mu-c_{j}\|_{2}$ so that $\mathbb{P}\left\{|\zeta_{j}(\mu;C^{*})|\leq t\bigm|0\leq t\leq\kappa\right\}\lesssim t^{\alpha}$ under the margin condition for any $\alpha>0$, $\kappa>0$. With a slight abuse of notation, we write $\|\widehat{C}-C^{*}\|_{1}=\sum_{j=1}^{k}\|\widehat{c}_{j}-c^{*}_{j}\|_{1}$.
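To make the notation concrete, the quantities $\Pi_{C}$, $f_{C}$, $\zeta_{j}$, and the sample analogue of $R(C)$ can be written out directly. The snippet below is only an illustrative sketch; the function names are hypothetical, and ties between centers are broken by list order, which the text does not specify.

```python
import math


def project(x, codebook):
    """Pi_C(x): the center in the codebook closest to x."""
    return min(codebook, key=lambda c: math.dist(x, c))


def f_C(x, codebook):
    """f_C(x) = ||x - Pi_C(x)||_2^2."""
    return math.dist(x, project(x, codebook)) ** 2


def zeta(j, x, codebook):
    """zeta_j(x; C) = min_{i != j} ||x - c_i||_2 - ||x - c_j||_2.
    It is positive exactly when c_j is the unique nearest center."""
    others = min(math.dist(x, c) for i, c in enumerate(codebook) if i != j)
    return others - math.dist(x, codebook[j])


def empirical_risk(points, codebook):
    """Sample average of f_C over the given points."""
    return sum(f_C(x, codebook) for x in points) / len(points)
```

With these definitions, the projection can also be written as $\Pi_{C}(x)=\sum_{j}c_{j}\mathbbm{1}\{\zeta_{j}(x;C)>0\}$ whenever the nearest center is unique, which is the representation used in the proof of Lemma B.1.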

B.1 Proof of Theorem 3.1

Before proving Theorem 3.1, we present the three following lemmas.

Lemma B.1.

Suppose that Assumption A1 holds, and $\mathbb{P}$ satisfies the margin condition with some $\kappa>0$, $\alpha>0$. Then we have

\begin{align*}
\mathbb{P}\left|\Pi_{C^{*},a}(\widehat{\mu})-\Pi_{C^{*},a}(\mu)\right| &\lesssim \max_{j}\left\|\mathbbm{1}\left\{\zeta_{j}(\widehat{\mu};C^{*})>0\right\}-\mathbbm{1}\left\{\zeta_{j}(\mu;C^{*})>0\right\}\right\|_{\infty}\sum_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\mathbb{P},1} \\
&\quad+\sum_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}^{\alpha},
\end{align*}

where $\Pi_{C^{*},a}(\cdot)$ denotes the $a$-th coordinate of $\Pi_{C^{*}}(\cdot)$.

Proof.

Recall that $\zeta_{j}(\mu;C)=\underset{j\neq i}{\min}\|\mu-c_{i}\|_{2}-\|\mu-c_{j}\|_{2}$. Letting $\zeta_{j}\equiv\zeta_{j}(\mu;C^{*})$ and $\widehat{\zeta}_{j}\equiv\zeta_{j}(\widehat{\mu};C^{*})$, we have

\begin{align*}
&\mathbb{P}\left|\Pi_{C^{*},a}(\widehat{\mu})-\Pi_{C^{*},a}(\mu)\right| \\
&=\mathbb{P}\left(\sum_{j}c^{*}_{j,a}\left|\mathbbm{1}\left\{\widehat{\zeta}_{j}>0\right\}-\mathbbm{1}\left\{\zeta_{j}>0\right\}\right|\right) \\
&=\sum_{j}c^{*}_{j,a}\mathbb{P}\left(\left|\mathbbm{1}\left\{\widehat{\zeta}_{j}>0\right\}-\mathbbm{1}\left\{\zeta_{j}>0\right\}\right|\left[\mathbbm{1}\left\{\left|\widehat{\zeta}_{j}-\zeta_{j}\right|\leq\kappa\right\}+\mathbbm{1}\left\{\left|\widehat{\zeta}_{j}-\zeta_{j}\right|>\kappa\right\}\right]\right).
\end{align*}

On the one hand, by iterated expectation we have that

\begin{align*}
&\sum_{j}c^{*}_{j,a}\mathbb{P}\left[\left|\mathbbm{1}\left\{\widehat{\zeta}_{j}>0\right\}-\mathbbm{1}\left\{\zeta_{j}>0\right\}\right|\mathbbm{1}\left\{\left|\widehat{\zeta}_{j}-\zeta_{j}\right|\leq\kappa\right\}\right] \\
&\leq\sum_{j}c^{*}_{j,a}\mathbb{P}\left[\mathbb{P}\left(\mathbbm{1}\left\{|\zeta_{j}|\leq\left|\widehat{\zeta}_{j}-\zeta_{j}\right|\right\}\Bigm|\left|\widehat{\zeta}_{j}-\zeta_{j}\right|\leq\kappa\right)\mathbbm{1}\left\{\left|\widehat{\zeta}_{j}-\zeta_{j}\right|\leq\kappa\right\}\right] \\
&\leq\sum_{j}c^{*}_{j,a}\mathbb{P}\left[\mathbb{P}\left\{|\zeta_{j}|\leq\left|\widehat{\zeta}_{j}-\zeta_{j}\right|\Bigm|\left|\widehat{\zeta}_{j}-\zeta_{j}\right|\leq\kappa\right\}\right] \\
&\lesssim\sum_{j}\|\widehat{\zeta}_{j}-\zeta_{j}\|_{\infty}^{\alpha},
\end{align*}

where the last inequality follows by the margin condition. Similarly,

\begin{align*}
&\sum_{j}c^{*}_{j,a}\mathbb{P}\left[\left|\mathbbm{1}\left\{\widehat{\zeta}_{j}>0\right\}-\mathbbm{1}\left\{\zeta_{j}>0\right\}\right|\mathbbm{1}\left\{\left|\widehat{\zeta}_{j}-\zeta_{j}\right|>\kappa\right\}\right] \\
&\leq\sum_{j}c^{*}_{j,a}\left\|\mathbbm{1}\left\{\widehat{\zeta}_{j}>0\right\}-\mathbbm{1}\left\{\zeta_{j}>0\right\}\right\|_{\infty}\mathbb{P}\left[\mathbbm{1}\left\{\left|\widehat{\zeta}_{j}-\zeta_{j}\right|>\kappa\right\}\right] \\
&\lesssim\sum_{j}c^{*}_{j,a}\left\|\mathbbm{1}\left\{\widehat{\zeta}_{j}>0\right\}-\mathbbm{1}\left\{\zeta_{j}>0\right\}\right\|_{\infty}\left(\sum_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\mathbb{P},1}\right),
\end{align*}

where the first and second inequalities follow by Hölder’s and Markov’s inequalities, respectively, and the fact that each $\zeta_{j}$ is Lipschitz at $\mu$.
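The elementary facts behind these two bounds can be spelled out as a sketch. The explicit constants $2$ and $1/\kappa$ are bookkeeping added here and are absorbed into $\lesssim$. The two indicators can differ only when $\zeta_{j}$ and $\widehat{\zeta}_{j}$ fall on opposite sides of zero, and $\zeta_{j}(\cdot\,;C^{*})$ is a difference of two functions that are each $1$-Lipschitz in the Euclidean norm, so that

```latex
\begin{align*}
\left|\mathbbm{1}\{\widehat{\zeta}_{j}>0\}-\mathbbm{1}\{\zeta_{j}>0\}\right|
  &\leq \mathbbm{1}\left\{|\zeta_{j}|\leq|\widehat{\zeta}_{j}-\zeta_{j}|\right\}, \\
|\widehat{\zeta}_{j}-\zeta_{j}|
  &\leq 2\|\widehat{\mu}-\mu\|_{2}
   \leq 2\sum_{a}|\widehat{\mu}_{a}-\mu_{a}|, \\
\mathbb{P}\left[\mathbbm{1}\{|\widehat{\zeta}_{j}-\zeta_{j}|>\kappa\}\right]
  &\leq \frac{1}{\kappa}\,\mathbb{P}|\widehat{\zeta}_{j}-\zeta_{j}|
   \leq \frac{2}{\kappa}\sum_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\mathbb{P},1}.
\end{align*}
```

The first line is what allows the margin condition to be applied on the event $\{|\widehat{\zeta}_{j}-\zeta_{j}|\leq\kappa\}$, and the last line is the Markov step used on its complement.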

Putting the two pieces together, we finally obtain that

\begin{align*}
&\mathbb{P}\left|\Pi_{C^{*},a}(\widehat{\mu})-\Pi_{C^{*},a}(\mu)\right| \\
&\lesssim\max_{j}\left\|\mathbbm{1}\left\{\zeta_{j}(\widehat{\mu};C^{*})>0\right\}-\mathbbm{1}\left\{\zeta_{j}(\mu;C^{*})>0\right\}\right\|_{\infty}\sum_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\mathbb{P},1}+\sum_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}^{\alpha}.
\end{align*}

The next lemma shows that one may achieve faster rates for the bias of $f_{C^{*}}(\widehat{\mu})$.

Lemma B.2.

Suppose that Assumption A1 holds and $\mathbb{P}$ satisfies the margin condition with some $\kappa>0$, $\alpha>0$. Then we have

\begin{align*}
&\left|\mathbb{P}\left\{f_{C^{*}}(\widehat{\mu})-f_{C^{*}}(\mu)\right\}\right| \\
&\leq\max_{a}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\mathbb{P},1}+\max_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}^{\alpha+1}+\frac{1}{\kappa}\max_{a}\left(\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\mathbb{P},1}\right).
\end{align*}
Proof.

The proof follows the same logic that we develop in greater detail in the subsequent proof of Lemma B.6 (see Remark B.2). ∎

The following lemma computes the bias of our plug-in risk estimator $\widehat{R}_{n}$.

Lemma B.3.

Suppose $\mathbb{P}$ satisfies the margin condition for some $\kappa>0$, $\alpha>0$. Then under Assumptions A1, A2, we have

\begin{align*}
&\widehat{R}_{n}(C^{*})-R(C^{*}) \\
&=O_{\mathbb{P}}\left(\frac{1}{\sqrt{n}}+\max_{a}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\mathbb{P},1}+\max_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}^{\alpha+1}+\frac{1}{\kappa}\max_{a}\left(\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\mathbb{P},1}\right)\right),
\end{align*}

whenever $\widehat{\mu}$ is constructed from a separate independent sample.

Proof.

It is immediate to see that

$$
\begin{aligned}
\widehat{R}_{n}(C^{*}) - R(C^{*}) &= \mathbb{P}_{n}\left\{f_{C^{*}}(\widehat{\mu})\right\} - \mathbb{E}\left\{f_{C^{*}}(\mu)\right\} \\
&= (\mathbb{P}_{n}-\mathbb{P})\left\{f_{C^{*}}(\widehat{\mu}) - f_{C^{*}}(\mu)\right\} \\
&\quad + (\mathbb{P}_{n}-\mathbb{P})f_{C^{*}}(\mu) + \mathbb{P}\left\{f_{C^{*}}(\widehat{\mu}) - f_{C^{*}}(\mu)\right\},
\end{aligned}
\tag{A.1}
$$

where $f_{C}(x) = \|x - \Pi_{C}(x)\|_{2}^{2}$, $\forall x \in \mathbb{R}^{p}$. The central limit theorem implies $(\mathbb{P}_{n}-\mathbb{P})f_{C^{*}}(\mu) = O_{\mathbb{P}}(n^{-1/2})$. Also, it follows by Lemma B.2 that

$$
\begin{aligned}
&\mathbb{P}\left\{f_{C^{*}}(\widehat{\mu}) - f_{C^{*}}(\mu)\right\} \\
&\quad \leq \max_{a}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\mathbb{P},1} + \max_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}^{\alpha+1} + \frac{1}{\kappa}\max_{a}\left(\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\mathbb{P},1}\right).
\end{aligned}
$$
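The decomposition (A.1) is an exact algebraic identity, so it can be checked numerically. The sketch below is a minimal illustration under stated assumptions, not the authors' code: it replaces the population measure $\mathbb{P}$ by the uniform measure on a finite synthetic population, takes $\mathbb{P}_n$ to be the average over a random subsample, uses a hypothetical fixed codebook `centers` in place of $C^*$, and uses a noisy copy `mu_hat` in place of $\widehat{\mu}$.

```python
import random

def f_C(x, centers):
    """Squared distance from x to its nearest center, ||x - Pi_C(x)||_2^2."""
    return min(sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers)

def mean(values):
    return sum(values) / len(values)

rng = random.Random(0)
centers = [(0.0, 0.0), (1.0, 1.0)]  # hypothetical stand-in for C*

# Finite synthetic population: "P" is the uniform measure on these N units.
N, n = 2000, 200
mu = [(rng.random(), rng.random()) for _ in range(N)]
# Assumed nuisance estimate: the true mu plus small independent Gaussian noise.
mu_hat = [(a + rng.gauss(0, 0.05), b + rng.gauss(0, 0.05)) for a, b in mu]
sample = rng.sample(range(N), n)  # indices that define P_n

f_true = [f_C(x, centers) for x in mu]
f_hat = [f_C(x, centers) for x in mu_hat]
diff = [h - t for h, t in zip(f_hat, f_true)]

def P(v):
    """Population average."""
    return mean(v)

def P_n(v):
    """Average over the subsample."""
    return mean([v[i] for i in sample])

lhs = P_n(f_hat) - P(f_true)             # R_hat_n(C*) - R(C*)
empirical_process = P_n(diff) - P(diff)  # (P_n - P){f(mu_hat) - f(mu)}
clt_term = P_n(f_true) - P(f_true)       # (P_n - P) f(mu)
bias_term = P(diff)                      # P{f(mu_hat) - f(mu)}
```

The three named terms sum to `lhs` up to floating-point error, mirroring the three pieces that the proof bounds separately.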

Further, under Assumption A1, it follows that

$$
\begin{aligned}
f_{C^{*}}(\widehat{\mu}) - f_{C^{*}}(\mu) &= \|\widehat{\mu} - \Pi_{C^{*}}(\widehat{\mu})\|_{2}^{2} - \|\mu - \Pi_{C^{*}}(\mu)\|_{2}^{2} \\
&= \|\widehat{\mu}\|_{2}^{2} - 2\widehat{\mu}^{\top}\Pi_{C^{*}}(\widehat{\mu}) + \|\Pi_{C^{*}}(\widehat{\mu})\|_{2}^{2} \\
&\quad - \left\{\|\mu\|_{2}^{2} - 2\mu^{\top}\Pi_{C^{*}}(\mu) + \|\Pi_{C^{*}}(\mu)\|_{2}^{2}\right\} \\
&= \|\widehat{\mu}\|_{2}^{2} - 2\widehat{\mu}^{\top}\Pi_{C^{*}}(\widehat{\mu}) + 2\widehat{\mu}^{\top}\Pi_{C^{*}}(\mu) + \|\Pi_{C^{*}}(\widehat{\mu})\|_{2}^{2} \\
&\quad - \left\{\|\mu\|_{2}^{2} - 2\mu^{\top}\Pi_{C^{*}}(\mu) + 2\widehat{\mu}^{\top}\Pi_{C^{*}}(\mu) + \|\Pi_{C^{*}}(\mu)\|_{2}^{2}\right\} \\
&\leq 4B\sum_{a}\left\{\left|\widehat{\mu}_{a}-\mu_{a}\right| + \left|\Pi_{C^{*},a}(\widehat{\mu}) - \Pi_{C^{*},a}(\mu)\right|\right\},
\end{aligned}
\tag{A.2}
$$

which, by the triangle inequality, leads to

$$
\|f_{C^{*}}(\widehat{\mu}) - f_{C^{*}}(\mu)\| \lesssim \sum_{a}\left(\|\widehat{\mu}_{a}-\mu_{a}\| + \left\|\Pi_{C^{*},a}(\widehat{\mu}) - \Pi_{C^{*},a}(\mu)\right\|\right).
$$

For the second term in the last display, note that

$$
\begin{aligned}
\mathbb{P}\left\{\Pi_{C^{*},a}(\widehat{\mu}) - \Pi_{C^{*},a}(\mu)\right\}^{2} &= \mathbb{P}\left(\sum_{j}c^{*}_{j,a}\left[\mathbbm{1}\left\{\zeta_{j}(\widehat{\mu})>0\right\} - \mathbbm{1}\left\{\zeta_{j}(\mu)>0\right\}\right]\right)^{2} \\
&\leq \mathbb{P}\left\{\sum_{j}{c^{*}_{j,a}}^{2}\sum_{j}\left[\mathbbm{1}\left\{\zeta_{j}(\widehat{\mu})>0\right\} - \mathbbm{1}\left\{\zeta_{j}(\mu)>0\right\}\right]^{2}\right\} \\
&\lesssim \sum_{j}c^{*}_{j,a}\,\mathbb{P}\left|\mathbbm{1}\left\{\zeta_{j}(\widehat{\mu})>0\right\} - \mathbbm{1}\left\{\zeta_{j}(\mu)>0\right\}\right| \\
&\lesssim \max_{j}\left\|\mathbbm{1}\left\{\zeta_{j}(\widehat{\mu};C^{*})>0\right\} - \mathbbm{1}\left\{\zeta_{j}(\mu;C^{*})>0\right\}\right\|_{\infty}\sum_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\mathbb{P},1} \\
&\quad + \sum_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}^{\alpha},
\end{aligned}
\tag{A.3}
$$

where the last inequality follows by Lemma B.1. Hence, by the given consistency condition in Assumption A2, we get $\|\widehat{\mu}_{a}-\mu_{a}\|_{\mathbb{P},1} = o_{\mathbb{P}}(1)$, $\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}^{\alpha} = o_{\mathbb{P}}(1)$, $\forall \alpha > 0$, and thereby conclude that $\left\|\Pi_{C^{*},a}(\widehat{\mu}) - \Pi_{C^{*},a}(\mu)\right\| = o_{\mathbb{P}}(1)$. Hence, $\|f_{C^{*}}(\widehat{\mu}) - f_{C^{*}}(\mu)\| = o_{\mathbb{P}}(1)$, and by the sample splitting lemma (Kennedy et al. 2018, Lemma 2), we obtain $(\mathbb{P}_{n}-\mathbb{P})\left\{f_{C^{*}}(\widehat{\mu}) - f_{C^{*}}(\mu)\right\} = o_{\mathbb{P}}(n^{-1/2})$.
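The role of the indicator differences in (A.3) can be seen in a one-dimensional, two-center toy case: the nearest center of a point differs between $\mu$ and $\widehat{\mu}$ only if $\mu$ lies within $|\widehat{\mu}-\mu|$ of the Voronoi boundary. The sketch below checks this. The uniform data, the error level `eps`, and the helper names are assumptions made for illustration, and the margin framing is a simplification rather than the statement of Lemma B.1.

```python
import random

def nearest_index(x, centers):
    """Index of the center closest to the scalar x."""
    return min(range(len(centers)), key=lambda j: abs(x - centers[j]))

rng = random.Random(1)
centers = [0.25, 0.75]                    # hypothetical codebook
boundary = (centers[0] + centers[1]) / 2  # Voronoi boundary (the midpoint)
eps = 0.02                                # assumed sup-norm error of mu_hat

mu = [rng.random() for _ in range(5000)]
mu_hat = [m + rng.uniform(-eps, eps) for m in mu]

# A "flip" means the projection picks a different center for mu_hat than for mu.
flipped = [nearest_index(m, centers) != nearest_index(h, centers)
           for m, h in zip(mu, mu_hat)]
# Points within eps of the boundary (tiny slack guards against rounding).
near_boundary = [abs(m - boundary) <= eps + 1e-12 for m in mu]

flip_rate = sum(flipped) / len(mu)
margin_mass = sum(near_boundary) / len(mu)
```

In this toy setting every flipped point lies in the margin region, so `flip_rate` is at most `margin_mass`, which is roughly `2 * eps` for uniform data and therefore shrinks with the estimation error.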

Putting the three pieces back together into (A.1), we obtain the desired bias bound as

$$
\widehat{R}_{n}(C^{*}) - R(C^{*}) = O_{\mathbb{P}}\left(\frac{1}{\sqrt{n}} + \max_{a}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\mathbb{P},1} + \max_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}^{\alpha+1} + \frac{1}{\kappa}\max_{a}\left(\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\mathbb{P},1}\right)\right).
$$

Theorem 3.1 immediately follows by Lemma B.3.

Proof of Theorem 3.1.

Notice that

$$
\begin{aligned}
R(\widehat{C}) - R(C^{*}) &= R(\widehat{C}) - \widehat{R}_{n}(\widehat{C}) + \widehat{R}_{n}(\widehat{C}) - R(C^{*}) \\
&\leq R(\widehat{C}) - \widehat{R}_{n}(\widehat{C}) + \widehat{R}_{n}(C^{*}) - R(C^{*}) \\
&\leq \sup_{C\in\mathcal{C}_{k}}\left|R(C) - \widehat{R}_{n}(C)\right| + \widehat{R}_{n}(C^{*}) - R(C^{*}).
\end{aligned}
\tag{A.4}
$$
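The chain of inequalities in (A.4) can be verified exactly on a finite class of codebooks. The following sketch is an illustration under assumptions, not the authors' procedure: the class $\mathcal{C}_k$ is replaced by a hypothetical grid of one-dimensional codebooks with $k=2$, the population is a finite synthetic set, and `C_hat` is taken to be the minimizer of the plug-in risk over that grid.

```python
import random
from itertools import combinations

def f_C(x, centers):
    """Squared distance from the scalar x to its nearest center."""
    return min((x - c) ** 2 for c in centers)

def risk(points, centers):
    """Average of f_C over the given points."""
    return sum(f_C(x, centers) for x in points) / len(points)

rng = random.Random(2)
N, n = 1000, 100
# Synthetic population with two latent subgroups (an assumption of this sketch).
mu = [rng.choice([0.2, 0.8]) + rng.gauss(0, 0.05) for _ in range(N)]
mu_hat = [m + rng.gauss(0, 0.05) for m in mu]  # assumed noisy nuisance estimate
sample = rng.sample(range(N), n)

# Hypothetical finite class of k = 2 codebooks on a grid (stands in for C_k).
grid = [i / 10 for i in range(11)]
codebooks = list(combinations(grid, 2))

R = {C: risk(mu, C) for C in codebooks}                               # population risk
R_hat = {C: risk([mu_hat[i] for i in sample], C) for C in codebooks}  # plug-in risk

C_star = min(codebooks, key=R.get)     # oracle codebook
C_hat = min(codebooks, key=R_hat.get)  # plug-in risk minimizer

excess = R[C_hat] - R[C_star]
sup_dev = max(abs(R[C] - R_hat[C]) for C in codebooks)
bound = sup_dev + R_hat[C_star] - R[C_star]
```

Because `C_hat` minimizes `R_hat`, the excess risk `excess` never exceeds `bound`, which is the right-hand side of (A.4) evaluated on the grid.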

Here the first inequality uses that $\widehat{C}$ minimizes $\widehat{R}_{n}$ over $\mathcal{C}_{k}$, so that $\widehat{R}_{n}(\widehat{C}) \leq \widehat{R}_{n}(C^{*})$. Since $\left\|\mu\right\|_{2} < \infty$ a.s., Linder et al. (1994, Theorem 1) implies the following bound for the first term in (A.4):

$$
\sup_{C\in\mathcal{C}_{k}}\left|R(C) - R_{n}(C)\right| = O_{\mathbb{P}}\left(\sqrt{\frac{\log n}{n}}\right).
\tag{A.5}
$$

For the second term, we have

$$
\widehat{R}_{n}(C^{*}) - R(C^{*}) = O_{\mathbb{P}}\left(\frac{1}{\sqrt{n}} + \max_{a}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\mathbb{P},1} + \max_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}^{\alpha+1} + \frac{1}{\kappa}\max_{a}\left(\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\mathbb{P},1}\right)\right),
\tag{A.6}
$$

due to Lemma B.3.

Hence we obtain that

$$
R(\widehat{C}) - R(C^{*}) = O_{\mathbb{P}}\left(\sqrt{\frac{\log n}{n}} + \max_{a}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\mathbb{P},1} + \max_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}^{\alpha+1} + \frac{1}{\kappa}\max_{a}\left(\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\mathbb{P},1}\right)\right).
$$
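To get a feel for which term drives this rate, one can evaluate the four terms for given nuisance errors. The helper below is a rough calculator with all hidden constants set to 1, which is an assumption since the $O_{\mathbb{P}}$ statement fixes no constants. The function name and the example inputs (nuisance errors of order $n^{-1/4}$, $\alpha = 1$, $\kappa = 0.5$) are hypothetical.

```python
import math

def excess_risk_rate(n, l1_err, sup_err, alpha, kappa):
    """Sum of the four terms inside the O_P(.) bound, with constants set to 1.

    l1_err  stands for max_a ||mu_hat_a - mu_a||_{P,1}
    sup_err stands for max_a ||mu_hat_a - mu_a||_inf
    """
    return (math.sqrt(math.log(n) / n)
            + l1_err
            + sup_err ** (alpha + 1)
            + sup_err * l1_err / kappa)

# Hypothetical scenario: both nuisance errors decay like n^(-1/4).
n = 10_000
err = n ** -0.25  # approximately 0.1
terms = {
    "sqrt(log n / n)": math.sqrt(math.log(n) / n),
    "L1 error": err,
    "sup error^(alpha+1)": err ** 2.0,
    "product / kappa": err * err / 0.5,
}
rate = excess_risk_rate(n, l1_err=err, sup_err=err, alpha=1.0, kappa=0.5)
dominant = max(terms, key=terms.get)
```

With these inputs the $L_1$ nuisance error is the largest of the four terms, which illustrates how the plug-in rate is tied to how well $\mu$ is estimated.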

The same argument as in the preceding proof can be used to compute the rate of convergence in expectation as well. Specifically, when $\|\mu\|_{\infty} \leq B < \infty$ a.s., Biau et al. (2008, Theorem 2.1) implies that

$$
\mathbb{P}\left\{\sup_{C\in\mathcal{C}_{k}}\left|R(C) - R_{n}(C)\right|\right\} \leq \frac{12B^{2}k}{\sqrt{n}}.
$$
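The right-hand side of this bound is explicit, so it can be evaluated directly and inverted for a sample size. The two helper functions below are hypothetical names introduced for illustration; they only perform the arithmetic of $12B^{2}k/\sqrt{n}$ and make no claim about how tight the bound is in practice.

```python
import math

def biau_bound(B, k, n):
    """Right-hand side 12 * B^2 * k / sqrt(n) of the expected uniform deviation bound."""
    return 12 * B ** 2 * k / math.sqrt(n)

def min_sample_size(B, k, eps):
    """A sample size n for which biau_bound(B, k, n) is at most eps,
    obtained by solving 12 * B^2 * k / sqrt(n) <= eps for n and rounding up."""
    return math.ceil((12 * B ** 2 * k / eps) ** 2)

# Example: outcomes bounded by B = 1 and k = 3 clusters.
n_needed = min_sample_size(1.0, 3, 0.1)
```

The required sample size grows quadratically in both $k$ and $1/\varepsilon$, and with the fourth power of $B$.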

Also, by virtue of Lemma B.2 one may deduce that

$$
\begin{aligned}
\mathbb{P}\left\{\widehat{R}_{n}(C^{*}) - R(C^{*})\right\} &= \mathbb{P}\left\{f_{C^{*}}(\widehat{\mu}) - f_{C^{*}}(\mu)\right\} \\
&\lesssim \max_{a}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\mathbb{P},1} + \max_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}^{\alpha+1} + \frac{1}{\kappa}\max_{a}\left(\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\mathbb{P},1}\right).
\end{aligned}
$$

Using the above inequalities instead of (A.5) and (A.6), we obtain that

$$
\begin{aligned}
&\mathbb{P}\left\{R(\widehat{C}) - R(C^{*})\right\} \\
&\quad \lesssim \frac{1}{\sqrt{n}} + \max_{a}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\mathbb{P},1} + \max_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}^{\alpha+1} + \frac{1}{\kappa}\max_{a}\left(\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\mathbb{P},1}\right).
\end{aligned}
$$

B.2 Proof of Theorem 3.2

Lemma B.4.

For any $C\in\mathcal{C}_{k}$, under Assumption A1, we have

\[
\underset{C\in\mathcal{C}_{k}}{\sup}\left|\mathbb{P}\left\{f_{C}(\widehat{\mu})-f_{C}(\mu)\right\}\right|\lesssim\max_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\mathbb{P},1}.
\]
Proof.

We defer the proof until Lemma B.7 (see Remark B.3). ∎

First, we aim to show

\[
\underset{C\in\mathcal{C}_{k}}{\sup}\left|R_{n}(C)-R(C)\right|=o_{\mathbb{P}}(1).
\]

To this end, consider the following decomposition for any $C\in\mathcal{C}_{k}$:

\begin{align*}
R_{n}(C)-R(C)
&=\underbrace{(\mathbb{P}_{n}-\mathbb{P})\left\{f_{C}(\widehat{\mu})-f_{C}(\mu)\right\}}_{(i)}\\
&\quad+\underbrace{\mathbb{P}\left\{f_{C}(\widehat{\mu})-f_{C}(\mu)\right\}}_{(ii)}\\
&\quad+\underbrace{(\mathbb{P}_{n}-\mathbb{P})\left\{f_{C}(\mu)\right\}}_{(iii)}.
\end{align*}
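The decomposition is a purely algebraic identity, which can be checked numerically. The following is a minimal sketch under stated assumptions: a large finite "population" stands in for $\mathbb{P}$, a random subsample stands in for $\mathbb{P}_{n}$, and `mu_hat_pop` is a hypothetical perturbation of the true regression values. None of these names come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_C(points, centers):
    """Squared distance from each point to its nearest center: min_j ||m - c_j||^2."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1)

# Hypothetical setup: finite population (plays the role of P) and a subsample (P_n).
N, n, p, k = 20_000, 500, 2, 3
mu_pop = rng.normal(size=(N, p))                      # true mu(X) on the population
mu_hat_pop = mu_pop + 0.1 * rng.normal(size=(N, p))   # perturbed estimate of mu(X)
idx = rng.choice(N, size=n, replace=False)            # indices forming the sample
C = rng.normal(size=(k, p))                           # an arbitrary codebook

P = lambda v: v.mean()          # population average
Pn = lambda v: v[idx].mean()    # sample average

g_hat, g = f_C(mu_hat_pop, C), f_C(mu_pop, C)
term_i = Pn(g_hat - g) - P(g_hat - g)   # (P_n - P){f_C(mu_hat) - f_C(mu)}
term_ii = P(g_hat - g)                  # P{f_C(mu_hat) - f_C(mu)}
term_iii = Pn(g) - P(g)                 # (P_n - P){f_C(mu)}

lhs = Pn(g_hat) - P(g)                  # R_n(C) - R(C)
print(lhs, term_i + term_ii + term_iii)
```

The two printed numbers agree up to floating-point error for any codebook `C`, which is why the proof can control the three terms separately.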

We will analyze the terms in the following order: (iii) $\rightarrow$ (ii) $\rightarrow$ (i).

(iii) Consider the collection $\mathscr{G}$ of subgraph sets $\{(x,u)\in\mathbb{R}^{p}\times\mathbb{R}:f_{C}(x)>u\}$, $C\in\mathcal{C}_{k}$. The shattering number of $\mathscr{G}$ is $s(\mathscr{G},n)\leq n^{k(p+1)}$, which follows from the fact that each $\{f_{C}(x)>u\}$ is an intersection of the complements of $k$ balls, since $f_{C}(x)=\min_{j}\|x-c_{j}\|_{2}^{2}>u$ holds exactly when $x$ lies outside every ball of squared radius $u$ centered at some $c_{j}$. Hence the function class $\widetilde{\mathcal{F}}=\{f_{C}(\cdot):C\in\mathcal{C}_{k}\}$ is a VC-class.
For any fixed $\mu:\mathbb{R}^{p}\to\mathbb{R}$ and $f_{C}(\mu)=\|\mu-\Pi_{C}(\mu)\|_{2}^{2}=f_{C}\circ\mu$, by the stability property (e.g., Van Der Vaart & Wellner 1996, Lemma 2.6.17) the function class $\mathcal{F}_{\mu}=\{f_{C}(\mu(\cdot)):C\in\mathcal{C}_{k}\}$ is also a VC-class. Taking $F_{\mu}=\underset{C\in\mathcal{C}_{k}}{\sup}\left|f_{C}(\mu)\right|$ as the envelope function, we have $\mathbb{P}\{F_{\mu}\}\leq 4B^{2}$ under the given boundedness condition.
Thus, $\mathcal{F}_{\mu}$ is $\mathbb{P}$-Glivenko-Cantelli, yielding $\underset{C\in\mathcal{C}_{k}}{\sup}\left|(\mathbb{P}_{n}-\mathbb{P})\left\{f_{C}(\mu)\right\}\right|=o_{\mathbb{P}}(1)$.

(ii) Under Assumption A1, by Lemma B.4 and the fact that $\|\cdot\|_{\mathbb{P},1}\leq\|\cdot\|_{\infty}$, we have

\[
\underset{C\in\mathcal{C}_{k}}{\sup}\left|\mathbb{P}\left\{f_{C}(\widehat{\mu})-f_{C}(\mu)\right\}\right|\lesssim\max_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty},
\]

which is $o_{\mathbb{P}}(1)$ under the consistency condition in Assumption A2.

(i) Let $\mathcal{F}_{n}=\mathcal{F}_{\hat{\mu}}-\mathcal{F}_{\mu}$ for the function class $\mathcal{F}_{\bar{\mu}}=\{f_{C}(\bar{\mu}(\cdot)):C\in\mathcal{C}_{k}\}$ from before. Then,

\begin{align*}
\left\|\frac{1}{\sqrt{n}}\mathbb{G}_{n}\left\{f_{C}(\widehat{\mu})-f_{C}(\mu)\right\}\right\|_{\mathcal{C}_{k}}
&=\underset{C\in\mathcal{C}_{k}}{\sup}\left|\frac{1}{\sqrt{n}}\mathbb{G}_{n}\left\{f_{C}(\widehat{\mu})-f_{C}(\mu)\right\}\right|\\
&=\frac{1}{\sqrt{n}}\underset{f\in\mathcal{F}_{n}}{\sup}\left|\mathbb{G}_{n}(f)\right|.
\end{align*}

One may view the nuisance functions $\widehat{\mu}$ as fixed given the training data $D_{0}$. Since $\mathcal{F}_{\mu}$ is a VC-subgraph class for any fixed $\mu$, so is $\mathcal{F}_{n}$ given $D_{0}$. Let the VC index of $\mathcal{F}_{n}$ be $\nu^{\prime}<\infty$. Then we have

\[
\underset{Q}{\sup}\,N\left(\epsilon\|F_{n}\|_{Q,2},\mathcal{F}_{n},L_{2}(Q)\right)\lesssim\left(\frac{c_{1}}{\epsilon}\right)^{c_{2}\nu^{\prime}}
\]

for some universal constants $c_{1},c_{2}>0$. Hence, applying Giné & Nickl (2021, Theorem 3.5.4), we obtain that

\begin{align*}
\mathbb{P}\left\{\underset{f\in\mathcal{F}_{n}}{\sup}\left|\mathbb{G}_{n}(f)\right|\right\}
&\lesssim\left\|F_{n}\right\|\underset{Q}{\sup}\int_{0}^{1}\sqrt{1+\log N\left(\epsilon\|F_{n}\|_{Q,2},\mathcal{F}_{n},L_{2}(Q)\right)}\,d\epsilon\\
&\lesssim\left\|F_{n}\right\|\int_{0}^{1}\sqrt{1+\nu^{\prime}\log(1/\epsilon)}\,d\epsilon.
\end{align*}

Taking the envelope $F_{n}=\underset{C\in\mathcal{C}_{k}}{\sup}\left|f_{C}(\widehat{\mu})-f_{C}(\mu)\right|$, which is bounded, it is immediate that $\mathbb{P}\left\{\underset{f\in\mathcal{F}_{n}^{b}}{\sup}\left|\mathbb{G}_{n}(f)\right|\right\}=O_{\mathbb{P}}(1)$, as the integral in the last display is finite.
Consequently we get $\left\|(\mathbb{P}_{n}-\mathbb{P})\left\{f_{C}(\widehat{\mu})-f_{C}(\mu)\right\}\right\|_{\mathcal{C}_{k}}=O_{\mathbb{P}}\left(\frac{1}{\sqrt{n}}\right)=o_{\mathbb{P}}(1)$.
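The finiteness of the entropy integral can be made concrete. The sketch below is illustrative only and uses hypothetical helper names. It evaluates $\int_{0}^{1}\sqrt{1+\nu^{\prime}\log(1/\epsilon)}\,d\epsilon$ numerically after the substitution $\epsilon=e^{-t}$, and compares it with the elementary bound $1+\sqrt{\nu^{\prime}\pi}/2$, which follows from $\sqrt{1+x}\leq 1+\sqrt{x}$ and $\int_{0}^{1}\sqrt{\log(1/\epsilon)}\,d\epsilon=\Gamma(3/2)=\sqrt{\pi}/2$.

```python
import math
import numpy as np

def entropy_integral(nu, t_max=60.0, num=600_001):
    """Approximate int_0^1 sqrt(1 + nu*log(1/eps)) d(eps) using eps = exp(-t).

    After the substitution the integral becomes int_0^inf sqrt(1 + nu*t) exp(-t) dt,
    which is truncated at t_max and evaluated with the trapezoidal rule.
    """
    t = np.linspace(0.0, t_max, num)
    y = np.sqrt(1.0 + nu * t) * np.exp(-t)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(t)) / 2.0)

def closed_form_bound(nu):
    """Upper bound 1 + sqrt(nu*pi)/2 obtained from sqrt(1 + x) <= 1 + sqrt(x)."""
    return 1.0 + math.sqrt(nu * math.pi) / 2.0

# Values of nu chosen only for illustration; nu plays the role of the VC index.
values = {nu: (entropy_integral(nu), closed_form_bound(nu)) for nu in (1, 5, 25, 100)}
for nu, (numeric, bound) in values.items():
    print(nu, numeric, bound)
```

The integral grows only like $\sqrt{\nu^{\prime}}$, so for a fixed VC index it contributes a constant factor to the bound.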

Now that we have shown $\underset{C\in\mathcal{C}_{k}}{\sup}\left|\widehat{R}(C)-R(C)\right|=o_{\mathbb{P}}(1)$, the desired consistency $\widehat{C}\xrightarrow{p}C^{*}$ follows by Van der Vaart (2000, Theorem 5.7), noting that $R(\cdot)$ is a continuous, bounded function whose domain $\mathcal{C}_{k}$ is compact, and that $C^{*}$ is unique.

B.3 Proof of Lemma 4.1

Before proving the main result, we introduce the following lemma.

Lemma B.5.

Under Assumptions A1 and A4, we have that for any $a\in\mathcal{A}$,

\[
\mathbb{P}\left\{\varphi_{2,a}(\widehat{\eta})-\varphi_{2,a}(\eta)\right\}\lesssim\|\widehat{\mu}_{a}-\mu_{a}\|\left(\|\widehat{\mu}_{a}-\mu_{a}\|+\|\widehat{\pi}_{a}-\pi_{a}\|\right).
\]
Proof.

Since $\mathbb{P}\left\{\varphi_{2,a}(\eta)\right\}=\mathbb{P}\left\{\mu_{a}^{2}(X)\right\}$, it follows that

\begin{align*}
\mathbb{P}\left\{\varphi_{2,a}(\widehat{\eta})-\varphi_{2,a}(\eta)\right\}
&=\mathbb{P}\left\{2\widehat{\mu}_{a}\frac{\mathbbm{1}(A=a)}{\widehat{\pi}_{a}}\left\{Y-\widehat{\mu}_{A}\right\}+\widehat{\mu}_{a}^{2}-\mu_{a}^{2}\right\}\\
&=\mathbb{P}\left\{2\widehat{\mu}_{a}\frac{\pi_{a}}{\widehat{\pi}_{a}}(\mu_{a}-\widehat{\mu}_{a})+(\widehat{\mu}_{a}-\mu_{a})(\widehat{\mu}_{a}+\mu_{a})\right\}\\
&=\mathbb{P}\left[\left(\mu_{a}-\widehat{\mu}_{a}\right)\left\{2\widehat{\mu}_{a}\left(\frac{\pi_{a}-\widehat{\pi}_{a}}{\widehat{\pi}_{a}}\right)+\widehat{\mu}_{a}-\mu_{a}\right\}\right]\\
&\leq\mathbb{P}\left\{\left|\widehat{\mu}_{a}-\mu_{a}\right|\left(\left|\widehat{\mu}_{a}-\mu_{a}\right|+\frac{4B}{\epsilon}\left|\widehat{\pi}_{a}-\pi_{a}\right|\right)\right\}\\
&\lesssim\left\|\widehat{\mu}_{a}-\mu_{a}\right\|\left(\left\|\widehat{\mu}_{a}-\mu_{a}\right\|+\left\|\widehat{\pi}_{a}-\pi_{a}\right\|\right),
\end{align*}

where the second equality uses iterated expectations given $X$ and the last step uses the Cauchy-Schwarz inequality.
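The first two equalities and the final bound can be checked exactly on a small discrete model, where every expectation is a finite sum. The sketch below is only illustrative. The distribution, the nuisance "estimates", and the interpretation of $B$ as a bound on $|\mu_{a}|,|\widehat{\mu}_{a}|$ and of $\epsilon$ as a lower bound on $\widehat{\pi}_{a}$ are all assumptions made for the example, not quantities taken from the paper.

```python
import itertools
import numpy as np

# Hypothetical discrete model: X in {0, 1, 2}, binary A, binary Y.
p_x = np.array([0.2, 0.5, 0.3])                 # P(X = x)
pi1 = np.array([0.3, 0.5, 0.7])                 # pi_1(x) = P(A = 1 | X = x)
mu = {0: np.array([0.2, 0.4, 0.6]),             # mu_a(x) = E[Y | X = x, A = a]
      1: np.array([0.5, 0.3, 0.8])}

# Hypothetical (misspecified) nuisance estimates.
pi1_hat = np.array([0.35, 0.45, 0.6])
mu_hat = {0: np.array([0.25, 0.35, 0.5]),
          1: np.array([0.4, 0.45, 0.7])}

a = 1  # treatment level of interest, so pi_a = pi1 and pi_a_hat = pi1_hat

def prob(x, a_obs, y):
    """Joint probability P(X = x, A = a_obs, Y = y)."""
    p_a = pi1[x] if a_obs == 1 else 1.0 - pi1[x]
    p_y = mu[a_obs][x] if y == 1 else 1.0 - mu[a_obs][x]
    return p_x[x] * p_a * p_y

# First line of the display: expectation over (X, A, Y) of the integrand.
line1 = sum(
    prob(x, a_obs, y)
    * (2 * mu_hat[a][x] * (a_obs == a) / pi1_hat[x] * (y - mu_hat[a_obs][x])
       + mu_hat[a][x] ** 2 - mu[a][x] ** 2)
    for x, a_obs, y in itertools.product(range(3), (0, 1), (0, 1))
)

# Second line: A and Y averaged out given X.
d_mu = mu_hat[a] - mu[a]
d_pi = pi1_hat - pi1
line2 = float(np.sum(p_x * (2 * mu_hat[a] * pi1 / pi1_hat * (mu[a] - mu_hat[a])
                            + d_mu * (mu_hat[a] + mu[a]))))

# Final bound in L2(P) norms, using the constant 4B/eps from the display.
l2 = lambda v: float(np.sqrt(np.sum(p_x * v ** 2)))
B = max(np.abs(mu_hat[a]).max(), np.abs(mu[a]).max())
eps = pi1_hat.min()
bound = l2(d_mu) * (l2(d_mu) + 4 * B / eps * l2(d_pi))
print(line1, line2, bound)
```

The bias is a product of nuisance errors, so it is of second order: halving both `d_mu` and `d_pi` shrinks `bound` by a factor of four.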

Remark B.1.

(Kennedy (2022, Example 2)) For $\varphi_{1,a}(\eta)$, it is well known that

\[
\mathbb{P}\left\{\varphi_{1,a}(\widehat{\eta})-\varphi_{1,a}(\eta)\right\}\lesssim\|\widehat{\mu}_{a}-\mu_{a}\|\|\widehat{\pi}_{a}-\pi_{a}\|.
\]
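For completeness, the bound in Remark B.1 can be sketched in one line. This assumes $\varphi_{1,a}$ takes the usual augmented inverse-probability-weighted form $\varphi_{1,a}(\eta)=\frac{\mathbbm{1}(A=a)}{\pi_{a}}\{Y-\mu_{A}\}+\mu_{a}$ and that $\widehat{\pi}_{a}\geq\epsilon$. Both are assumptions made here for illustration, since neither is stated in this section.

\begin{align*}
\mathbb{P}\left\{\varphi_{1,a}(\widehat{\eta})-\varphi_{1,a}(\eta)\right\}
&=\mathbb{P}\left\{\frac{\pi_{a}}{\widehat{\pi}_{a}}(\mu_{a}-\widehat{\mu}_{a})+\widehat{\mu}_{a}-\mu_{a}\right\}\\
&=\mathbb{P}\left\{(\mu_{a}-\widehat{\mu}_{a})\left(\frac{\pi_{a}-\widehat{\pi}_{a}}{\widehat{\pi}_{a}}\right)\right\}\\
&\leq\frac{1}{\epsilon}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|\left\|\widehat{\pi}_{a}-\pi_{a}\right\|,
\end{align*}

where the first equality uses iterated expectations given $X$ together with $\mathbb{P}\{\varphi_{1,a}(\eta)\}=\mathbb{P}\{\mu_{a}\}$, and the last step is the Cauchy-Schwarz inequality.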
Lemma B.6.

Suppose that Assumptions A1, A4 hold and $\mathbb{P}$ satisfies the margin condition with some $\kappa>0$, $\alpha>0$. Then we have

\begin{align*}
&\left|\mathbb{P}\left\{\varphi_{C^{*}}(\widehat{\eta})-\varphi_{C^{*}}(\eta)\right\}\right|\\
&\lesssim\max_{a}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|\left(\left\|\widehat{\mu}_{a}-\mu_{a}\right\|+\left\|\widehat{\pi}_{a}-\pi_{a}\right\|\right)+\max_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}^{\alpha+1}\\
&\quad+\frac{1}{\kappa}\max_{a}\left(\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\mathbb{P},1}\right).
\end{align*}
Proof.

Letting

\begin{align*}
f_{c^{*}_{j}}(\mu)&=\|\mu-c^{*}_{j}\|_{2}^{2},\\
\varphi_{c^{*}_{j}}(\eta)&=\sum_{a}\left\{\varphi_{2,a}(\eta)-2\varphi_{1,a}(\eta)c^{*}_{ja}+{c^{*}_{ja}}^{2}\right\},
\end{align*}

and

\begin{align*}
d\equiv d(\mu;C^{*})&=\mathop{\mathrm{argmin}}_{j}f_{c^{*}_{j}}(\mu),\\
\widehat{d}\equiv d(\widehat{\mu};C^{*})&=\mathop{\mathrm{argmin}}_{j}f_{c^{*}_{j}}(\widehat{\mu}),
\end{align*}

one may write

\begin{align*}
\mathbb{P}\left\{f_{C^{*}}(\mu)\right\}&=\mathbb{P}\left[\min_{j\in\{1,\ldots,k\}}f_{c^{*}_{j}}(\mu)\right]=\sum_{j=1}^{k}\mathbb{P}\left\{\mathbbm{1}\{d=j\}f_{c^{*}_{j}}(\mu)\right\},\\
\mathbb{P}\{\varphi_{C^{*}}(\eta)\}&=\sum_{j=1}^{k}\mathbb{P}\left\{\mathbbm{1}\{d=j\}\varphi_{c^{*}_{j}}(\eta)\right\},\\
\mathbb{P}\{\varphi_{C^{*}}(\widehat{\eta})\}&=\sum_{j=1}^{k}\mathbb{P}\left\{\mathbbm{1}\{\widehat{d}=j\}\varphi_{c^{*}_{j}}(\widehat{\eta})\right\}.
\end{align*}
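The first identity, which rewrites the minimum over centers as a sum over nearest-center indicators, holds pointwise and is easy to verify numerically. The following minimal sketch uses randomly drawn values as stand-ins for $\mu(X)$ and for the optimal codebook $C^{*}$. Both are hypothetical and serve only to illustrate the algebra.

```python
import numpy as np

rng = np.random.default_rng(1)

n, p, k = 1_000, 3, 4
mu_vals = rng.normal(size=(n, p))    # hypothetical draws of mu(X)
C_star = rng.normal(size=(k, p))     # hypothetical codebook standing in for C*

# f_{c_j}(mu) = ||mu - c_j||_2^2 for every point and every center: shape (n, k).
f_cj = ((mu_vals[:, None, :] - C_star[None, :, :]) ** 2).sum(axis=2)

d = f_cj.argmin(axis=1)              # nearest-center label d(mu; C*)
f_C = f_cj.min(axis=1)               # f_{C*}(mu) = min_j f_{c_j}(mu)

# sum_j 1{d = j} f_{c_j}(mu), evaluated pointwise.
indicator = d[:, None] == np.arange(k)[None, :]
f_C_via_indicators = (indicator * f_cj).sum(axis=1)

print(np.max(np.abs(f_C - f_C_via_indicators)))
```

Because exactly one indicator is active per point, averaging either expression gives the same risk. The proof then exploits the fact that $\widehat{d}$ and $d$ differ only where $\widehat{\mu}$ moves a point across a cluster boundary.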

Now note that

\begin{align*}
&\mathbb{P}\left\{\varphi_{C^{*}}(\widehat{\eta})-\varphi_{C^{*}}(\eta)\right\}\\
&=\sum_{j=1}^{k}\left(\mathbb{P}\left[\mathbbm{1}(\widehat{d}=j)\left\{\varphi_{c^{*}_{j}}(\widehat{\eta})-\varphi_{c^{*}_{j}}(\eta)\right\}\right]+\mathbb{P}\left[\left\{\mathbbm{1}(\widehat{d}=j)-\mathbbm{1}(d=j)\right\}\varphi_{c^{*}_{j}}(\eta)\right]\right)\\
&=\sum_{j=1}^{k}\left(\mathbb{P}\left[\mathbbm{1}(\widehat{d}=j)\left\{\varphi_{c^{*}_{j}}(\widehat{\eta})-\varphi_{c^{*}_{j}}(\eta)\right\}\right]+\mathbb{P}\left[\left\{\mathbbm{1}(\widehat{d}=j)-\mathbbm{1}(d=j)\right\}f_{c^{*}_{j}}(\mu)\right]\right),
\tag{A.7}
\end{align*}

where the last equality follows from the fact that $\mathbb{P}\left\{f_{C^{*}}(\mu)\right\}=\mathbb{P}\{\varphi_{C^{*}}(\eta)\}$. For the first term in the last display, it follows immediately from Lemma B.5 and Remark B.1 that

\begin{align*}
\sum_{j=1}^{k}\left|\mathbb{P}\left[\mathbbm{1}(\widehat{d}=j)\left\{\varphi_{c^{*}_{j}}(\widehat{\eta})-\varphi_{c^{*}_{j}}(\eta)\right\}\right]\right|
\lesssim\max_{a}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|\left(\left\|\widehat{\mu}_{a}-\mu_{a}\right\|+\left\|\widehat{\pi}_{a}-\pi_{a}\right\|\right).
\tag{A.8}
\end{align*}

Next, let us rewrite the second term in (A.7) as

\begin{align*}
&\sum_{j}\mathbb{P}\left[\left\{\mathbbm{1}(\widehat{d}=j)-\mathbbm{1}(d=j)\right\}f_{c^{*}_{j}}(\mu)\right]\\
&=\sum_{j}\mathbb{P}\Bigg(\left\{\mathbbm{1}(\widehat{d}=j)-\mathbbm{1}(d=j)\right\}f_{c^{*}_{j}}(\mu)\\
&\qquad\qquad\times\left[\mathbbm{1}\left\{2\max_{j}\left|f_{c^{*}_{j}}(\widehat{\mu})-f_{c^{*}_{j}}(\mu)\right|\leq\kappa\right\}+\mathbbm{1}\left\{2\max_{j}\left|f_{c^{*}_{j}}(\widehat{\mu})-f_{c^{*}_{j}}(\mu)\right|>\kappa\right\}\right]\Bigg).
\end{align*}

By mimicking the proof of Theorem 2 of Levis et al. (2023), we have that

\begin{align*}
&\left|\sum_{j}\mathbb{P}\left[\left\{\mathbbm{1}(\widehat{d}=j)-\mathbbm{1}(d=j)\right\}f_{c^{*}_{j}}(\mu)\mathbbm{1}\left\{2\max_{j}\left|f_{c^{*}_{j}}(\widehat{\mu})-f_{c^{*}_{j}}(\mu)\right|\leq\kappa\right\}\right]\right|\\
&=\mathbb{P}\left[\mathbbm{1}\left\{f_{c^{*}_{d}}(\mu)<f_{c^{*}_{\widehat{d}}}(\mu)\right\}\left\{f_{c^{*}_{\widehat{d}}}(\mu)-f_{c^{*}_{d}}(\mu)\right\}\mathbbm{1}\left\{2\max_{j}\left|f_{c^{*}_{j}}(\widehat{\mu})-f_{c^{*}_{j}}(\mu)\right|\leq\kappa\right\}\right]\\
&\leq\mathbb{P}\Bigg(\mathbbm{1}\left[\min_{j\neq d}\left\{f_{c^{*}_{j}}(\mu)-f_{c^{*}_{d}}(\mu)\right\}\leq f_{c^{*}_{\widehat{d}}}(\mu)-f_{c^{*}_{d}}(\mu)+f_{c^{*}_{d}}(\widehat{\mu})-f_{c^{*}_{\widehat{d}}}(\widehat{\mu})\right]\\
&\qquad\quad\times\left\{f_{c^{*}_{\widehat{d}}}(\mu)-f_{c^{*}_{d}}(\mu)+f_{c^{*}_{d}}(\widehat{\mu})-f_{c^{*}_{\widehat{d}}}(\widehat{\mu})\right\}\mathbbm{1}\left\{2\max_{j}\left|f_{c^{*}_{j}}(\widehat{\mu})-f_{c^{*}_{j}}(\mu)\right|\leq\kappa\right\}\Bigg)\\
&\leq 2\max_{j}\left\|f_{c^{*}_{j}}(\widehat{\mu})-f_{c^{*}_{j}}(\mu)\right\|_{\infty}\\
&\quad\times\mathbb{P}\left[\zeta_{j}(\mu;C^{*})\leq 2\max_{j}\left|f_{c^{*}_{j}}(\widehat{\mu})-f_{c^{*}_{j}}(\mu)\right|\Bigm|2\max_{j}\left|f_{c^{*}_{j}}(\widehat{\mu})-f_{c^{*}_{j}}(\mu)\right|\leq\kappa\right]\\
&\lesssim\max_{j}\left\|f_{c^{*}_{j}}(\widehat{\mu})-f_{c^{*}_{j}}(\mu)\right\|_{\infty}^{\alpha+1}\\
&\lesssim\max_{a}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\infty}^{\alpha+1},
\tag{A.9}
\end{align*}

where the first inequality follows by the fact that $f_{c^{*}_{d}}(\widehat{\mu})\geq f_{c^{*}_{\widehat{d}}}(\widehat{\mu})$ and $f_{c^{*}_{\widehat{d}}}(\mu)\geq f_{c^{*}_{d}}(\mu)$, the third by the margin condition, and the last by local Lipschitz continuity of each $f_{c^{*}_{j}}$ at $\mu$ under Assumption A1.
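The pointwise facts behind the first two inequalities in (A.9) can be checked numerically. The sketch below is illustrative only: it assumes the squared-distance form $f_{c}(\mu)=\|\mu-c\|_{2}^{2}$, and the codebook `centers`, the regression values `mu`, and the perturbed values `mu_hat` are synthetic stand-ins rather than anything estimated from data. It checks that both differences added in the first inequality are nonnegative, that their sum is at most $2\max_{j}|f_{c^{*}_{j}}(\widehat{\mu})-f_{c^{*}_{j}}(\mu)|$, and that on the event $\widehat{d}\neq d$ the gap $\min_{j\neq d}\{f_{c^{*}_{j}}(\mu)-f_{c^{*}_{d}}(\mu)\}$ is no larger than $f_{c^{*}_{\widehat{d}}}(\mu)-f_{c^{*}_{d}}(\mu)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 5000, 3, 4
centers = rng.normal(size=(k, p))             # hypothetical fixed codebook C*
mu = rng.normal(size=(n, p))                  # synthetic stand-in for mu(X_i)
mu_hat = mu + 0.1 * rng.normal(size=(n, p))   # synthetic stand-in for the estimate

def sq_dists(m, c):
    """Assumed form f_{c_j}(m) = ||m - c_j||^2, returned as an (n, k) array."""
    return ((m[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)

f = sq_dists(mu, centers)                     # f_{c_j}(mu)
f_hat = sq_dists(mu_hat, centers)             # f_{c_j}(mu_hat)
d = f.argmin(axis=1)                          # nearest center under mu
d_hat = f_hat.argmin(axis=1)                  # nearest center under mu_hat
rows = np.arange(n)

excess = f[rows, d_hat] - f[rows, d]          # f_{d_hat}(mu) - f_d(mu)
slack = f_hat[rows, d] - f_hat[rows, d_hat]   # f_d(mu_hat) - f_{d_hat}(mu_hat)
sup_err = np.abs(f_hat - f).max(axis=1)       # max_j |f_j(mu_hat) - f_j(mu)|

f_sorted = np.sort(f, axis=1)
margin = f_sorted[:, 1] - f_sorted[:, 0]      # min_{j != d} {f_j(mu) - f_d(mu)}
mis = d_hat != d                              # event where the assignment changes

print("fraction with d_hat != d:", mis.mean())
print("excess and slack nonnegative:", bool(np.all(excess >= 0) and np.all(slack >= 0)))
print("excess + slack <= 2 * sup_err:", bool(np.all(excess + slack <= 2 * sup_err + 1e-9)))
print("margin <= excess when d_hat != d:", bool(np.all(margin[mis] <= excess[mis])))
```

The checks are deterministic consequences of the two argmin definitions, so they should hold for any choice of the synthetic inputs, not only the seed used here.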

Similarly, we also note that

\begin{align*}
&\left|\sum_{j}\mathbb{P}\left[\left\{\mathbbm{1}(\widehat{d}=j)-\mathbbm{1}(d=j)\right\}f_{c^{*}_{j}}(\mu)\mathbbm{1}\left\{2\max_{j}\left|f_{c^{*}_{j}}(\widehat{\mu})-f_{c^{*}_{j}}(\mu)\right|>\kappa\right\}\right]\right|\\
&=\mathbb{P}\left[\mathbbm{1}\left\{f_{c^{*}_{d}}(\mu)<f_{c^{*}_{\widehat{d}}}(\mu)\right\}\left\{f_{c^{*}_{\widehat{d}}}(\mu)-f_{c^{*}_{d}}(\mu)\right\}\mathbbm{1}\left\{2\max_{j}\left|f_{c^{*}_{j}}(\widehat{\mu})-f_{c^{*}_{j}}(\mu)\right|>\kappa\right\}\right]\\
&\leq\mathbb{P}\left[\mathbbm{1}\left\{f_{c^{*}_{d}}(\mu)<f_{c^{*}_{\widehat{d}}}(\mu)\right\}\left\{f_{c^{*}_{\widehat{d}}}(\mu)-f_{c^{*}_{d}}(\mu)+f_{c^{*}_{d}}(\widehat{\mu})-f_{c^{*}_{\widehat{d}}}(\widehat{\mu})\right\}\mathbbm{1}\left\{2\max_{j}\left|f_{c^{*}_{j}}(\widehat{\mu})-f_{c^{*}_{j}}(\mu)\right|>\kappa\right\}\right]\\
&\leq 2\max_{j}\left\|f_{c^{*}_{j}}(\widehat{\mu})-f_{c^{*}_{j}}(\mu)\right\|_{\infty}\mathbb{P}\left\{\max_{j}\left|f_{c^{*}_{j}}(\widehat{\mu})-f_{c^{*}_{j}}(\mu)\right|>\kappa/2\right\}\\
&\leq\frac{4}{\kappa}\max_{j}\left\|f_{c^{*}_{j}}(\widehat{\mu})-f_{c^{*}_{j}}(\mu)\right\|_{\infty}\max_{j}\mathbb{P}\left|f_{c^{*}_{j}}(\widehat{\mu})-f_{c^{*}_{j}}(\mu)\right|\\
&\lesssim\frac{1}{\kappa}\max_{a}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\infty}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\mathbb{P},1},
\tag{A.10}
\end{align*}

where the first inequality follows by Hölder's inequality and the second by Markov's inequality. Putting these together, we finally obtain that

\begin{align*}
&\left|\mathbb{P}\left\{\varphi_{C^{*}}(\widehat{\eta})-\varphi_{C^{*}}(\eta)\right\}\right|\\
&\lesssim\max_{a}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|\left(\left\|\widehat{\mu}_{a}-\mu_{a}\right\|+\left\|\widehat{\pi}_{a}-\pi_{a}\right\|\right)+\max_{a}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\infty}^{\alpha+1}\\
&\quad+\frac{1}{\kappa}\max_{a}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\infty}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\mathbb{P},1}.
\end{align*}
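For readers who want the Markov step in (A.10) spelled out, one possible expansion is sketched below. It is not taken from the original proof. It assumes that $k$ is treated as a fixed constant, so that the extra factor of $k$ produced by bounding the maximum by a sum is absorbed into the $\lesssim$ notation.

```latex
\begin{align*}
\mathbb{P}\left\{\max_{j}\left|f_{c^{*}_{j}}(\widehat{\mu})-f_{c^{*}_{j}}(\mu)\right|>\kappa/2\right\}
&\leq\frac{2}{\kappa}\,\mathbb{P}\left[\max_{j}\left|f_{c^{*}_{j}}(\widehat{\mu})-f_{c^{*}_{j}}(\mu)\right|\right]
&&\text{(Markov's inequality)}\\
&\leq\frac{2}{\kappa}\sum_{j=1}^{k}\mathbb{P}\left|f_{c^{*}_{j}}(\widehat{\mu})-f_{c^{*}_{j}}(\mu)\right|
&&\text{(maximum of nonnegative terms is at most their sum)}\\
&\leq\frac{2k}{\kappa}\max_{j}\mathbb{P}\left|f_{c^{*}_{j}}(\widehat{\mu})-f_{c^{*}_{j}}(\mu)\right|.
\end{align*}
```

Multiplying by the sup-norm factor $2\max_{j}\|f_{c^{*}_{j}}(\widehat{\mu})-f_{c^{*}_{j}}(\mu)\|_{\infty}$ then gives a bound of the same form as the one displayed before (A.10), up to the constant.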

Remark B.2 (Proof of Lemma B.2).

The proof of Lemma B.2 parallels the proof of Lemma B.6 provided above. Indeed, since we have the counterpart of (A.7) as

\begin{align*}
&\mathbb{P}\left\{f_{C^{*}}(\widehat{\mu})-f_{C^{*}}(\mu)\right\}\\
&=\sum_{j=1}^{k}\left(\mathbb{P}\left[\mathbbm{1}(\widehat{d}=j)\left\{f_{c^{*}_{j}}(\widehat{\mu})-f_{c^{*}_{j}}(\mu)\right\}\right]+\mathbb{P}\left[\left\{\mathbbm{1}(\widehat{d}=j)-\mathbbm{1}(d=j)\right\}f_{c^{*}_{j}}(\mu)\right]\right),
\end{align*}

the only difference is to replace (A.8) with

\begin{align*}
\sum_{j=1}^{k}\left|\mathbb{P}\left[\mathbbm{1}(\widehat{d}=j)\left\{f_{c^{*}_{j}}(\widehat{\mu})-f_{c^{*}_{j}}(\mu)\right\}\right]\right|\lesssim\max_{a}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\mathbb{P},1},
\end{align*}

which gives the result.

Using the same logic as in the proof of Lemma B.6, we may obtain the following uniform bound.

Lemma B.7.

For any $C\in\mathcal{C}_{k}$, under Assumptions A1 and A4, we have

\[
\sup_{C\in\mathcal{C}_{k}}\left|\mathbb{P}\left\{\varphi_{C}(\widehat{\eta})-\varphi_{C}(\eta)\right\}\right|\lesssim\max_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\mathbb{P},1}+\max_{a}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|\left(\left\|\widehat{\mu}_{a}-\mu_{a}\right\|+\left\|\widehat{\pi}_{a}-\pi_{a}\right\|\right).
\]
Proof.

Notice that (A.7) and (A.8) hold for any $C\in\mathcal{C}_{k}$, i.e.,

\begin{align*}
&\mathbb{P}\left\{\varphi_{C}(\widehat{\eta})-\varphi_{C}(\eta)\right\}\\
&\quad=\sum_{j=1}^{k}\left(\mathbb{P}\left[\mathbbm{1}(\widehat{d}=j)\left\{\varphi_{c_{j}}(\widehat{\eta})-\varphi_{c_{j}}(\eta)\right\}\right]+\mathbb{P}\left[\left\{\mathbbm{1}(\widehat{d}=j)-\mathbbm{1}(d=j)\right\}f_{c_{j}}(\mu)\right]\right),
\end{align*}

where

\[
\sum_{j=1}^{k}\left|\mathbb{P}\left[\mathbbm{1}(\widehat{d}=j)\left\{\varphi_{c_{j}}(\widehat{\eta})-\varphi_{c_{j}}(\eta)\right\}\right]\right|\lesssim\max_{a}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|\left(\left\|\widehat{\mu}_{a}-\mu_{a}\right\|+\left\|\widehat{\pi}_{a}-\pi_{a}\right\|\right).
\]

Further, proceeding similarly to (B.3), we may get

\begin{align*}
&\left|\sum_{j}\mathbb{P}\left[\left\{\mathbbm{1}(\widehat{d}=j)-\mathbbm{1}(d=j)\right\}f_{c_{j}}(\mu)\right]\right|\\
&\quad=\mathbb{P}\left[\mathbbm{1}\left\{f_{c_{d}}(\mu)<f_{c_{\widehat{d}}}(\mu)\right\}\left\{f_{c_{\widehat{d}}}(\mu)-f_{c_{d}}(\mu)+f_{c_{d}}(\widehat{\mu})-f_{c_{\widehat{d}}}(\widehat{\mu})\right\}\right]\\
&\quad\leq 2\max_{j}\|f_{c_{j}}(\widehat{\mu})-f_{c_{j}}(\mu)\|_{\mathbb{P},1}\\
&\quad\lesssim\max_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\mathbb{P},1},
\end{align*}

which follows by Hölder’s inequality and local Lipschitz continuity of $f_{c_{j}}$ at $\mu$. Hence, we conclude that for any $C\in\mathcal{C}_{k}$,

\[
\left|\mathbb{P}\left\{\varphi_{C}(\widehat{\eta})-\varphi_{C}(\eta)\right\}\right|\lesssim\max_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\mathbb{P},1}+\max_{a}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|\left(\left\|\widehat{\mu}_{a}-\mu_{a}\right\|+\left\|\widehat{\pi}_{a}-\pi_{a}\right\|\right).
\]

The result follows because the right-hand side does not depend on $C$. ∎

Remark B.3 (Proof of Lemma B.4).

The proof of Lemma B.4 parallels that of Lemma B.7 given above. Indeed, for $\left|\mathbb{P}\left\{f_{C}(\widehat{\mu})-f_{C}(\mu)\right\}\right|$, both

\begin{align*}
&\left|\mathbb{P}\left[\left\{\mathbbm{1}(\widehat{d}=j)-\mathbbm{1}(d=j)\right\}f_{c_{j}}(\mu)\right]\right|,\\
&\left|\mathbb{P}\left[\mathbbm{1}(\widehat{d}=j)\left\{f_{c_{j}}(\widehat{\mu})-f_{c_{j}}(\mu)\right\}\right]\right|
\end{align*}

are $O\left(\max_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\mathbb{P},1}\right)$.

Proof of Lemma 4.1.

Recall that $\phi_{C^{*}}(z;\mathbb{P})=\varphi_{C^{*}}(z;\mathbb{P})-\int\varphi_{C^{*}}(z;\mathbb{P})d\mathbb{P}$ and $R(C^{*})=\mathbb{P}\{\varphi_{C^{*}}(\eta)\}\equiv\int\varphi_{C^{*}}(z;\mathbb{P})d\mathbb{P}$. For two distributions $\bar{\mathbb{P}},\mathbb{P}$, the second-order remainder term in the von Mises expansion is given by

\begin{align*}
R_{2}(\bar{\mathbb{P}},\mathbb{P})&=\bar{R}(C^{*})-R(C^{*})+\int f_{C^{*}}(z;\bar{\mathbb{P}})d\mathbb{P} \tag{A.11}\\
&=\int\left\{\varphi_{C^{*}}(z;\mathbb{P})-\varphi_{C^{*}}(z;\bar{\mathbb{P}})\right\}d\mathbb{P}.
\end{align*}

By Lemma B.6, the last term in (A.11) is further bounded as

\begin{align*}
\left|\mathbb{P}\left\{\varphi_{C^{*}}(\bar{\eta})-\varphi_{C^{*}}(\eta)\right\}\right|&\lesssim\max_{a}\left\|\bar{\mu}_{a}-\mu_{a}\right\|\left(\left\|\bar{\mu}_{a}-\mu_{a}\right\|+\left\|\bar{\pi}_{a}-\pi_{a}\right\|\right)+\max_{a}\|\bar{\mu}_{a}-\mu_{a}\|_{\infty}^{\alpha+1}\\
&\quad+\frac{1}{\kappa}\max_{a}\left(\|\bar{\mu}_{a}-\mu_{a}\|_{\infty}\left\|\bar{\mu}_{a}-\mu_{a}\right\|_{\mathbb{P},1}\right).
\end{align*}

Hence for a submodel $\mathbb{P}_{\varepsilon}$, we have

\[
\frac{d}{d\varepsilon}R_{2}(\mathbb{P},\mathbb{P}_{\varepsilon})\Bigm|_{\varepsilon=0}=0,
\]

because the remainder $R_{2}(\mathbb{P},\mathbb{P}_{\varepsilon})$ essentially consists only of second-order products of errors between $\mathbb{P}$ and $\mathbb{P}_{\varepsilon}$. Since there is at most one efficient influence function in nonparametric models, we can now apply Lemma 2 of Kennedy et al. (2023) and conclude that $\phi_{C^{*}}$ is the efficient influence function. ∎
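The zero-derivative step can be made concrete with a short sketch. It rests on assumptions that are not stated in this proof: that every nuisance error between $\mathbb{P}$ and $\mathbb{P}_{\varepsilon}$ appearing in the bound above (in the $L_{2}(\mathbb{P})$, $L_{1}(\mathbb{P})$, and sup norms) is of order $|\varepsilon|$, that the same bound applies with the two distributions in the roles they take in $R_{2}(\mathbb{P},\mathbb{P}_{\varepsilon})$, and that the exponent satisfies $\alpha>0$. Under these assumptions,

\[
\left|R_{2}(\mathbb{P},\mathbb{P}_{\varepsilon})\right|\lesssim\varepsilon^{2}+|\varepsilon|^{\alpha+1}+\frac{1}{\kappa}\varepsilon^{2},
\]

and since $R_{2}(\mathbb{P},\mathbb{P})=0$,

\[
\left|\frac{d}{d\varepsilon}R_{2}(\mathbb{P},\mathbb{P}_{\varepsilon})\Bigm|_{\varepsilon=0}\right|=\lim_{\varepsilon\to 0}\frac{\left|R_{2}(\mathbb{P},\mathbb{P}_{\varepsilon})\right|}{|\varepsilon|}\lesssim\lim_{\varepsilon\to 0}\left(|\varepsilon|+|\varepsilon|^{\alpha}+\frac{|\varepsilon|}{\kappa}\right)=0.
\]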

B.4 Proof of Lemma 4.2

Proof.

For any $C^{*}\in\mathcal{C}_{k}^{*}$, one may write

\begin{align*}
\widehat{R}(C^{*})&=\sum_{b=1}^{K}\mathbb{P}_{n}\left\{\varphi_{C^{*}}(\widehat{\eta}_{-b})\mathbbm{1}(B=b)\right\},\\
R(C^{*})&=\mathbb{E}(\varphi_{C^{*}})=\sum_{b=1}^{K}\mathbb{P}\left\{\varphi_{C^{*}}(\eta)\mathbbm{1}(B=b)\right\},
\end{align*}

where we drop the dependence on $Z$ in $\varphi_{C^{*}}$ for simplicity. Then consider the following decomposition:

\begin{align*}
\sqrt{n}\left\{\widehat{R}(C^{*})-R(C^{*})\right\}&=\sum_{b=1}^{K}\underbrace{\mathbb{G}_{n}\left[\left\{\varphi_{C^{*}}(\widehat{\eta}_{-b})-\varphi_{C^{*}}(\eta)\right\}\mathbbm{1}(B=b)\right]}_{\text{(i)}}\\
&\quad+\sqrt{n}\sum_{b=1}^{K}\underbrace{\mathbb{P}\left[\left\{\varphi_{C^{*}}(\widehat{\eta}_{-b})-\varphi_{C^{*}}(\eta)\right\}\mathbbm{1}(B=b)\right]}_{\text{(ii)}}\\
&\quad+\mathbb{G}_{n}\left\{\varphi_{C^{*}}(\eta)\right\}.
\end{align*}

It suffices to show that the terms (i) and (ii) are negligible, as the last term converges to $N\left(0,\text{var}\left(\varphi_{C^{*}}\right)\right)$ by the central limit theorem.
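The decomposition can be checked fold by fold. As a sketch, write $\widehat{g}_{b}=\varphi_{C^{*}}(\widehat{\eta}_{-b})\mathbbm{1}(B=b)$ and $g_{b}=\varphi_{C^{*}}(\eta)\mathbbm{1}(B=b)$ (shorthand introduced here only for illustration), and assume the usual convention $\mathbb{G}_{n}=\sqrt{n}(\mathbb{P}_{n}-\mathbb{P})$. Adding and subtracting terms gives

\begin{align*}
\sqrt{n}\left(\mathbb{P}_{n}\widehat{g}_{b}-\mathbb{P}g_{b}\right)&=\sqrt{n}(\mathbb{P}_{n}-\mathbb{P})(\widehat{g}_{b}-g_{b})+\sqrt{n}\,\mathbb{P}(\widehat{g}_{b}-g_{b})+\sqrt{n}(\mathbb{P}_{n}-\mathbb{P})g_{b}\\
&=\mathbb{G}_{n}(\widehat{g}_{b}-g_{b})+\sqrt{n}\,\mathbb{P}(\widehat{g}_{b}-g_{b})+\mathbb{G}_{n}g_{b}.
\end{align*}

Summing over $b$ and using $\sum_{b=1}^{K}\mathbbm{1}(B=b)=1$, so that $\sum_{b}\mathbb{G}_{n}g_{b}=\mathbb{G}_{n}\{\varphi_{C^{*}}(\eta)\}$, recovers the three terms of the decomposition.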

(i) Noting $n\lesssim n/K$ with fixed $K$, we have

\begin{align*}
&\left\|\left\{\varphi_{C^{*}}(\widehat{\eta}_{-b})-\varphi_{C^{*}}(\eta)\right\}\mathbbm{1}(B=b)\right\|\lesssim\left\|\varphi_{C^{*}}(\widehat{\eta})-\varphi_{C^{*}}(\eta)\right\|\\
&\quad\leq\sum_{a}\left\|\varphi_{2,a}(\widehat{\eta})-\varphi_{2,a}(\eta)+2\Pi_{C^{*},a}(\mu)(\varphi_{1,a}(\eta)-\varphi_{1,a}(\widehat{\eta}))+\left\{\widehat{\pi}_{a}+\pi_{a}-2\varphi_{1,a}(\widehat{\eta})\right\}(\widehat{\pi}_{a}-\pi_{a})\right\|\\
&\quad\lesssim\sum_{a}\left(\left\|\varphi_{2,a}(\widehat{\eta})-\varphi_{2,a}(\eta)\right\|+\left\|\varphi_{1,a}(\widehat{\eta})-\varphi_{1,a}(\eta)\right\|+\left\|\Pi_{C^{*},a}(\widehat{\mu})-\Pi_{C^{*},a}(\mu)\right\|\right).
\end{align*}

By adding and subtracting terms, it is straightforward to show

\begin{align*}
&\left\|\varphi_{2,a}(\widehat{\eta})-\varphi_{2,a}(\eta)\right\|\\
&\quad\leq\left\|\widehat{\mu}_{a}\frac{\mathbbm{1}(A=a)}{\widehat{\pi}_{a}}\left(\mu_{A}-\widehat{\mu}_{A}\right)+\mathbbm{1}(A=a)(Y-\mu_{A})\left(\frac{\widehat{\mu}_{a}}{\widehat{\pi}_{a}}-\frac{\mu_{a}}{\widehat{\pi}_{a}}+\frac{\mu_{a}}{\widehat{\pi}_{a}}-\frac{\mu_{a}}{\pi_{a}}\right)+(\widehat{\mu}_{a}-\mu_{a})(\widehat{\pi}_{a}-\pi_{a})\right\|\\
&\quad\lesssim\left\|\widehat{\mu}_{a}-\mu_{a}\right\|+\left\|\widehat{\pi}_{a}-\pi_{a}\right\|.
\end{align*}

Similarly, one may get

\[
\left\|\varphi_{1,a}(\widehat{\eta})-\varphi_{1,a}(\eta)\right\|\lesssim\left\|\widehat{\mu}_{a}-\mu_{a}\right\|+\left\|\widehat{\pi}_{a}-\pi_{a}\right\|.
\]

Further, we showed in (B.1) that $\left\|\Pi_{C^{*},a}(\widehat{\mu})-\Pi_{C^{*},a}(\mu)\right\|=o_{\mathbb{P}}(1)$ if $\max_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}=o_{\mathbb{P}}(1)$.

Putting the three pieces together, we conclude that $\left\|\varphi_{C^{*}}(\widehat{\eta})-\varphi_{C^{*}}(\eta)\right\|=o_{\mathbb{P}}(1)$ under the consistency condition in Assumption A5. Hence, we obtain

\[
\mathbb{G}_{n}\left[\left\{\varphi_{C^{*}}(\widehat{\eta}_{-b})-\varphi_{C^{*}}(\eta)\right\}\mathbbm{1}(B=b)\right]=o_{\mathbb{P}}\left(\frac{1}{\sqrt{n}}\right),
\]

which follows by the sample splitting lemma (Kennedy et al. 2018, Lemma 2).

(ii) Noting that

\[
\left|\mathbb{P}\left[\left\{\varphi_{C^{*}}(\widehat{\eta}_{-b})-\varphi_{C^{*}}(\eta)\right\}\mathbbm{1}(B=b)\right]\right|\lesssim\left|\mathbb{P}\left\{\varphi_{C^{*}}(\widehat{\eta})-\varphi_{C^{*}}(\eta)\right\}\right|,
\]

by Lemma B.6 we get

\begin{align*}
\left|\mathbb{P}\left\{\varphi_{C^{*}}(\widehat{\eta})-\varphi_{C^{*}}(\eta)\right\}\right|
&\lesssim\max_{a}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|\left(\left\|\widehat{\mu}_{a}-\mu_{a}\right\|+\left\|\widehat{\pi}_{a}-\pi_{a}\right\|\right)+\max_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}^{\alpha+1} \\
&\quad+\frac{1}{\kappa}\max_{a}\left(\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\mathbb{P},1}\right),
\end{align*}

which is $o_{\mathbb{P}}\left(\frac{1}{\sqrt{n}}\right)$ by the given nonparametric condition $R_{2,n}=o_{\mathbb{P}}(n^{-1/2})$.

Finally, the desired result follows by Slutsky’s theorem. ∎

B.5 Proof of Corollary 4.3

Proof.

The proof follows exactly the same logic as that of Theorem 3.2. It boils down to showing $\sup_{C\in\mathcal{C}_{k}}\left|\mathbb{P}\left\{\varphi_{C}(\widehat{\eta})-\varphi_{C}(\eta)\right\}\right|=o_{\mathbb{P}}(1)$. This follows under the consistency condition in Assumption A5 since

\[
\sup_{C\in\mathcal{C}_{k}}\left|\mathbb{P}\left\{\varphi_{C}(\widehat{\eta})-\varphi_{C}(\eta)\right\}\right|\lesssim\max_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}
\]

due to Lemma B.7. ∎

B.6 Proof of Theorem 4.4

Proof.

The first order condition for a solution to the minimization problem (12) is given by $\mathbb{P}\left\{\nabla\varphi_{C^{*}}(Z;\eta)\right\}=\mathbb{P}\left\{\upvarphi_{C^{*}}(Z;\eta)\right\}=0$, where $\upvarphi$ is defined in (13). Also note that (12) is equivalent to minimizing $\widehat{R}(C)$ with

\[
\varphi_{C}(Z;\eta)=\sum_{a\in\mathcal{A}}\left\{\varphi_{1,a}^{2}(Z;\eta)-2\varphi_{1,a}(Z;\eta)\left[\Pi_{C}\left(\mu\right)\right]_{a}+\left[\Pi_{C}\left(\mu\right)\right]^{2}_{a}\right\}. \tag{A.12}
\]

We will proceed with (A.12) in the proof.
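As a short worked step, offered as a sketch rather than as part of the original argument: completing the square in (A.12) shows that the summand is a squared Euclidean distance. The display below assumes that $\Pi_{C}(\mu)$ denotes the center in $C$ nearest to $\mu$, i.e. $\Pi_{C}(\mu)=c_{d(\mu,C)}$, and writes $\varphi_{1}$ for the vector $(\varphi_{1,a})_{a\in\mathcal{A}}$. Differentiating formally with the cluster assignment held fixed then gives blocks of the same form as those appearing later in this proof, up to a constant factor.

```latex
\begin{align*}
\varphi_C(Z;\eta)
  &= \sum_{a\in\mathcal{A}} \left\{\varphi_{1,a}(Z;\eta) - \left[\Pi_C(\mu)\right]_a\right\}^2
   = \left\|\varphi_1(Z;\eta) - \Pi_C(\mu)\right\|_2^2 \\
  &= \sum_{j=1}^{k} \left\|\varphi_1(Z;\eta) - c_j\right\|_2^2 \, \mathbbm{1}\{j = d(\mu, C)\}, \\
\frac{\partial}{\partial c_j}\varphi_C(Z;\eta)
  &= 2\left(c_j - \varphi_1(Z;\eta)\right)\mathbbm{1}\{j = d(\mu, C)\}
  \quad \text{(assignment held fixed)}.
\end{align*}
```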

We use the logic that parallels the proof of Theorem 3 of Kennedy et al. (2023). By abuse of notation, we rewrite the empirical moment condition as

\begin{align}
o_{\mathbb{P}}\left(\frac{1}{\sqrt{n}}\right)
&=\mathbb{P}_{n}\left\{\upvarphi_{\widehat{C}}(Z;\widehat{\eta})\right\}-\mathbb{P}\left\{\upvarphi_{C^{*}}(Z;\eta)\right\} \notag \\
&=(\mathbb{P}_{n}-\mathbb{P})\left\{\upvarphi_{C^{*}}(Z;\eta)\right\}+(\mathbb{P}_{n}-\mathbb{P})\left\{\upvarphi_{\widehat{C}}(Z;\widehat{\eta})-\upvarphi_{C^{*}}(Z;\widehat{\eta})\right\} \tag{A.13} \\
&\quad+(\mathbb{P}_{n}-\mathbb{P})\left\{\upvarphi_{C^{*}}(Z;\widehat{\eta})-\upvarphi_{C^{*}}(Z;\eta)\right\} \tag{A.14} \\
&\quad+\mathbb{P}\left\{\upvarphi_{\widehat{C}}(Z;\widehat{\eta})-\upvarphi_{C^{*}}(Z;\widehat{\eta})\right\}+\mathbb{P}\left\{\upvarphi_{C^{*}}(Z;\widehat{\eta})-\upvarphi_{C^{*}}(Z;\eta)\right\}, \tag{A.15}
\end{align}

where the second equality is obtained by simply adding and subtracting terms. Note that the above represents a system of $kp$ equations. Here we omit the term $\mathbbm{1}(B=b)$ for simplicity. The terms in (A.13), (A.14), and (A.15) will be addressed sequentially.
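To make the system of $kp$ equations concrete, the following minimal sketch stacks the $k$ blocks of length $p$ and checks numerically that their empirical mean vanishes at a fixed point of Lloyd's algorithm. It is an illustration under stated assumptions, not the authors' implementation: each block is taken to have the form $(c_j-\upvarphi_1)\mathbbm{1}\{j=d(\mu,C)\}$ as in this proof (constant factors ignored), the data are simulated, the oracle choice $\upvarphi_1=\mu$ is used in place of an estimated pseudo-outcome, and the function names `assign`, `stacked_moment`, and `lloyd` are hypothetical.

```python
import numpy as np

def assign(mu, C):
    """Index d(mu, C) of the nearest center for each row of mu (n x p); C is k x p."""
    sq_dist = ((mu[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return sq_dist.argmin(axis=1)

def stacked_moment(phi1, mu, C):
    """Empirical mean of the k*p stacked blocks (c_j - phi1) * 1{j = d(mu, C)}."""
    d = assign(mu, C)
    blocks = [((C[j] - phi1) * (d == j)[:, None]).mean(axis=0) for j in range(len(C))]
    return np.concatenate(blocks)

def lloyd(mu, C, iters=100):
    """Plain Lloyd iterations; stops once the centers no longer move."""
    for _ in range(iters):
        d = assign(mu, C)
        new_C = np.array([mu[d == j].mean(axis=0) if np.any(d == j) else C[j]
                          for j in range(len(C))])
        converged = np.allclose(new_C, C)
        C = new_C
        if converged:
            break
    return C

# Simulated toy data (an assumption for illustration): two well-separated groups in R^p.
rng = np.random.default_rng(0)
k, p = 2, 3
mu = np.vstack([rng.normal(0.0, 1.0, size=(200, p)),
                rng.normal(10.0, 1.0, size=(200, p))])

C_hat = lloyd(mu, mu[[0, -1]].copy())    # one starting center from each group
moment = stacked_moment(mu, mu, C_hat)   # oracle pseudo-outcome: phi1 = mu
print(moment.shape, np.abs(moment).max())
```

At convergence each center equals the mean of the points assigned to it, which is exactly the statement that every block of the stacked empirical moment is zero.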

The first term in (A.13) is asymptotically multivariate Gaussian by the central limit theorem, and hence $O_{\mathbb{P}}(1/\sqrt{n})$. Also, under Assumption A1 and the boundedness condition $\left\|\mathbb{E}(Y^{2}\mid X)\right\|_{\infty}<\infty$, it is immediate that

\[
\left\|(c^{*}_{j}-\upvarphi_{1}(Z;\widehat{\eta}))\mathbbm{1}\{j=d(\widehat{\mu},C^{*})\}-(\widehat{c}_{j}-\upvarphi_{1}(Z;\widehat{\eta}))\mathbbm{1}\{j=d(\widehat{\mu},\widehat{C})\}\right\| \tag{A.16}
\]

is bounded for each $j=1,\ldots,k$. In the proof of Theorem 4.5, we shall show that the term (A.16) is indeed $o_{\mathbb{P}}(1)$. Thus, by Kennedy et al. (2018, Lemma 2), for the second term in (A.13), we get

\begin{align}
(\mathbb{P}_{n}-\mathbb{P})\left\{\upvarphi_{\widehat{C}}(Z;\widehat{\eta})-\upvarphi_{C^{*}}(Z;\widehat{\eta})\right\}
&=O_{\mathbb{P}}\left(\frac{\left\|\upvarphi_{\widehat{C}}(Z;\widehat{\eta})-\upvarphi_{C^{*}}(Z;\widehat{\eta})\right\|}{\sqrt{n}}\right) \notag \\
&=O_{\mathbb{P}}(1/\sqrt{n}). \tag{A.17}
\end{align}

Under the consistency condition in Assumption A5, the term in (A.14) is $o_{\mathbb{P}}\left(1/\sqrt{n}\right)$ by Kennedy et al. (2018, Lemma 2).
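The scaling used in the sample splitting lemma, namely that a centered empirical average of $\widehat{f}-f$ is of order $\|\widehat{f}-f\|/\sqrt{n}$ when $\widehat{f}$ is independent of the averaging sample, can be illustrated with a small simulation. This is only a sketch under toy assumptions that are not from the paper: $f(x)=x$, a fixed perturbed function $\widehat{f}(x)=x+\delta\sin(x)$ standing in for an estimate obtained from a separate fold, and $X\sim N(0,1)$. The helper `centered_term` is hypothetical.

```python
import numpy as np

def centered_term(n, delta, rng):
    """(P_n - P)(f_hat - f) for f(x) = x, f_hat(x) = x + delta * sin(x), X ~ N(0, 1)."""
    x = rng.standard_normal(n)
    diff = delta * np.sin(x)   # f_hat - f evaluated on the estimation fold
    # P(f_hat - f) = 0 here because sin is odd and X is symmetric about zero.
    return diff.mean()

rng = np.random.default_rng(0)
n, delta, reps = 500, 0.05, 2000
draws = np.array([centered_term(n, delta, rng) for _ in range(reps)])

# L2(P) norm of f_hat - f: delta * sqrt(E[sin^2 X]) = delta * sqrt((1 - exp(-2)) / 2).
l2_norm = delta * np.sqrt((1.0 - np.exp(-2.0)) / 2.0)

# If the term scales like ||f_hat - f|| / sqrt(n), this ratio should be close to 1.
ratio = draws.std() * np.sqrt(n) / l2_norm
print(round(float(ratio), 2))
```

In this toy setting the ratio stays near one for any fixed $\delta$, so letting $\|\widehat{f}-f\|$ shrink with $n$ makes the centered term of smaller order than $1/\sqrt{n}$, which is the role the consistency condition plays for (A.14).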

Next, we shall analyze the second term in (A.15). It suffices to analyze the $j$-th block of the derivative vector (13). By adding and subtracting terms, it is immediate to see that

\begin{align*}
&\mathbb{P}\left[\left(c_{j}-\upvarphi_{1}(Z;\widehat{\eta})\right)\mathbbm{1}\{j=d(\widehat{\mu},C^{*})\}-\left(c_{j}-\upvarphi_{1}(Z;\eta)\right)\mathbbm{1}\{j=d(\mu,C^{*})\}\right] \\
&=\mathbb{P}\left[\left\{\upvarphi_{1}(Z;\widehat{\eta})-\upvarphi_{1}(Z;\eta)\right\}\mathbbm{1}\{j=d(\widehat{\mu},C^{*})\}\right] \\
&\quad+\mathbb{P}\left[\left\{c_{j}-\upvarphi_{1}(Z;\eta)\right\}\left(\mathbbm{1}\{j=d(\widehat{\mu},C^{*})\}-\mathbbm{1}\{j=d(\mu,C^{*})\}\right)\right].
\end{align*}

The first term in the above display is bounded as

\[
\left|\mathbb{P}\left[\left\{\upvarphi_{1}(Z;\widehat{\eta})-\upvarphi_{1}(Z;\eta)\right\}\mathbbm{1}\{j=d(\widehat{\mu},C^{*})\}\right]\right|\lesssim\max_{a}\|\widehat{\mu}_{a}-\mu_{a}\|\|\widehat{\pi}_{a}-\pi_{a}\|\bm{1}_{(p)}
\]

(See Remark B.1). For the second term, first we notice that

\begin{align*}
&\mathbb{P}\left[\left\{c_{j}-\upvarphi_{1}(Z;\eta)\right\}\left(\mathbbm{1}\{j=d(\widehat{\mu},C^{*})\}-\mathbbm{1}\{j=d(\mu,C^{*})\}\right)\right] \\
&=\mathbb{P}\left[\left(c_{j}-\mu\right)\left(\mathbbm{1}\{j=d(\widehat{\mu},C^{*})\}-\mathbbm{1}\{j=d(\mu,C^{*})\}\right)\right].
\end{align*}

Next, letting $d\equiv d(\mu,C^{*})$ and $\widehat{d}\equiv d(\widehat{\mu},C^{*})$, we have

\begin{align*}
&\left|\mathbb{P}\left[\left(c^{*}_{j}-\mu\right)\left(\mathbbm{1}\{j=\widehat{d}\}-\mathbbm{1}\{j=d\}\right)\right]\right| \\
&\leq\mathbb{P}\left[\left\|c^{*}_{j}-\mu\right\|_{2}\bm{1}_{(p)}\left|\mathbbm{1}\{j=\widehat{d}\}-\mathbbm{1}\{j=d\}\right|\right] \\
&=\bm{1}_{(p)}\mathbb{P}\left[\mathbbm{1}\left\{\sqrt{f_{c^{*}_{d}}(\mu)}<\sqrt{f_{c^{*}_{\widehat{d}}}(\mu)}\right\}\left\{\sqrt{f_{c^{*}_{\widehat{d}}}(\mu)}-\sqrt{f_{c^{*}_{d}}(\mu)}\right\}\right],
\end{align*}

where $f_{c^{*}_{j}}(\mu)=\|\mu-c^{*}_{j}\|_{2}^{2}$. Hence, using the same logic that we used to obtain (B.3) and (B.3) in the proof of Lemma B.6, one may get

\begin{align*}
&\mathbb{P}\left[\mathbbm{1}\left\{\sqrt{f_{c^{*}_{d}}(\mu)}<\sqrt{f_{c^{*}_{\widehat{d}}}(\mu)}\right\}\left\{\sqrt{f_{c^{*}_{\widehat{d}}}(\mu)}-\sqrt{f_{c^{*}_{d}}(\mu)}\right\}\right] \\
&\lesssim\max_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}^{\alpha+1}+\frac{1}{\kappa}\max_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\mathbb{P},1}.
\end{align*}

Therefore the second term in (A.15) is bounded as

\begin{align*}
&\mathbb{P}\left\{\upvarphi_{C^{*}}(Z;\widehat{\eta})-\upvarphi_{C^{*}}(Z;\eta)\right\} \\
&\lesssim\left(\max_{a}\|\widehat{\mu}_{a}-\mu_{a}\|\|\widehat{\pi}_{a}-\pi_{a}\|+\max_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}^{\alpha+1}+\frac{1}{\kappa}\max_{a}\|\widehat{\mu}_{a}-\mu_{a}\|_{\infty}\left\|\widehat{\mu}_{a}-\mu_{a}\right\|_{\mathbb{P},1}\right)\bm{1}_{(p)}.
\end{align*}

Finally, we tackle the first term in (A.15). Recall that the ‘Hessian’ matrix of $\mathbb{P}\left\{\upvarphi_{C^{*}}(Z;\eta)\right\}$ is computed by

\[
\frac{\partial}{\partial C}\mathbb{P}\left\{\upvarphi_{C}(Z;\eta)\right\}\Big|_{C=C^{*}}=2\,\text{diag}\left(\bm{1}_{(p)}p^{*}_{1},\ldots,\bm{1}_{(p)}p^{*}_{k}\right)\equiv M(C^{*},\eta),
\]

where $p^{*}_{j}=\mathbb{P}(j=d)$. By the given condition that each $p^{*}_{j}>0$, the matrix $M(C^{*},\eta)$ is nonsingular. Also we have $\widehat{C}\xrightarrow{p}C^{*}$ by Corollary 4.3. Hence by Taylor’s theorem, we get the linear approximation

\begin{align*}
\mathbb{P}\left\{\upvarphi_{\widehat{C}}(Z;\widehat{\eta})-\upvarphi_{C^{*}}(Z;\widehat{\eta})\right\}
&=M(C^{*},\widehat{\eta})(\widehat{C}-C^{*})+o_{\mathbb{P}}\left(\|\widehat{C}-C^{*}\|_{1}\right) \\
&=M(C^{*},\eta)(\widehat{C}-C^{*})+o_{\mathbb{P}}\left(\|\widehat{C}-C^{*}\|_{1}\right),
\end{align*}

where the last equality follows by virtue of the fact that |(j=d^)(j=d)|=o(1)𝑗^𝑑𝑗𝑑subscript𝑜1|\mathbb{P}(j=\widehat{d})-\mathbb{P}(j=d)|=o_{\mathbb{P}}(1)| blackboard_P ( italic_j = over^ start_ARG italic_d end_ARG ) - blackboard_P ( italic_j = italic_d ) | = italic_o start_POSTSUBSCRIPT blackboard_P end_POSTSUBSCRIPT ( 1 ) under the consistency condition in Assumption A5. Putting this back into the original empirical moment condition, together with the other results, we have

\begin{align*}
o_{\mathbb{P}}\left(\frac{1}{\sqrt{n}}\right)
&=(\mathbb{P}_{n}-\mathbb{P})\left\{\upvarphi_{C^{*}}(Z;\eta)\right\}+M(C^{*},\eta)(\widehat{C}-C^{*})+R_{2,n}\bm{1}_{(p)}+O_{\mathbb{P}}\left(\frac{1}{\sqrt{n}}\right)+o_{\mathbb{P}}\left(\|\widehat{C}-C^{*}\|_{1}\right),
\end{align*}

or equivalently,

\begin{align*}
\widehat{C}-C^{*}
&=-M(C^{*},\eta)^{-1}(\mathbb{P}_{n}-\mathbb{P})\left\{\upvarphi_{C^{*}}(Z;\eta)\right\}+O_{\mathbb{P}}(R_{2,n})+o_{\mathbb{P}}\left(\|\widehat{C}-C^{*}\|_{1}\right)+O_{\mathbb{P}}\left(\frac{1}{\sqrt{n}}\right)\\
&=O_{\mathbb{P}}\left(\frac{1}{\sqrt{n}}+R_{2,n}\right)+o_{\mathbb{P}}\left(\|\widehat{C}-C^{*}\|_{1}\right)
\end{align*}

by the nonsingularity of $M(C^{*},\eta)$. This implies

\begin{align*}
\|\widehat{C}-C^{*}\|_{1}\left(1+o_{\mathbb{P}}(1)\right)=O_{\mathbb{P}}\left(\frac{1}{\sqrt{n}}+R_{2,n}\right),
\end{align*}

so that

\[
\|\widehat{C}-C^{*}\|_{1}=O_{\mathbb{P}}\left(\frac{1}{\sqrt{n}}+R_{2,n}\right).
\]
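For readers who want the intermediate step made explicit, the following is a short sketch of how the rate is extracted from the preceding expansion. It uses only the displays above: take the $\ell_1$ norm of both sides and bound the norm of the remainder $o_{\mathbb{P}}(\|\widehat{C}-C^{*}\|_{1})$ by $o_{\mathbb{P}}(1)\,\|\widehat{C}-C^{*}\|_{1}$.

```latex
\begin{align*}
\|\widehat{C}-C^{*}\|_{1}
  &\le O_{\mathbb{P}}\left(\frac{1}{\sqrt{n}}+R_{2,n}\right)
     + o_{\mathbb{P}}(1)\,\|\widehat{C}-C^{*}\|_{1} \\
\Longrightarrow\quad
\left\{1-o_{\mathbb{P}}(1)\right\}\|\widehat{C}-C^{*}\|_{1}
  &\le O_{\mathbb{P}}\left(\frac{1}{\sqrt{n}}+R_{2,n}\right).
\end{align*}
```

Because the factor $1-o_{\mathbb{P}}(1)$ exceeds $1/2$ with probability tending to one, dividing by it changes the right-hand side by at most a factor of two on that event, which leaves the $O_{\mathbb{P}}$ order unchanged.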

Next, by Pollard (1982, Lemma A), under Assumption A1, the map $C^{*}\mapsto\mathbb{P}\{f_{C^{*}}(\cdot)\}$ is differentiable with derivative $\gamma_{C^{*}}(\cdot)$, which leads to the following first-order approximation:

\begin{align*}
R(\widehat{C})-R(C^{*})
&=\mathbb{P}\left\{f_{\widehat{C}}(\mu)-f_{C^{*}}(\mu)\right\}\\
&=(\widehat{C}-C^{*})^{\top}\gamma_{C^{*}}(\mu)+o_{\mathbb{P}}\left(\left\|\widehat{C}-C^{*}\right\|_{1}\right).
\end{align*}

The linear term must vanish because $C=C^{*}$ minimizes $R(C)=\mathbb{P}\{f_{C}(\mu)\}$. Consequently, we have

\begin{align*}
R(\widehat{C})-R(C^{*})
&=o_{\mathbb{P}}\left(\left\|\widehat{C}-C^{*}\right\|_{1}\right)\\
&=o_{\mathbb{P}}\left(O_{\mathbb{P}}\left(R_{2,n}+\frac{1}{\sqrt{n}}\right)\right)\\
&=o_{\mathbb{P}}\left(R_{2,n}+\frac{1}{\sqrt{n}}\right).
\end{align*}
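The last equality is an instance of the standard stochastic-order rule $o_{\mathbb{P}}(O_{\mathbb{P}}(a_n))=o_{\mathbb{P}}(a_n)$. As a brief sketch, with $a_n$, $B_n$ and $\varepsilon_n$ being notation introduced here for illustration only, write $a_n = R_{2,n}+1/\sqrt{n}$ and factor the two rates:

```latex
\begin{align*}
\|\widehat{C}-C^{*}\|_{1} &= a_n B_n, \qquad B_n = O_{\mathbb{P}}(1), \\
R(\widehat{C})-R(C^{*}) &= \varepsilon_n \|\widehat{C}-C^{*}\|_{1}, \qquad \varepsilon_n = o_{\mathbb{P}}(1), \\
\Longrightarrow\quad R(\widehat{C})-R(C^{*}) &= a_n\,(\varepsilon_n B_n), \qquad \varepsilon_n B_n = o_{\mathbb{P}}(1).
\end{align*}
```

The product $\varepsilon_n B_n$ is $o_{\mathbb{P}}(1)$ because a sequence converging to zero in probability multiplied by a sequence bounded in probability still converges to zero in probability.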

B.7 Proof of Theorem 4.5

Proof.

First, we argue that the function class $\mathcal{F}_{C}=\{\upvarphi_{C}(\cdot;\bar{\eta}):C\in\mathcal{C}_{k}\}$ is Donsker for any fixed $\bar{\eta}$, if $\mathbb{P}(\mu\in N_{C}(\kappa/2))=0$. This follows by noticing that $\mathcal{F}_{C^{*}}$ consists of sums of locally Lipschitz functions with non-overlapping, compact supports, each region defined with the indicator $\mathbbm{1}\{j=d(\bar{\mu},C^{*})\}$, $j=1,\ldots,k$, and so has a finite bracketing integral.

Next, recall the empirical moment condition in the proof of Theorem 4.4. For the second term in (A.13), note that one may rewrite (A.16) as

\begin{align*}
&\left\|(c^{*}_{j}-\upvarphi_{1}(Z;\widehat{\eta}))\mathbbm{1}\{j=d(\widehat{\mu},C^{*})\}-(\widehat{c}_{j}-\upvarphi_{1}(Z;\widehat{\eta}))\mathbbm{1}\{j=d(\widehat{\mu},\widehat{C})\}\right\| \tag{A.18}\\
&=\big\|\left\{\widehat{c}_{j}-\upvarphi_{1}(Z;\widehat{\eta})\right\}\left[\mathbbm{1}\{j=d(\widehat{\mu},C^{*})\}-\mathbbm{1}\{j=d(\widehat{\mu},\widehat{C})\}\right]\\
&\qquad+(c^{*}_{j}-\widehat{c}_{j})\mathbbm{1}\{j=d(\widehat{\mu},C^{*})\}\big\|.
\end{align*}

Noting the following notational equivalence

\begin{align*}
\mathbbm{1}\left\{\zeta_{j}(\widehat{\mu};\widehat{C})>0\right\}&=\mathbbm{1}\{j=d(\widehat{\mu},\widehat{C})\}\\
\mathbbm{1}\left\{\zeta_{j}(\widehat{\mu};C^{*})>0\right\}&=\mathbbm{1}\{j=d(\widehat{\mu},C^{*})\},
\end{align*}

and letting $\zeta_{j}^{*}\equiv\zeta_{j}(\widehat{\mu};C^{*})$ and $\widehat{\zeta}_{j}\equiv\zeta_{j}(\widehat{\mu};\widehat{C})$, similarly as in the proof of Lemma B.6, one may show that for any $j\in\{1,\ldots,k\}$, under the margin condition with any $\kappa>0$, $\alpha>0$,

\begin{align*}
&\mathbb{P}\left\{\left|\mathbbm{1}\left\{\widehat{\zeta}_{j}>0\right\}-\mathbbm{1}\left\{\zeta_{j}^{*}>0\right\}\right|\right\}\\
&=\mathbb{P}\left(\left|\mathbbm{1}\left\{\widehat{\zeta}_{j}>0\right\}-\mathbbm{1}\left\{\zeta_{j}^{*}>0\right\}\right|\left[\mathbbm{1}\left\{\left|\widehat{\zeta}_{j}-\zeta_{j}^{*}\right|\leq\kappa\right\}+\mathbbm{1}\left\{\left|\widehat{\zeta}_{j}-\zeta_{j}^{*}\right|>\kappa\right\}\right]\right)\\
&\leq\mathbb{P}\left[\mathbb{P}\left\{\left|\zeta_{j}^{*}\right|\leq\left|\widehat{\zeta}_{j}-\zeta_{j}^{*}\right|\Bigm{|}\left|\widehat{\zeta}_{j}-\zeta_{j}^{*}\right|\leq\kappa\right\}\mathbbm{1}\left\{\left|\widehat{\zeta}_{j}-\zeta_{j}^{*}\right|\leq\kappa\right\}\right]\\
&\quad+\mathbb{P}\left[\mathbb{P}\left\{\left|\mathbbm{1}\left\{\widehat{\zeta}_{j}>0\right\}-\mathbbm{1}\left\{\zeta_{j}^{*}>0\right\}\right|\Bigm{|}\left|\widehat{\zeta}_{j}-\zeta_{j}^{*}\right|>\kappa\right\}\mathbbm{1}\left\{\left|\widehat{\zeta}_{j}-\zeta_{j}^{*}\right|>\kappa\right\}\right]\\
&\lesssim\left\|\widehat{\zeta}_{j}-\zeta_{j}^{*}\right\|_{\infty}^{\alpha}+\left\|\mathbbm{1}\left\{\widehat{\zeta}_{j}>0\right\}-\mathbbm{1}\left\{\zeta_{j}^{*}>0\right\}\right\|_{\infty}\left\|\widehat{\zeta}_{j}-\zeta_{j}^{*}\right\|_{\mathbb{P},1}\\
&\lesssim\left\|\widehat{c}_{j}-c^{*}_{j}\right\|_{1}^{\alpha}+\left\|\mathbbm{1}\left\{\widehat{\zeta}_{j}>0\right\}-\mathbbm{1}\left\{\zeta_{j}^{*}>0\right\}\right\|_{\infty}\left\|\widehat{c}_{j}-c^{*}_{j}\right\|_{1}, \tag{A.19}
\end{align*}

where the last inequality follows by the fact that the function $\zeta_{j}(\cdot,C)$ is locally Lipschitz at $C$. Hence, by Corollary 4.3 as well as the boundedness condition $\left\|\mathbb{E}(Y^{2}\mid X)\right\|_{\infty}<\infty$, from (A.19) it follows that

\begin{align*}
\left\|\left\{\widehat{c}_{j}-\upvarphi_{1}(Z;\widehat{\eta})\right\}\left[\mathbbm{1}\{j=d(\widehat{\mu},C^{*})\}-\mathbbm{1}\{j=d(\widehat{\mu},\widehat{C})\}\right]\right\|=o_{\mathbb{P}}(1).
\end{align*}

Also, it is immediate that $\left\|(c^{*}_{j}-\widehat{c}_{j})\mathbbm{1}\{j=d(\widehat{\mu},C^{*})\}\right\|=o_{\mathbb{P}}(1)$ by Corollary 4.3. Hence, by the triangle inequality, the term (A.18) is $o_{\mathbb{P}}(1)$.

Now, consider the following identity

\begin{align*}
&(\mathbb{P}_{n}-\mathbb{P})\left\{\upvarphi_{\widehat{C}}(Z;\widehat{\eta})-\upvarphi_{C^{*}}(Z;\widehat{\eta})\right\}\\
&=(\mathbb{P}_{n}-\mathbb{P})\left\{\upvarphi_{\widehat{C}}(Z;\widehat{\eta})-\upvarphi_{C^{*}}(Z;\widehat{\eta})\right\}\mathbbm{1}\left\{\|\widehat{C}-C^{*}\|\leq\kappa/2\right\} \tag{A.20}\\
&\quad+(\mathbb{P}_{n}-\mathbb{P})\left\{\upvarphi_{\widehat{C}}(Z;\widehat{\eta})-\upvarphi_{C^{*}}(Z;\widehat{\eta})\right\}\mathbbm{1}\left\{\|\widehat{C}-C^{*}\|>\kappa/2\right\}. \tag{A.21}
\end{align*}

Under the strong margin condition with some $\kappa>0$ and $\alpha=\infty$, by Lemma 19.24 of Van der Vaart (2000), we conclude that the term in (A.20) is $o_{\mathbb{P}}\left(1/\sqrt{n}\right)$. The term in (A.21) is $o_{\mathbb{P}}\left(1/\sqrt{n}\right)$ as well, since when $\left\|\mathbb{E}(Y^{2}\mid X)\right\|_{\infty}<\infty$, (A.17) and Corollary 4.3 imply

\begin{align*}
(\mathbb{P}_{n}-\mathbb{P})\left\{\upvarphi_{\widehat{C}}(Z;\widehat{\eta})-\upvarphi_{C^{*}}(Z;\widehat{\eta})\right\}\mathbbm{1}\left\{\|\widehat{C}-C^{*}\|>\kappa/2\right\}
&=O_{\mathbb{P}}\left(1/\sqrt{n}\right)o_{\mathbb{P}}\left(1\right)\\
&=o_{\mathbb{P}}\left(1/\sqrt{n}\right).
\end{align*}

Applying this to the original moment condition, together with the other results in Section B.6, we obtain

\begin{align*}
\widehat{C}-C^{*}=-M(C^{*},\eta)^{-1}(\mathbb{P}_{n}-\mathbb{P})\left\{\upvarphi_{C^{*}}(Z;\eta)\right\}+O_{\mathbb{P}}(R_{2,n})+o_{\mathbb{P}}\left(\|\widehat{C}-C^{*}\|_{1}\right)+o_{\mathbb{P}}\left(\frac{1}{\sqrt{n}}\right). \tag{A.22}
\end{align*}

Substituting the result of Theorem 4.4 into (A.22) finally yields

\begin{align*}
\widehat{C}-C^{*}=-M(C^{*},\eta)^{-1}(\mathbb{P}_{n}-\mathbb{P})\left\{\upvarphi_{C^{*}}(Z;\eta)\right\}+O_{\mathbb{P}}(R_{2,n})+o_{\mathbb{P}}\left(\frac{1}{\sqrt{n}}\right).
\end{align*}
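To indicate how this expansion is typically used, the following sketch shows the standard route from it to a limiting distribution. It rests on two assumptions that are stated here for illustration and are not taken from the display itself: that $R_{2,n}=o_{\mathbb{P}}(n^{-1/2})$, and that $\upvarphi_{C^{*}}(Z;\eta)$ has a finite covariance matrix, denoted $\Sigma$ (a symbol introduced here). The precise conditions and conclusion are those of Theorem 4.5 in the main text.

```latex
\begin{align*}
\sqrt{n}\,(\widehat{C}-C^{*})
  &= -M(C^{*},\eta)^{-1}\,\sqrt{n}\,(\mathbb{P}_{n}-\mathbb{P})\left\{\upvarphi_{C^{*}}(Z;\eta)\right\}
     + o_{\mathbb{P}}(1) \\
  &\rightsquigarrow N\left(0,\; M(C^{*},\eta)^{-1}\,\Sigma\,M(C^{*},\eta)^{-\top}\right),
  \qquad \Sigma = \operatorname{Cov}\left\{\upvarphi_{C^{*}}(Z;\eta)\right\}.
\end{align*}
```

The first line multiplies the expansion by $\sqrt{n}$ and absorbs $\sqrt{n}\,O_{\mathbb{P}}(R_{2,n})$ into $o_{\mathbb{P}}(1)$ under the assumed rate. The second line applies the multivariate central limit theorem to the centered empirical average, together with Slutsky's theorem for the remainder.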