Causal K-Means Clustering
Acknowledgments: We thank Larry Wasserman for helpful discussions and comments. A part of this work was done while Kwangho Kim was a PhD student at Carnegie Mellon University.
Abstract
Causal effects are often characterized with population summaries. These might provide an incomplete picture when there are heterogeneous treatment effects across subgroups. Since the subgroup structure is typically unknown, it is more challenging to identify and evaluate subgroup effects than population effects. We propose a new solution to this problem: Causal k-Means Clustering, which harnesses the widely-used k-means clustering algorithm to uncover the unknown subgroup structure. Our problem differs significantly from the conventional clustering setup since the variables to be clustered are unknown counterfactual functions. We present a plug-in estimator which is simple and readily implementable using off-the-shelf algorithms, and study its rate of convergence. We also develop a new bias-corrected estimator based on nonparametric efficiency theory and double machine learning, and show that this estimator achieves fast root-n rates and asymptotic normality in large nonparametric models. Our proposed methods are especially useful for modern outcome-wide studies with multiple treatment levels. Further, our framework is extensible to clustering with generic pseudo-outcomes, such as partially observed outcomes or otherwise unknown functions. Finally, we explore finite sample properties via simulation, and illustrate the proposed methods in a study of treatment programs for adolescent substance abuse.
Keywords: Causal inference; Heterogeneous treatment effect; Personalization; Subgroup analysis; Observational studies
1 Introduction
1.1 Heterogeneity in Treatment Effects
Statistical causal inference is concerned with estimating what would happen to some response when a “cause” of interest is changed or intervened upon. The average treatment effect (ATE) is one of the most commonly targeted effects. For a binary treatment $A \in \{0, 1\}$, the ATE is defined by
$$\mathbb{E}(Y^{1} - Y^{0}), \tag{1}$$
where $Y^{a}$ is the potential outcome that would have been observed under treatment $A = a$ (Rubin 1974). There is a large body of work on efficient and flexible estimation of the ATE and its analogs (Van der Laan et al. 2003; Chernozhukov et al. 2016; Kennedy 2022).
However, the effect of treatment often varies across subgroups, in both magnitude and direction. Certain subgroups may experience larger effects than others, and a treatment could even benefit some subgroups while harming others. A potential shortcoming of the ATE is that it can mask this effect heterogeneity. Identifying treatment effect heterogeneity and the corresponding subgroups plays an essential role in a variety of fields, including policy evaluation, drug development, and health care, and has sparked growing interest. For example, patients with different subtypes of cancer often react differently to the same treatment; however, our understanding of cancer subtypes at the molecular level is limited, and there is little consensus about which treatments are most effective for which patients (Kravitz et al. 2004; Hayden 2009). Typically, the functional form of the relationship between treatment effects and unit attributes is unknown a priori, so such effect heterogeneity has to be explored using data-driven methods. Despite much recent work in this area, many problems remain unsolved, and the topic has not been studied as extensively as other branches of causal inference (Kennedy 2023).
To better understand treatment effect heterogeneity, investigators often aim to estimate the conditional average treatment effect (CATE):
$$\mathbb{E}(Y^{1} - Y^{0} \mid X), \tag{2}$$
where $X$ is a vector of observed covariates. The CATE offers the potential to personalize causal effects by making them specific to each individual’s characteristics. Many methods have been proposed for CATE estimation, with a focus in recent years on leveraging the benefits of machine learning. For example, van der Laan & Luedtke (2015) developed a framework for loss-based super learning. Athey & Imbens (2016) and Zhang et al. (2017) proposed recursive partitioning approaches. Foster et al. (2011) and Wager & Athey (2018) adopted random forests, and Imai et al. (2013) adopted support vector machine classifiers. Grimmer et al. (2017) proposed a weighted ensemble approach. Shalit et al. (2017) developed a neural network architecture based on integral probability metrics. Künzel et al. (2017) presented a meta-algorithm with a particular focus on unbalanced designs. Nie & Wager (2021) gave a novel adaptation of RKHS regression methods and studied conditions for oracle efficiency. Kennedy (2023) provided generic model-free error bounds and presented an algorithm achieving the fastest possible convergence rates under smoothness assumptions.
1.2 Understanding Heterogeneity via Cluster Analysis
In contrast to earlier work, which has focused on supervised learning methodologies, we consider analyzing treatment effect heterogeneity from an unsupervised learning perspective. We develop Causal Clustering, a new technique for exploring heterogeneous treatment effects leveraging tools from cluster analysis. We aim to understand the structure of effect heterogeneity by identifying underlying subgroups as clusters. Our work is therefore more descriptive and discovery-based, and fills an important gap in the literature.
We illustrate the idea of causal clustering through the case of binary treatments in Figure LABEL:fig:causal-cluster-illustration. We generate a sample where a projection of each observation is drawn from a mixture of six Gaussian distributions with different means and covariance functions, with the overall ATE set to zero. By construction, there are six clusters, with units within each cluster being more homogeneous in terms of the CATE. When it comes to analyzing the heterogeneity of treatment effects, people often rely on the histogram of the CATE as in Figure LABEL:fig:causal-cluster-illustration-(c). However, in this case, the histogram fails to reveal the details about the true subgroup structure. By adapting the idea of cluster analysis, we aim to uncover clusters with markedly different responses to a given treatment than the rest, while maintaining a high degree of homogeneity within each cluster, as shown in Figure LABEL:fig:causal-cluster-illustration-(b). This allows for an interesting new study of subgroup structure; as far as we know, clustering methods have yet to be employed in causal inference or heterogeneous effects problems.
Our problem differs significantly from the conventional clustering setup since the variable to be clustered consists of unknown functions (i.e., potential outcome regression functions) that must be estimated. Clustering with these unknown “pseudo-outcomes” has not received as much attention as clustering on standard fully observed data, despite its importance. Some previous work considered cluster analysis using partially observed outcomes, yet still in a vector form with fixed dimensions. For example, Serafini et al. (2020) explored missing data problems in clustering, and Haviland et al. (2011) studied group-based trajectory modeling with non-random dropouts. Su et al. (2018) considered clustering with measurement errors. In a similar context, Kumar & Patel (2007) considered clustering on unknown model parameters, though without theoretical analysis. To the best of our knowledge, none of the existing methods in clustering literature have considered nonparametric approaches to clustering with unknown functions. In our analysis, we show that if the nuisance estimation error with respect to those unknown functionals is sufficiently small, then the excess clustering risk is near zero. In this sense, our work is in a similar spirit to the classification versus regression distinction in statistical learning (Devroye et al. 2013, Theorem 2.2).
In addition to the existing supervised-learning based approaches, our framework offers a complementary tool for identifying subgroups that substantially differ from each other. Our proposed methods are particularly useful in outcome-wide studies with multiple treatment levels (VanderWeele 2017; VanderWeele et al. 2016); instead of probing a high-dimensional CATE surface to assess the subgroup structure, one may attempt to uncover lower-dimensional clusters with similar responses to a given treatment set.
The remainder of the paper is structured as follows. In Section 2, we formalize the idea of causal clustering based on the k-means algorithm. In Section 3, we present a plug-in estimator, which is simple and readily implementable yet will in general not be $\sqrt{n}$-consistent. In Section 4, we develop an efficient bias-corrected estimator for k-means causal clustering under a margin condition, which attains fast rates and asymptotic normality under weak nonparametric conditions. In Section 5, we illustrate our approach using simulations and real data on the effects of treatment programs for substance abuse. Section 6 concludes with a discussion.
2 Setup and estimands
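Consider a random sample $(Z_1, \dots, Z_n)$ of $n$ tuples $Z = (Y, A, X) \sim \mathbb{P}$, where $Y \in \mathbb{R}$ represents the outcome, $A \in \mathcal{A} = \{1, \dots, p\}$ denotes an intervention, and $X \in \mathcal{X} \subseteq \mathbb{R}^{d}$ comprises observed covariates. For simplicity, we focus on univariate outcomes, although our methodology can be easily extended to multivariate outcomes. Throughout, we rely on the following widely-used identification assumptions (e.g., Imbens & Rubin 2015, Chapter 12):
Assumption C1 (consistency).
$Y = Y^{a}$ if $A = a$.
Assumption C2 (no unmeasured confounding).
$A \perp\!\!\!\perp Y^{a} \mid X$.
Assumption C3 (positivity).
$\mathbb{P}(A = a \mid X)$ is bounded away from 0 a.s. $[\mathbb{P}]$.
For $a \in \mathcal{A}$, let the outcome regression function be denoted by
$$\mu_a(X) = \mathbb{E}(Y \mid X, A = a).$$
For $a, a' \in \mathcal{A}$, one may define the pairwise CATE by
$$\tau_{aa'}(X) = \mathbb{E}(Y^{a} - Y^{a'} \mid X) = \mu_a(X) - \mu_{a'}(X). \tag{3}$$
Then, we define the conditional counterfactual mean vector $\mu : \mathcal{X} \to \mathbb{R}^{p}$ as
$$\mu(X) = \left(\mu_1(X), \dots, \mu_p(X)\right)^{\top}. \tag{4}$$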
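If all coordinates of a point $\mu(X)$ were the same, there would be no treatment effect on the conditional mean scale. Also, adjacent units in the conditional counterfactual mean vector space would have similar responses to a given set of treatments, since for two units $i, j$ and any $a, a' \in \mathcal{A}$,
$$\left|\tau_{aa'}(X_i) - \tau_{aa'}(X_j)\right| \le \sqrt{2}\,\left\|\mu(X_i) - \mu(X_j)\right\|_2.$$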
This provides vital motivation for uncovering subgroup structure via cluster analysis on projections of a sample onto the conditional counterfactual mean vector space. Crucially, standard clustering theory is limited here since the variable to be clustered is $\mu$, a collection of unknown regression functions, which themselves have to be estimated.
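As a rough illustration of how the conditional counterfactual mean vector in (4) might be assembled in practice, the sketch below fits one outcome regression per treatment level and stacks the predictions column by column. The linear least-squares fits, the simulated data, the level coding 0, 1, 2, and the names `fit_linear`, `predict_linear`, and `counterfactual_mean_vectors` are all assumptions made for this example; any flexible regression method could take their place.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Evaluate a fitted linear regression at new covariate rows."""
    return np.column_stack([np.ones(len(X)), X]) @ coef

def counterfactual_mean_vectors(X, A, Y, levels):
    """Return an (n, p) array; the column for level a estimates E[Y | X, A = a]."""
    columns = []
    for a in levels:
        mask = A == a
        coef = fit_linear(X[mask], Y[mask])      # fitted within arm a only
        columns.append(predict_linear(coef, X))  # evaluated at every unit's covariates
    return np.column_stack(columns)

# Simulated data with three hypothetical treatment levels (illustration only).
rng = np.random.default_rng(0)
n = 600
X = rng.normal(size=(n, 2))
A = rng.integers(0, 3, size=n)
true_mu = np.column_stack([X[:, 0], 2 * X[:, 0] + 1, -X[:, 1]])
Y = true_mu[np.arange(n), A] + 0.1 * rng.normal(size=n)

mu_hat = counterfactual_mean_vectors(X, A, Y, levels=[0, 1, 2])
tau_10 = mu_hat[:, 1] - mu_hat[:, 0]  # pairwise CATE estimate, level 1 versus level 0
```

Each row of `mu_hat` is one unit's estimated point in the conditional counterfactual mean vector space, and differences between columns give the pairwise CATE estimates.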
In this work, we propose a novel k-means causal clustering. k-means (also known as vector quantization) is one of the oldest and most popular clustering algorithms, having originated in signal processing. It works by finding $k$ representative points (or cluster centers), which define a Voronoi tessellation. There has been a substantial amount of research on k-means clustering (for a review, see Jain (2010) or the monograph of Graf & Luschgy (2007)). It is one of the few clustering methods whose theoretical properties are rather well understood, as the analysis is relatable to principal components analysis (Ding & He 2004).
We call a set of $k$ representative points a codebook $C = \{c_1, \dots, c_k\}$, where each $c_j \in \mathbb{R}^{p}$. Let $\Pi_C(\mu)$ be the projection of $\mu$ onto $C$:
$$\Pi_C(\mu) = \arg\min_{c \in C} \left\|c - \mu\right\|_2^2.$$
Then we define the population clustering risk with respect to $C$ by
$$R(C) = \mathbb{E}\left\|\mu - \Pi_C(\mu)\right\|_2^2, \tag{5}$$
and the corresponding optimal codebook by
$$C^* = \arg\min_{C \in \mathcal{C}_k} R(C), \tag{6}$$
where $\mathcal{C}_k$ denotes all codebooks of length $k$ in the image of $\mu$ defined in (4). When $C$ is fixed, the population clustering risk (5) can be viewed as a real-valued functional on a nonparametric model. Importantly, $R(C)$ is a non-smooth functional of the observed data distribution, so the standard semiparametric efficiency theory does not immediately apply. In Section 4, we shall propose an efficient estimator for $R(C)$ under a margin condition.
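To make the projection and the clustering risk concrete, the following sketch computes the nearest-codebook projection and the sample analogue of the risk in (5) for fully known points. The function names `project` and `clustering_risk` and the toy points are assumptions introduced for illustration; in the causal setting the points would be (estimated) conditional counterfactual mean vectors.

```python
import numpy as np

def project(points, codebook):
    """Map each row of `points` to its nearest codebook row (Euclidean distance)."""
    d2 = ((points[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)  # shape (n, k)
    labels = d2.argmin(axis=1)
    return codebook[labels], labels

def clustering_risk(points, codebook):
    """Sample average of the squared distance to the nearest codebook point."""
    projected, _ = project(points, codebook)
    return float(((points - projected) ** 2).sum(axis=1).mean())

points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
good = np.array([[0.0, 1.0], [10.0, 1.0]])  # one center per visible group
bad = np.array([[5.0, 0.0], [5.0, 2.0]])    # centers placed between the groups

risk_good = clustering_risk(points, good)   # every point is at distance 1 from its center
risk_bad = clustering_risk(points, bad)     # every point is at distance 5 from its center
```

The codebook with the smaller risk is the better summary of the points, which is the criterion that the optimal codebook in (6) minimizes at the population level.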
The conditional counterfactual mean vector in (4) can be easily tailored for a specific use through reparametrization without compromising our subsequent results. With , for instance, one may consider with untreated and as a baseline risk instead of . This may be more useful for exploring the relationship between the baseline risk and the treatment effect as illustrated in Figure LABEL:fig:alternative-parametrization. As has been shown in the literature of heterogeneous treatment effects, the difference in regression functions may be more structured and simple than the individual components (e.g., Chernozhukov et al. 2018; Kennedy 2023). Some parametrizations might help harness this nontrivial structure (e.g., smoothness or sparsity) of each CATE function. For example, clustering on could be easier than clustering on , when we are less concerned with the baseline risk. If we are interested in how a treatment shifts the quantiles (e.g. Chernozhukov & Hansen 2005; Zhang et al. 2012), we can redefine our conditional counterfactual mean vector by for some prespecified (for median, ), where is the quantile function of our potential outcome , i.e., for .
In the sequel, we use the shorthand and . We let denote norm for any fixed vector . For a given function , we use the notation
as the -norm of . Also, we let denote the conditional expectation given the sample operator , as in . Notice that is random only if depends on samples, in which case . Otherwise and can be used exchangeably. For example, if is constructed on a separate (training) sample , then for a new observation . We let denote the empirical measure as in . Lastly, we use the shorthand to denote for some universal constant .
3 Plug-in Estimator
Suppose the $\mu_a$ are all known. In this case, the optimal codebook $C^*$ can be estimated by computing a minimizer of the empirical clustering risk, just as in standard k-means clustering:
$$\widetilde{C} = \arg\min_{C \in \mathcal{C}_k} \frac{1}{n}\sum_{i=1}^{n} \left\|\mu(X_i) - \Pi_C(\mu(X_i))\right\|_2^2. \tag{7}$$
The common method used to find $\widetilde{C}$ is known as Lloyd’s algorithm (Lloyd 1982; Kanungo et al. 2002), yet there are other recent developments as well (Leskovec et al. 2020). The solution returned by such algorithms normally depends on the starting values. Some popular methods for choosing good starting values are discussed in, for example, Tseng & Wong (2005) and Arthur & Vassilvitskii (2007).
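A minimal sketch of Lloyd's algorithm with random restarts is given below, to illustrate the dependence on starting values mentioned above. It is a plain textbook version written for this example (the names `lloyd` and `lloyd_with_restarts` are hypothetical), with uniform random seeding rather than any of the cited seeding schemes.

```python
import numpy as np

def lloyd(points, k, n_iter=100, seed=0):
    """One run of Lloyd's algorithm from k randomly chosen starting points."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)               # assignment step
        new_centers = centers.copy()
        for j in range(k):                       # update step
            members = points[labels == j]
            if len(members) > 0:                 # keep the old center if a cell is empty
                new_centers[j] = members.mean(axis=0)
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1), float(d2.min(axis=1).mean())

def lloyd_with_restarts(points, k, n_restarts=5):
    """Run Lloyd's algorithm from several seeds and keep the lowest-risk solution."""
    runs = [lloyd(points, k, seed=s) for s in range(n_restarts)]
    return min(runs, key=lambda run: run[2])

rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])
centers, labels, risk = lloyd_with_restarts(points, k=2)
```

Keeping the run with the smallest empirical risk is one simple way to reduce sensitivity to initialization; more careful seeding is discussed in the references cited above.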
The problem of evaluating how good is, compared to the true , has been extensively studied. Pollard (1981) proved strong consistency of k-means clustering in the sense that as well as . Borrowing techniques from statistical learning theory, Linder et al. (1994) and Biau et al. (2008) showed that when an input vector is almost surely bounded, the expected excess risk may decay at and rates, respectively. More recently, it has been shown that faster or rates can be attained under a margin condition on the source distribution (Levrard 2015, 2018); we shall go over this margin condition in detail shortly.
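However, in our setting we cannot estimate $C^*$ via the empirical risk minimizer in (7) since we do not know each $\mu_a$. Instead, we propose the following plug-in estimator
$$\widehat{C} = \arg\min_{C \in \mathcal{C}_k} \frac{1}{n}\sum_{i=1}^{n} \left\|\hat{\mu}(X_i) - \Pi_C(\hat{\mu}(X_i))\right\|_2^2, \tag{8}$$
where $\hat{\mu} = (\hat{\mu}_1, \dots, \hat{\mu}_p)^{\top}$ is some initial estimator of the outcome regression functions. We will use sample splitting to avoid imposing empirical process conditions on the function class of $\hat{\mu}$ (Kennedy 2016, 2022). For now, we suppose that $\hat{\mu}$ is constructed on a separate, independent sample; this will be discussed in more detail in the following section.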
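The plug-in idea in (8) can be sketched as follows, assuming a single sample split: the outcome regressions are fitted on one half of the data, evaluated on the other half, and the resulting estimated vectors are clustered. The simulated data, the linear regressions, the deterministic farthest-point seeding, and the names `fit_predict_arm` and `kmeans` are all assumptions made for this illustration and are not taken from the paper.

```python
import numpy as np

def fit_predict_arm(X_train, Y_train, X_new):
    """Linear least-squares outcome regression for one arm, evaluated at X_new."""
    design = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(design, Y_train, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ coef

def kmeans(points, k, n_iter=50):
    """Lloyd iterations started from a deterministic farthest-point seeding."""
    centers = [points[0]]
    for _ in range(k - 1):
        d2 = ((points[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2)
        centers.append(points[d2.min(axis=1).argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)

rng = np.random.default_rng(0)
n = 2000
g = rng.integers(0, 2, size=n)               # subgroup indicator, observed as a covariate
X = np.column_stack([g, rng.normal(size=n)])
A = rng.integers(0, 2, size=n)               # binary treatment, randomized in this toy example
Y = 3.0 * g * A + 0.5 * rng.normal(size=n)   # mu_0(X) = 0 and mu_1(X) = 3 g

train = np.arange(n) < n // 2                # first half: nuisance estimation
evalu = ~train                               # second half: clustering

mu_hat = np.column_stack([
    fit_predict_arm(X[train & (A == a)], Y[train & (A == a)], X[evalu])
    for a in (0, 1)
])
codebook_hat, labels = kmeans(mu_hat, k=2)
```

In this toy design the two estimated cluster centers should land near $(0, 0)$ and $(0, 3)$, the two values taken by the true vector $(\mu_0, \mu_1)$.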
Due to the non-smoothness of the projection function $\Pi_C$, in general we would not expect the proposed plug-in estimator (8) to inherit the rate of convergence of $\hat{\mu}$. To resolve this, we shall assume that the source distribution is concentrated around $C^*$, in a similar spirit to Levrard (2015, 2018).
In the sequel, the set of minimizers of the clustering risk will be denoted by , i.e., . For , we define the Voronoi cell associated with a cluster as the closed set by
and its boundary by
And we write the entire boundaries induced from as
Next, for any and some , we define a set by
[Figure 3: region around the Voronoi cell boundaries, bounded by hyperbolas.]
can be viewed as a neighborhood of in which the distance from a point to two nearest cluster centers differs by as much as . For example, in 2-dimensional Euclidean space (i.e., when ), forms a region surrounded by hyperbolas that are symmetric around each segment in , as shown in Figure 3. Now we introduce the following margin condition.
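The verbal description above can be turned into a small computation: for each point, take the difference between its distances to the two nearest cluster centers, and flag the point when that difference is small. The formal definition of the neighborhood is not reproduced in this copy, so the sketch follows the verbal description only; the threshold direction (at most `kappa`) and the names `margin_gap` and `near_boundary` are assumptions.

```python
import numpy as np

def margin_gap(points, codebook):
    """Difference between the distances to the second-nearest and nearest centers."""
    d = np.sqrt(((points[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2))
    d.sort(axis=1)                      # row-wise ascending distances
    return d[:, 1] - d[:, 0]

def near_boundary(points, codebook, kappa):
    """Flag points whose two nearest-center distances differ by at most kappa."""
    return margin_gap(points, codebook) <= kappa

codebook = np.array([[-1.0, 0.0], [1.0, 0.0]])
points = np.array([
    [0.0, 5.0],   # on the perpendicular bisector: gap 0
    [0.5, 0.0],   # distances 1.5 and 0.5: gap 1
    [1.0, 0.0],   # sits on a center: gap 2
])
gaps = margin_gap(points, codebook)
flags = near_boundary(points, codebook, kappa=0.5)
```

In two dimensions, the set of points with a constant gap to two fixed centers is a branch of a hyperbola, which matches the hyperbola-bounded region described in the text. The share of sample points flagged for a small `kappa` gives an informal sense of how much mass lies close to the cell boundaries.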
Definition 3.1 (Margin condition).
A distribution satisfies a margin condition with radius and rate if and only if for all ,
The above margin condition requires a local control of the probability around for , hence implies that every optimal codebook forms a "natural classification". A larger indicates that is "more structured", facilitating the formation of such a natural classifier, whereas a smaller suggests that a natural classifier is less likely to exist; when , the density is unbounded near . Levrard (2015, 2018) used the same condition with to achieve fast rates of convergence for the excess risk, and provided some instances of the corresponding natural classifiers. This type of margin condition, where the weight of the neighborhood of the critical region is controlled, has been often adopted for a wide range of problems in causal inference involving estimation of non-smooth target parameters (e.g., van der Laan & Luedtke 2015; Luedtke & Van Der Laan 2016; Kennedy et al. 2018; Levis et al. 2023; Kim & Zubizarreta 2023). We introduce the following mild boundedness and consistency assumptions as well.
Assumption A1.
a.s.
Assumption A2.
.
In the next theorem, we give upper bounds of the excess risk, showing that the proposed plug-in estimator (8) is risk consistent.
Theorem 3.1.
A proof of the above theorem and all subsequent proofs can be found in Web Appendix B. The term in commonly appears in the literature involving efficient estimation of non-smooth functionals based on the margin condition, including those listed above. The term is due to the fact that the margin condition in Definition 3.1 only requires a local control in the neighborhood ; if , this term vanishes. Theorem 3.1 essentially states that the extra price we pay for excess risk is the estimation error of the outcome regression functions.
The fact that is risk consistent does not imply that is actually close to the true codebook . To assure consistency of , an additional condition is required as follows.
Assumption A3.
is unique up to relabeling of its coordinates: i.e., is a singleton.
The uniqueness specified in Assumption A3 is also used in earlier work by Pollard (1981, 1982). The next theorem states that the proposed plug-in estimator is consistent.
Theorem 3.2.
The map from into is differentiable if (Pollard 1982). Based on Theorems 3.1 and 3.2, one may thus characterize the rate of convergence of as stated in the next corollary.
Corollary 3.3.
The plug-in estimator is simple and intuitive. When an initial estimator is available or is fitted in a separate independent sample, (8) is readily implementable using the standard, off-the-shelf algorithms including Lloyd’s algorithm. Otherwise, we can estimate the risk via cross-fitting, where we swap the samples, repeat the procedure, and average the results to regain full sample size efficiency. Then we compute the optimal codebook that minimizes the estimated risk. We shall address this in further detail shortly.
Note that the convergence rate in Theorem 3.1 is essentially inherited from that of $\hat{\mu}$. Hence, for either the risk or the codebook, rates of convergence would be expected to be slower than $\sqrt{n}$, with non-normal limiting distributions not centered at the true parameter, unless careful undersmoothing of particular estimators (e.g., splines) is used. Consequently, valid confidence intervals (even via bootstrap) may not be available. In the following section, we will develop an estimator that can be consistent and asymptotically normal even if the nuisance functions are estimated flexibly at slower than $\sqrt{n}$ rates, in a wide variety of settings.
4 Semiparametric Estimator
In this section, we describe estimators that can achieve faster rates than the plug-in estimator from Section 3 based upon semiparametric efficiency theory.
4.1 Proposed estimator
For convenience, we introduce the following additional notation:
(9) | ||||
where denotes a set of relevant nuisance functions . is a conditional probability of receiving the treatment ; when , denotes the propensity score. Notice that and are the uncentered efficient influence function for the parameters and , respectively. The efficient influence function is important to construct optimal estimators since its variance equals the efficiency bound (in asymptotic minimax sense). Shortly, we shall see that exploiting the efficient influence function endows our estimators with desirable properties such as double robustness or general second-order bias, allowing us to relax nonparametric conditions on nuisance function estimation. We refer the interested reader to, for example, van der Vaart (2002); Tsiatis (2007); Kennedy (2016, 2022) for more details about influence functions and semiparametric efficiency theory.
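For readers less familiar with influence-function-based estimators, the sketch below evaluates the standard uncentered augmented inverse-probability-weighted (AIPW) term for a counterfactual mean, $\mathbb{1}(A = a)\{Y - \mu_a(X)\}/\pi_a(X) + \mu_a(X)$. The display (9) is not reproduced in this copy, so this well-known form is used here as an assumption and may differ in notation from the paper's definitions; the simulated data and the name `aipw_pseudo_outcome` are also illustrative.

```python
import numpy as np

def aipw_pseudo_outcome(Y, A, a, mu_a, pi_a):
    """Uncentered AIPW term: 1(A = a) / pi_a(X) * (Y - mu_a(X)) + mu_a(X)."""
    return (A == a) / pi_a * (Y - mu_a) + mu_a

rng = np.random.default_rng(0)
n = 20000
X = rng.normal(size=n)
pi_1 = 0.3 + 0.4 * (X > 0)                    # true P(A = 1 | X), bounded away from 0
A = (rng.uniform(size=n) < pi_1).astype(int)  # confounded treatment assignment
mu_1 = 1.0 + X                                # true E[Y | X, A = 1], so E[Y^1] = 1
Y = np.where(A == 1, mu_1, 0.0) + rng.normal(size=n)

naive = Y[A == 1].mean()                      # ignores confounding by X
est_true_nuisances = aipw_pseudo_outcome(Y, A, 1, mu_1, pi_1).mean()
# Deliberately wrong outcome model (all zeros) with the correct propensity score:
est_wrong_outcome_model = aipw_pseudo_outcome(Y, A, 1, np.zeros(n), pi_1).mean()
```

The last estimate illustrates the double robustness property mentioned in the text: the average of the pseudo-outcomes stays close to the target when one of the two nuisance functions is misspecified, provided the other is correct.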
Next, for any fixed , we define
(10) |
where we let denote a set of all nuisance functions collectively, and be the -th element of the projection . is the uncentered efficient influence function for whenever satisfies the margin condition, as formally stated below.
Lemma 4.1.
We now describe how to construct the proposed estimator for . Following (Robins et al. 2008; Zheng & Van Der Laan 2010; Chernozhukov et al. 2017; Newey & Robins 2018; Kennedy 2023) and many others, we use sample splitting (or cross-fitting) to allow for arbitrarily complex nuisance estimators . Specifically with fixed , we split the data into disjoint groups, each with size approximately, by drawing variables independent of the data; indicates that subject was split into group . This could be done, for example, by drawing each uniformly from . We propose our estimator for as
(11) |
where we let denote empirical averages only over the set of units in group and let denote the nuisance estimator constructed only using those units . In the following section, we will show that the above estimator is asymptotically efficient under weak conditions for any .
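The cross-fitting recipe described above can be sketched for a simpler target, the counterfactual mean $\mathbb{E}(Y^{a})$, rather than the clustering risk itself. Fold labels are drawn uniformly and independently of the data, nuisance functions are fitted on the units outside each fold, and the pseudo-outcomes are averaged within the fold. The display (11) is not reproduced in this copy, so the fold-size-weighted average, the linear outcome model, the constant propensity estimate (reasonable only because treatment is randomized in this toy example), and the name `cross_fit_mean` are assumptions.

```python
import numpy as np

def cross_fit_mean(Y, A, X, a, n_folds=2, seed=0):
    """Cross-fitted AIPW-style estimate of E[Y^a]; returns the estimate and fold labels."""
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, n_folds, size=len(Y))   # fold label per unit, independent of the data
    fold_estimates, fold_sizes = [], []
    for b in range(n_folds):
        test = folds == b
        fit = ~test & (A == a)                      # nuisance fits use units outside fold b only
        design = np.column_stack([np.ones(fit.sum()), X[fit]])
        coef, *_ = np.linalg.lstsq(design, Y[fit], rcond=None)
        pi_hat = np.mean(A[~test] == a)             # constant propensity (randomized design)
        mu_hat = np.column_stack([np.ones(test.sum()), X[test]]) @ coef
        pseudo = (A[test] == a) / pi_hat * (Y[test] - mu_hat) + mu_hat
        fold_estimates.append(pseudo.mean())        # empirical average within fold b
        fold_sizes.append(test.sum())
    return float(np.average(fold_estimates, weights=fold_sizes)), folds

rng = np.random.default_rng(1)
n = 4000
X = rng.normal(size=n)
A = rng.integers(0, 2, size=n)
Y = A * (1.0 + X) + rng.normal(size=n)              # E[Y^1] = 1 in this design

estimate, folds = cross_fit_mean(Y, A, X, a=1)
```

With two folds this reduces to the "swap the samples and average" procedure; when the folds have equal size, the weighted average coincides with a simple average over folds.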
Then we propose estimating the optimal cluster codebook as a minimizer of :
(12) |
After finding the function , can be computed on a full sample. Note that the cross-fitting procedure described above is equally applicable to the plug-in estimator (8). (12) can be computed using first-order (e.g., gradient descent) or second-order (e.g., Newton-Raphson) methods based on the derivative formulas (13) and (14) specified in the following section.
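As a simplified illustration of a first-order method, the sketch below runs gradient descent on the ordinary empirical k-means risk for fully observed points. It is not the bias-corrected objective in (12), whose derivative formulas are not reproduced in this copy; the gradient used here is the standard one for the k-means risk with assignments held fixed, and the names `risk_gradient` and `gradient_descent`, the step size, and the starting centers are assumptions.

```python
import numpy as np

def risk_gradient(points, centers):
    """Gradient of (1/n) * sum_i ||x_i - c_{d(i)}||^2 in the centers, assignments held fixed."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)                  # d(i): index of the nearest center
    grad = np.zeros_like(centers)
    for j in range(len(centers)):
        members = points[labels == j]
        grad[j] = -2.0 * (members - centers[j]).sum(axis=0) / len(points)
    return grad

def gradient_descent(points, init_centers, step=0.5, n_iter=200):
    """Plain gradient descent on the empirical k-means risk."""
    centers = np.array(init_centers, dtype=float)
    for _ in range(n_iter):
        centers -= step * risk_gradient(points, centers)
    return centers

rng = np.random.default_rng(2)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

# Start from one observed point in each blob (a hypothetical initialization).
centers = gradient_descent(points, init_centers=[points[0], points[-1]])
```

At a stationary point each center equals the mean of the points assigned to it, which is the same fixed-point condition that Lloyd's algorithm reaches by alternating updates.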
4.2 Asymptotic Properties
In this subsection, we analyze asymptotic properties of the proposed estimator. For notational simplicity, we define the remainder term that appears in our results as follows:
Note that terms in are all second-order, as opposed to , the analogous bias term for the plug-in estimator in the previous section. We introduce the following additional assumptions pertaining to our nuisance estimation.
Assumption A4.
for some .
Assumption A5.
.
Assumption A6.
.
Assumption A5 is a mild consistency assumption, with no requirement on rates of convergence. Assumption A6 may hold, for example, under standard -type rate conditions on which can be attained under smoothness, sparsity, or other structural constraints (e.g., Kennedy 2016).
Lemma 4.1 allows us to specify conditions under which is an asymptotically normal and efficient estimator for , for any satisfying the margin condition, as stated in the following lemma.
Lemma 4.2.
Under conditions similar to those of Theorem 3.2, we can show that the proposed codebook estimator (12) is consistent, as stated in the following corollary.
Corollary 4.3.
We now focus on the asymptotic properties of , particularly on identifying conditions that assure consistency and asymptotic normality in large nonparametric models. In the next theorem, our first main result of this section, we compute an asymptotic bound for the excess risk, as well as the rate of convergence for .
Theorem 4.4.
Note that the condition is equivalent to , i.e., there are no vacant Voronoi cells, and guarantees that the derivative matrix is nonsingular. Theorem 4.4 shows that the proposed codebook estimator and the associated excess risk may attain substantially faster rates than its nuisance estimators . Specifically if (weaker assumption than A6), we can attain rates for and faster-than- rates for excess risk by virtue of the fact that involves products of nuisance estimation errors.
Asymptotic normality of estimated codebooks in the standard k-means clustering was first studied by Pollard (1982). However, extending the classic result of Pollard (1982) to causal clustering poses some difficulties due to the complexity of our new risk function which relies on multiple nuisance components in an infinite-dimensional function space. To achieve asymptotic normality for our estimated codebook , we shall adopt the logic employed in Kennedy et al. (2023).
Let where each is defined in (9). With a slight abuse of notation, as was done in Bottou & Bengio (1994) we compute the derivative of at any for some fixed by
(13) | ||||
where we let , i.e., the subscript for the nearest center to a given . Similarly, one may compute the derivative matrix of at :
(14) |
where and is a -dimensional vector of all ones.
Notice that the solutions of the minimization problem (12) can be equivalently expressed by solutions to the following empirical moment condition (up to error):
In the next theorem, we give the second main result of this section, which presents conditions allowing for consistency and asymptotic normality of .
Theorem 4.5.
Theorem 4.5 requires a stronger version of the margin condition where is completely empty. Note that we still do not restrict the radius . Importantly, Theorem 4.5 implies that can be not only consistent but also asymptotically normal under the rate condition in Assumption A6, which may hold even when the nuisance estimators are generic and flexibly fit. In this case, asymptotically valid confidence intervals can be readily constructed via bootstrap methods.
5 Illustration
5.1 Simulation Study
In order to assess the performance of the proposed estimators, we conduct a small simulation study. We consider a simplified scenario where a generated codebook forms a natural classifier satisfying the margin condition. As briefly shown in Figure LABEL:fig:experiments-(a), we demonstrate that, as anticipated by our theoretical results, the proposed semiparametric estimator from Section 4 generally has smaller error than the plug-in estimator from Section 3, and achieves parametric rates even with rates on nuisance estimation. Details and full results are included in Web Appendix A.
5.2 Case Study
Here we apply our method to the real-world dataset that was collected to study the relative effects of three treatment programs for adolescent substance abuse, i.e., community (), MET&CBT-5 (), SCY () (McCaffrey et al. 2013; Burgette et al. 2017). For illustration purpose, we use a subset of publicly available data via the twang R package. The dataset consists of samples, youths for each treatment, and covariates including age, ethnicity, and criminal history. Our outcome is the program effectiveness score, where higher scores indicate reduced frequency of substance use.
We use the proposed semiparametric estimator with splits, using the gradient descent algorithm for optimization. For nonparametric estimation we used the cross-validation-based Super Learner ensemble (Van der Laan et al. 2007) to combine regression splines, support vector machine regression, and random forests. The Elbow method indicates that can be a reasonable choice. Figure LABEL:fig:experiments-(b) displays the four clusters in the counterfactual mean vector space, revealing a substantial degree of heterogeneity. In Figure LABEL:fig:experiments-(c), we also present the density plots for the pairwise CATE estimates and , across different clusters. This helps to understand how units in each cluster respond differently to a specific treatment. For instance, for Cluster 2, the traditional community program is more effective than the MET&CBT-5, while there is no significant difference between the community and SCY programs. On the other hand, for units in Cluster 4, the MET&CBT-5 is moderately more successful than the community program, whereas the SCY is significantly less effective.
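The Elbow method mentioned above can be sketched as follows: compute the fitted clustering risk for a range of values of $k$ and look for the point after which further increases in $k$ yield only small reductions. The example uses simulated points with three well-separated groups, not the study data, and a plain k-means fit with a deterministic farthest-point seeding; the name `kmeans_risk` and all settings are assumptions.

```python
import numpy as np

def kmeans_risk(points, k, n_iter=50):
    """Empirical k-means risk after Lloyd iterations from a farthest-point seeding."""
    centers = [points[0]]
    for _ in range(k - 1):
        d2 = ((points[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2)
        centers.append(points[d2.min(axis=1).argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).mean())

rng = np.random.default_rng(3)
group_means = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([rng.normal(loc=m, scale=0.5, size=(60, 2)) for m in group_means])

risks = {k: kmeans_risk(points, k) for k in range(1, 7)}
drops = {k: risks[k - 1] - risks[k] for k in range(2, 7)}  # risk reduction from adding a k-th center
```

Here the risk falls sharply up to `k = 3` and changes little afterwards, so the elbow sits at three clusters. In an applied analysis the same curve would be computed from the estimated risk rather than from fully observed points.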
6 Discussion
In this paper, we propose a new framework for analyzing treatment effect heterogeneity by leveraging tools in cluster analysis. We provide flexible nonparametric estimators for a wide class of models. The proposed methods allow for the discovery of subgroup structure in studies with multiple treatments or outcomes. Our framework is extensible to clustering with generic pseudo-outcomes, such as partially observed outcomes or unknown functionals.
Our findings open up many opportunities for future work. In an upcoming companion paper, we consider kernel-based undersmoothing approaches for causal k-means clustering, which do not require the margin condition. Much more work is required to extend causal clustering to other widely-used clustering algorithms, such as density-based clustering and hierarchical clustering. Different algorithms rely on different assumptions about the data, necessitating distinct analyses. Connecting to prescriptive methods, such as optimal treatment regimes, and to other settings involving, for example, time-varying treatments, instrumental variables, or mediation would also be promising directions for future research.
References
- Arthur & Vassilvitskii (2007) Arthur, D. & Vassilvitskii, S. (2007), k-means++: The advantages of careful seeding, in ‘Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms’, Society for Industrial and Applied Mathematics, pp. 1027–1035.
- Athey & Imbens (2016) Athey, S. & Imbens, G. (2016), ‘Recursive partitioning for heterogeneous causal effects’, Proceedings of the National Academy of Sciences 113(27), 7353–7360.
- Biau et al. (2008) Biau, G., Devroye, L. & Lugosi, G. (2008), ‘On the performance of clustering in Hilbert spaces’, IEEE Transactions on Information Theory 54(2), 781–790.
- Bottou & Bengio (1994) Bottou, L. & Bengio, Y. (1994), ‘Convergence properties of the k-means algorithms’, Advances in neural information processing systems 7.
- Burgette et al. (2017) Burgette, L., Griffin, B. A. & McCaffrey, D. (2017), ‘Propensity scores for multiple treatments: A tutorial for the mnps function in the twang package’, R package. Rand Corporation .
- Chernozhukov et al. (2017) Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C. & Newey, W. (2017), ‘Double/debiased/Neyman machine learning of treatment effects’, American Economic Review 107(5), 261–65.
- Chernozhukov et al. (2016) Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C. & Newey, W. K. (2016), Double machine learning for treatment and causal parameters, Technical report, cemmap working paper.
- Chernozhukov et al. (2018) Chernozhukov, V., Demirer, M., Duflo, E. & Fernandez-Val, I. (2018), Generic machine learning inference on heterogeneous treatment effects in randomized experiments, with an application to immunization in India, Technical report, National Bureau of Economic Research.
- Chernozhukov & Hansen (2005) Chernozhukov, V. & Hansen, C. (2005), ‘An IV model of quantile treatment effects’, Econometrica 73(1), 245–261.
- Devroye et al. (2013) Devroye, L., Györfi, L. & Lugosi, G. (2013), A probabilistic theory of pattern recognition, Vol. 31, Springer Science & Business Media.
- Ding & He (2004) Ding, C. & He, X. (2004), K-means clustering via principal component analysis, in ‘Proceedings of the twenty-first international conference on Machine learning’, ACM, p. 29.
- Foster et al. (2011) Foster, J. C., Taylor, J. M. & Ruberg, S. J. (2011), ‘Subgroup identification from randomized clinical trial data’, Statistics in medicine 30(24), 2867–2880.
- Giné & Nickl (2021) Giné, E. & Nickl, R. (2021), Mathematical foundations of infinite-dimensional statistical models, Cambridge university press.
- Graf & Luschgy (2007) Graf, S. & Luschgy, H. (2007), Foundations of quantization for probability distributions, Springer.
- Grimmer et al. (2017) Grimmer, J., Messing, S. & Westwood, S. J. (2017), ‘Estimating heterogeneous treatment effects and the effects of heterogeneous treatments with ensemble methods’, Political Analysis 25(4), 413–434.
- Haviland et al. (2011) Haviland, A. M., Jones, B. L. & Nagin, D. S. (2011), ‘Group-based trajectory modeling extended to account for nonrandom participant attrition’, Sociological Methods & Research 40(2), 367–390.
- Hayden (2009) Hayden, E. C. (2009), ‘Personalized cancer therapy gets closer’.
- Imai et al. (2013) Imai, K., Ratkovic, M. et al. (2013), ‘Estimating treatment effect heterogeneity in randomized program evaluation’, The Annals of Applied Statistics 7(1), 443–470.
- Imbens & Rubin (2015) Imbens, G. W. & Rubin, D. B. (2015), Causal inference in statistics, social, and biomedical sciences, Cambridge University Press.
- Jain (2010) Jain, A. K. (2010), ‘Data clustering: 50 years beyond k-means’, Pattern recognition letters 31(8), 651–666.
- Kanungo et al. (2002) Kanungo, T., Mount, D. M., Netanyahu, N. S., Piatko, C. D., Silverman, R. & Wu, A. Y. (2002), ‘An efficient k-means clustering algorithm: Analysis and implementation’, IEEE Transactions on Pattern Analysis & Machine Intelligence (7), 881–892.
- Kennedy et al. (2023) Kennedy, E., Balakrishnan, S. & Wasserman, L. (2023), ‘Semiparametric counterfactual density estimation’, Biometrika p. asad017.
- Kennedy (2016) Kennedy, E. H. (2016), Semiparametric theory and empirical processes in causal inference, in ‘Statistical causal inferences and their applications in public health research’, Springer, pp. 141–167.
- Kennedy (2022) Kennedy, E. H. (2022), ‘Semiparametric doubly robust targeted double machine learning: a review’, arXiv preprint arXiv:2203.06469 .
- Kennedy (2023) Kennedy, E. H. (2023), ‘Towards optimal doubly robust estimation of heterogeneous causal effects’, Electronic Journal of Statistics 17(2), 3008–3049.
- Kennedy et al. (2018) Kennedy, E. H., Balakrishnan, S. & G’Sell, M. (2018), ‘Sharp instruments for classifying compliers and generalizing causal effects’, arXiv preprint arXiv:1801.03635 .
- Kim & Zubizarreta (2023) Kim, K. & Zubizarreta, J. R. (2023), Fair and robust estimation of heterogeneous treatment effects for policy learning, in ‘Proceedings of the 40th International Conference on Machine Learning’, Vol. 202 of Proceedings of Machine Learning Research, PMLR, pp. 16997–17014.
- Kravitz et al. (2004) Kravitz, R. L., Duan, N. & Braslow, J. (2004), ‘Evidence-based medicine, heterogeneity of treatment effects, and the trouble with averages’, The Milbank Quarterly 82(4), 661–687.
- Kumar & Patel (2007) Kumar, M. & Patel, N. R. (2007), ‘Clustering data with measurement errors’, Computational Statistics & Data Analysis 51(12), 6084–6101.
- Künzel et al. (2017) Künzel, S. R., Sekhon, J. S., Bickel, P. J. & Yu, B. (2017), ‘Meta-learners for estimating heterogeneous treatment effects using machine learning’, arXiv preprint arXiv:1706.03461 .
- Leskovec et al. (2020) Leskovec, J., Rajaraman, A. & Ullman, J. D. (2020), Mining of massive data sets, Cambridge university press.
- Levis et al. (2023) Levis, A. W., Bonvini, M., Zeng, Z., Keele, L. & Kennedy, E. H. (2023), ‘Covariate-assisted bounds on causal effects with instrumental variables’, arXiv preprint arXiv:2301.12106 .
- Levrard (2015) Levrard, C. (2015), ‘Nonasymptotic bounds for vector quantization in Hilbert spaces’, The Annals of Statistics pp. 592–619.
- Levrard (2018) Levrard, C. (2018), ‘Quantization/clustering: when and why does k-means work?’, Journal de la société française de statistique 159(1), 1–26.
- Linder et al. (1994) Linder, T., Lugosi, G. & Zeger, K. (1994), ‘Rates of convergence in the source coding theorem, in empirical quantizer design, and in universal lossy source coding’, IEEE Transactions on Information Theory 40(6), 1728–1740.
- Lloyd (1982) Lloyd, S. (1982), ‘Least squares quantization in PCM’, IEEE Transactions on Information Theory 28(2), 129–137.
- Luedtke & Van Der Laan (2016) Luedtke, A. R. & Van Der Laan, M. J. (2016), ‘Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy’, Annals of statistics 44(2), 713.
- McCaffrey et al. (2013) McCaffrey, D. F., Griffin, B. A., Almirall, D., Slaughter, M. E., Ramchand, R. & Burgette, L. F. (2013), ‘A tutorial on propensity score estimation for multiple treatments using generalized boosted models’, Statistics in medicine 32(19), 3388–3414.
- Newey & Robins (2018) Newey, W. K. & Robins, J. R. (2018), ‘Cross-fitting and fast remainder rates for semiparametric estimation’, arXiv preprint arXiv:1801.09138 .
- Nie & Wager (2021) Nie, X. & Wager, S. (2021), ‘Quasi-oracle estimation of heterogeneous treatment effects’, Biometrika 108(2), 299–319.
- Pollard (1981) Pollard, D. (1981), ‘Strong consistency of k-means clustering’, The Annals of Statistics pp. 135–140.
- Pollard (1982) Pollard, D. (1982), ‘A central limit theorem for k-means clustering’, The Annals of Probability 10(4), 919–926.
- Robins et al. (2008) Robins, J., Li, L., Tchetgen, E., van der Vaart, A. et al. (2008), Higher order influence functions and minimax estimation of nonlinear functionals, in ‘Probability and statistics: essays in honor of David A. Freedman’, Institute of Mathematical Statistics, pp. 335–421.
- Rubin (1974) Rubin, D. B. (1974), ‘Estimating causal effects of treatments in randomized and nonrandomized studies.’, Journal of Educational Psychology 66(5), 688.
- Serafini et al. (2020) Serafini, A., Murphy, T. B. & Scrucca, L. (2020), ‘Handling missing data in model-based clustering’, arXiv preprint arXiv:2006.02954 .
- Shalit et al. (2017) Shalit, U., Johansson, F. D. & Sontag, D. (2017), Estimating individual treatment effect: generalization bounds and algorithms, in ‘International conference on machine learning’, PMLR, pp. 3076–3085.
- Su et al. (2018) Su, Y., Reedy, J. & Carroll, R. J. (2018), ‘Clustering in general measurement error models’, Statistica Sinica 28(4), 2337.
- Tseng & Wong (2005) Tseng, G. C. & Wong, W. H. (2005), ‘Tight clustering: a resampling-based approach for identifying stable and tight patterns in data’, Biometrics 61(1), 10–16.
- Tsiatis (2007) Tsiatis, A. (2007), Semiparametric theory and missing data, Springer Science & Business Media.
- Van der Laan et al. (2003) Van der Laan, M. J. & Robins, J. M. (2003), Unified methods for censored longitudinal data and causality, Springer Science & Business Media.
- van der Laan & Luedtke (2015) van der Laan, M. J. & Luedtke, A. R. (2015), ‘Targeted learning of the mean outcome under an optimal dynamic treatment rule’, Journal of causal inference 3(1), 61–95.
- Van der Laan et al. (2007) Van der Laan, M. J., Polley, E. C. & Hubbard, A. E. (2007), ‘Super learner’, Statistical applications in genetics and molecular biology 6(1).
- van der Vaart (2002) van der Vaart, A. (2002), Semiparametric statistics, number 1781 in ‘Lecture Notes in Math.’, Springer, pp. 331–457. MR1915446.
- Van der Vaart (2000) Van der Vaart, A. W. (2000), Asymptotic statistics, Vol. 3, Cambridge university press.
- Van Der Vaart & Wellner (1996) Van Der Vaart, A. W. & Wellner, J. A. (1996), Weak convergence, in ‘Weak convergence and empirical processes’, Springer, pp. 16–28.
- VanderWeele (2017) VanderWeele, T. J. (2017), ‘Outcome-wide epidemiology’, Epidemiology (Cambridge, Mass.) 28(3), 399.
- VanderWeele et al. (2016) VanderWeele, T. J., Li, S., Tsai, A. C. & Kawachi, I. (2016), ‘Association between religious service attendance and lower suicide rates among US women’, JAMA psychiatry 73(8), 845–851.
- Wager & Athey (2018) Wager, S. & Athey, S. (2018), ‘Estimation and inference of heterogeneous treatment effects using random forests’, Journal of the American Statistical Association 113(523), 1228–1242.
- Zhang et al. (2017) Zhang, W., Le, T. D., Liu, L., Zhou, Z.-H. & Li, J. (2017), ‘Mining heterogeneous causal effects for personalized cancer treatment’, Bioinformatics 33(15), 2372–2378.
- Zhang et al. (2012) Zhang, Z., Chen, Z., Troendle, J. F. & Zhang, J. (2012), ‘Causal inference on quantiles with an obstetric application’, Biometrics 68(3), 697–706.
- Zheng & Van Der Laan (2010) Zheng, W. & Van Der Laan, M. J. (2010), ‘Asymptotic theory for cross-validated targeted maximum likelihood estimation’, Working Paper 273 .
Web Appendix
Appendix A Simulation Study Details
We consider a simple data generating process as follows. First, we fix , each of which is randomly drawn from a set . Then we randomly pick points in a bounded hypercube under a constraint that every pairwise mutual Euclidean distance between two cluster centers is always greater than . A set of these points is considered as our true codebook ; consequently defines the associated Voronoi cells. To assign roughly equal numbers of units to each , for each unit , we draw a label from a multinomial distribution: with . Given this label information, we set where follows a truncated normal distribution of with the threshold of . This guarantees that the nearest center for units with label in the counterfactual mean vector space is always , and that the margin condition holds. Next, we model our observed data generating process by and , where and . Finally, we let and , where and , respectively, which ensures that and .
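To make the construction concrete, the following Python sketch generates a separated codebook and counterfactual mean vectors in the spirit of the process above. It is an illustration under stated assumptions only: the function names (`sample_codebook`, `sample_counterfactual_means`), the unit hypercube, the norm-based truncation, and every numeric value are hypothetical choices and are not the settings of the simulation reported here.

```python
import numpy as np

def sample_codebook(k, p, min_dist, low=0.0, high=1.0, rng=None, max_tries=10_000):
    """Rejection-sample k centers in [low, high]^p whose pairwise
    Euclidean distances all exceed min_dist (hypothetical helper)."""
    rng = np.random.default_rng(rng)
    for _ in range(max_tries):
        centers = rng.uniform(low, high, size=(k, p))
        diffs = centers[:, None, :] - centers[None, :, :]
        dists = np.linalg.norm(diffs, axis=-1)
        # Check only the strict upper triangle (distinct pairs).
        if np.all(dists[np.triu_indices(k, 1)] > min_dist):
            return centers
    raise RuntimeError("could not satisfy the separation constraint")

def sample_counterfactual_means(n, centers, noise_sd, radius, rng=None):
    """Draw equal-probability labels, then set each unit's vector to its
    center plus Gaussian noise truncated to norm < radius (an assumed
    form of truncation)."""
    rng = np.random.default_rng(rng)
    k, p = centers.shape
    labels = rng.integers(0, k, size=n)
    noise = np.empty((n, p))
    for i in range(n):
        # Redraw until the noise falls inside the truncation ball.
        while True:
            e = rng.normal(0.0, noise_sd, size=p)
            if np.linalg.norm(e) < radius:
                noise[i] = e
                break
    return labels, centers[labels] + noise

# Hypothetical settings: 3 clusters in 2 dimensions, separation 0.3.
centers = sample_codebook(k=3, p=2, min_dist=0.3, rng=0)
labels, mu = sample_counterfactual_means(200, centers, noise_sd=0.1, radius=0.1, rng=1)
```

In this sketch, choosing `radius` below half of `min_dist` guarantees that the nearest center of every unit is the one indexed by its label, which mirrors the role of the truncation threshold described above.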
We randomly pick different pairs of and vary the sample size from to for each . For each tuple, we generate data according to the above specified process, and then compute and the corresponding risk using the plug-in estimator from Section 3, as well as the semiparametric estimator from Section 4. We use splits and the gradient descent algorithm for optimization. We run the simulation times for each at two different nuisance rates of . Results are presented in Figure LABEL:fig:app-sim.
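As a rough illustration of the optimization step, the sketch below minimizes the empirical k-means risk over a codebook by plain gradient descent, treating the rows of `mu_hat` as already-estimated counterfactual mean vectors. The names `kmeans_risk` and `kmeans_gradient_descent`, the step size, the iteration count, and the toy data are all assumptions made for this example; the text above does not specify these details.

```python
import numpy as np

def kmeans_risk(points, centers):
    """Empirical k-means risk: mean squared distance to the nearest center."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1).mean()

def kmeans_gradient_descent(points, init_centers, step=0.5, n_iter=200):
    """Gradient descent on the empirical k-means risk (illustrative sketch)."""
    centers = np.array(init_centers, dtype=float)
    n, k = points.shape[0], centers.shape[0]
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)  # current Voronoi assignment
        grad = np.zeros_like(centers)
        for j in range(k):
            members = points[assign == j]
            if len(members) > 0:
                # Gradient of the risk with respect to center j.
                grad[j] = 2.0 * (len(members) * centers[j] - members.sum(axis=0)) / n
        centers -= step * grad
    return centers

# Toy stand-in for estimated counterfactual mean vectors (hypothetical data).
rng = np.random.default_rng(0)
mu_hat = np.vstack([
    rng.normal(0.0, 0.1, size=(50, 2)),
    rng.normal(5.0, 0.1, size=(50, 2)),
])
init = np.array([[1.0, 1.0], [4.0, 4.0]])
fitted = kmeans_gradient_descent(mu_hat, init)
```

With `step=0.5`, each update moves a center toward the mean of its current cell by a fraction equal to that cell's share of the sample, so the empirical risk does not increase from one iteration to the next.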
For both fast () and slow () rates at which the nuisance functions are estimated, the performance of the proposed semiparametric estimator is improved as grows, nearly at rates. On the other hand, the plug-in estimator shows far worse performance at the slow nuisance estimation rates, as it is no longer expected to converge at rates. Hence, the simulation results validate our theoretical findings in Sections 3 and 4, and support our recommendation to use the proposed semiparametric estimator described in Section 4 in practice.
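One simple way to read off an empirical convergence rate from simulation output of this kind is to regress the log of the error on the log of the sample size. The sketch below does this for made-up, noise-free errors; the function name `empirical_rate`, the sample sizes, and the error values are hypothetical and are not results from the study.

```python
import numpy as np

def empirical_rate(sample_sizes, errors):
    """Slope of log(error) on log(n); a slope near -0.5 suggests a root-n rate."""
    slope, _intercept = np.polyfit(np.log(sample_sizes), np.log(errors), deg=1)
    return slope

# Hypothetical errors that decay exactly like n^{-1/2}.
ns = np.array([500, 1000, 2000, 4000, 8000])
errs = 3.0 / np.sqrt(ns)
rate = empirical_rate(ns, errs)
```

A fitted slope noticeably flatter than -0.5 would indicate slower-than-root-n convergence, which is the qualitative pattern described above for the plug-in estimator under slow nuisance estimation.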
Appendix B Proofs
Notation Guide. Hereafter, we let denote the -norm in order to simplify notation and avoid any confusion with the Euclidean norm , as the -norm is used most frequently in the proofs. For simplicity, we drop the dependence on if the context is clear. Also, for any fixed , we let for so that , and let
Further, we let so that under the margin condition for any , . With a slight abuse of notation, we write .
B.1 Proof of Theorem 3.1
Before proving Theorem 3.1, we present the following three lemmas.
Lemma B.1.
Suppose that Assumption A1 holds, and satisfies the margin condition with some , . Then we have
where denotes the -th coordinate of .
Proof.
Recall that . Letting and , we have
On the one hand, by the iterated expectation we have that
where the last inequality follows by the margin condition. Similarly,
where the first and second inequalities follow by Hölder’s and Markov’s inequalities, respectively, and the fact that each is Lipschitz at .
Putting the two pieces together, we finally obtain that
∎
The next lemma shows that one may achieve faster rates for the bias of .
Lemma B.2.
Suppose that Assumption A1 holds and satisfies the margin condition with some , . Then we have
Proof.
The following lemma computes the bias of our plug-in risk estimator .
Lemma B.3.
Proof.
It is immediate to see that
(A.1)
where , . The central limit theorem implies . Also, it follows by Lemma B.2 that
Further, under Assumption A1, it follows that
(A.2)
which, by the triangle inequality, leads to
For the second term in the last display, note that
(A.3)
where the last inequality follows by Lemma B.1. Hence, by the given consistency condition in Assumption A2, we get , , , and thereby conclude that . Hence, , and by the sample splitting lemma (Kennedy et al. 2018, Lemma 2), we obtain .
Proof of Theorem 3.1.
Notice that
(A.4)
Since a.s., Linder et al. (1994, Theorem 1) implies the following bound for the first term in (B.1):
(A.5)
Hence we obtain that
The same argument as in the preceding proof can be used to compute the rate of convergence in expectation as well. Specifically, when a.s., Biau et al. (2008, Theorem 2.1) implies that
Also, by virtue of Lemma B.2 one may deduce that
Using the above inequalities instead of (A.5) and (B.1), we obtain that
∎
B.2 Proof of Theorem 3.2
Lemma B.4.
For any , under Assumption A1, we have
First, we aim to show
To this end, consider the following decomposition for any :
We will analyze the terms in the following order: (iii), (ii), (i).
(iii) Consider sets of the subgraph . The shattering number of is , which follows by the fact that each is represented as a union of the complements of spheres. Hence the function class is a VC-class. For any fixed and , by the stability property (e.g., Van Der Vaart & Wellner 1996, Lemma 2.6.17) the function class is also a VC-class. Taking as the envelope function, we have under the given boundedness condition. Thus, is -Glivenko-Cantelli, yielding .
(ii) Under Assumption A1, by Lemma B.4 we have
which is under the consistency condition in Assumption A2.
(i) Let for the function class from before. Then,
One may view the nuisance functions as fixed given the training data . Since is a VC-subgraph for any fixed , so is given . Let the VC index of be . Then we have
for some universal constants . Hence applying Giné & Nickl (2021, Theorem 3.5.4), we obtain that
Taking the envelope which is bounded, it is immediate to show that as the integral in the last display is finite. Consequently we get .
Now that we have shown the desired consistency follows by Van der Vaart (2000, Theorem 5.7), noting that is a continuous, bounded function whose domain is compact, and that is unique.
B.3 Proof of Lemma 4.1
Before proving the main result, we introduce the following lemma.
Proof.
Since , it follows
∎
Remark B.1.
(Kennedy (2022, Example 2)) For , it is well known that
Lemma B.6.
Proof.
Letting
and
one may write
Now note that
(A.7)
where the last equality follows by the fact that . For the first term in the last display, it is immediate to see by Lemma B.5 and Remark B.1 that
(A.8)
Next, let us rewrite the second term in (A.7) by
By mimicking the proof of Theorem 2 of Levis et al. (2023), we have that
(A.9)
where the first inequality follows by the fact that and , the third by the margin condition, and the last by local Lipschitz continuity of each at under Assumption A1.
As above, we also note that
(A.10)
where the first inequality follows by Hölder’s inequality and the second by Markov’s inequality. Putting these together, we finally obtain that
∎
Remark B.2 (Proof of Lemma B.2).
Using the same logic as in the proof of Lemma B.6, we may obtain the following uniform bound.
Proof.
Remark B.3 (Proof of Lemma B.4).
Proof of Lemma 4.1.
Recall that and . For two distributions , the second-order remainder term in the von Mises expansion is given by
(A.11)
By Lemma B.6, the last term in (A.11) is further bounded as
Hence for a submodel , we have
by virtue of the fact that the remainder essentially consists of only second-order products of errors between . Since there is at most one efficient influence function in nonparametric models, now we can apply Lemma 2 of Kennedy et al. (2023) and conclude that is the efficient influence function. ∎
B.4 Proof of Lemma 4.2
Proof.
For any , one may write
where we drop the dependence on in for simplicity. Then consider the following decomposition:
It suffices to show that the terms and are negligible, as the last term converges to by the central limit theorem.
(i) Noting with fixed , we have
By adding and subtracting terms, it is straightforward to show
Similarly, one may get
Further, we showed in (B.1) that if .
Putting the three pieces together, we conclude that under the consistency condition in Assumption A5. Hence, we conclude
which follows by the sample splitting lemma (Kennedy et al. 2018, Lemma 2).
Finally, the desired result follows by Slutsky’s theorem. ∎
B.5 Proof of Corollary 4.3
B.6 Proof of Theorem 4.4
Proof.
The first order condition for a solution to the minimization problem (12) is given by , where is defined in (13). Also note that (12) is equivalent to minimizing with
(A.12)
We will proceed with (A.12) in the proof.
Our argument parallels the proof of Theorem 3 of Kennedy et al. (2023). With a slight abuse of notation, we rewrite the empirical moment condition as
(A.13)
(A.14)
(A.15)
where can be obtained by simply adding and subtracting terms. Note that the above represents a system of equations. Here we omit the term for simplicity. The terms in (A.13), (A.14), and (A.15) will be addressed sequentially.
The first term in (A.13) will be asymptotically multivariate Gaussian by the central limit theorem, and hence . Also, under Assumption A1 and the boundedness condition , it is immediate to see
(A.16)
is bounded for each . In the proof of Theorem 4.5, we shall show that the term (A.16) is indeed . Thus, by Kennedy et al. (2018, Lemma 2), for the second term in (A.13), we get
(A.17)
Under the consistency condition in Assumption A5, the term in (A.14) is by Kennedy et al. (2018, Lemma 2).
Next, we shall analyze the second term in (A.15). It suffices to analyze the -th block of the derivative vector (13). By adding and subtracting terms, it is immediate to see that
The first term in the above display is bounded as
(See Remark B.1). For the second term, first we notice that
Next, letting and , we have
where . Hence, using the same logic that we used to obtain (B.3) and (B.3) in the proof of Lemma B.6, one may get
Therefore the second term in (A.15) is bounded as
Finally, we tackle the first term in (A.15). Recall that the ‘Hessian’ matrix of is computed by
where . By the given condition that each , the matrix is nonsingular. Also we have by Corollary 4.3. Hence by Taylor’s theorem, we get the linear approximation
where the last equality follows by virtue of the fact that under the consistency condition in Assumption A5. Putting this back into the original empirical moment condition, together with the other results, we have
or equivalently,
by the nonsingularity of . This implies
so that
Next, by Pollard (1982, Lemma A), under Assumption A1, the map is differentiable with derivative , which leads to the following first-order approximation:
The linear term must vanish as setting minimizes . Consequently, we have
∎
B.7 Proof of Theorem 4.5
Proof.
First, we argue that the function class is Donsker for any fixed , if . This follows by noticing that consists of sums of locally Lipschitz functions with non-overlapping, compact supports, each region defined with the indicator , , and so has a finite bracketing integral.
Next, recall the empirical moment condition in the proof of Theorem 4.4. For the second term in (A.13), note that one may rewrite (A.16) as
(A.18)
Noting the following notational equivalence
and letting and , similarly as in the proof of Lemma B.6, one may show that for any , under the margin condition with any , ,
(A.19)
where the last inequality follows by the fact that the function is locally Lipschitz at . Hence, by Corollary 4.3 as well as the boundedness condition , from (A.19) it follows that
Also, it is immediate to see that by Corollary 4.3. Hence, by the triangle inequality, the term (A.18) is .
Now, consider the following identity
(A.20)
(A.21)
Under the strong margin condition with some and , by Lemma 19.24 of Van der Vaart (2000), we conclude that the term in (A.20) is . The term in (A.21) is as well, since when , (A.17) and Corollary 4.3 imply
Applying this to the original moment condition, together with the other results in Section B.6, we obtain
(A.22)
Substituting the result of Theorem 4.4 into (A.22) finally yields
∎