Differential recall bias in estimating treatment effects in observational studies

Abstract.

Observational studies are frequently used to estimate the effect of an exposure or treatment on an outcome. To obtain an unbiased estimate of the treatment effect, it is crucial to measure the exposure accurately. A common type of exposure misclassification is recall bias, which occurs in retrospective cohort studies when study subjects may inaccurately recall their past exposure. Particularly challenging is differential recall bias in the context of self-reported binary exposures, where the bias may be directional rather than random, and its extent varies according to the outcomes experienced. This paper makes several contributions: (1) it establishes bounds for the average treatment effect (ATE) even when a validation study is not available; (2) it proposes multiple estimation methods across various strategies predicated on different assumptions; and (3) it suggests a sensitivity analysis technique to assess the robustness of the causal conclusion, incorporating insights from prior research. The effectiveness of these methods is demonstrated through simulation studies that explore various model misspecification scenarios. These approaches are then applied to investigate the effect of childhood physical abuse on mental health in adulthood.

Key words and phrases:

Blocking; Causal inference; Differential recall bias; Prognostic score; Propensity score; Stratification.

This article has been accepted for publication in Biometrics, published by Oxford University Press.

Suhwan Bong¹, Kwonsang Lee¹,* and Francesca Dominici²

*Corresponding author: [email protected]

¹Department of Statistics, Seoul National University

²Department of Biostatistics, Harvard T.H. Chan School of Public Health

1. Introduction

Observational studies are conducted to quantify the evidence of a potential causal relationship between an exposure or treatment and a given outcome. While numerous methods have been proposed to address confounding bias in observational studies, only a few have considered the challenges associated with accurately measuring exposure. One such challenge is the presence of recall bias, which refers to a systematic error that occurs when participants inaccurately recall or omit details of past experiences, potentially influenced by subsequent events. Recall bias is particularly problematic in studies relying on self-reporting, such as retrospective cohort studies. It can lead to exposure misclassification, which can manifest as random or differential misclassification (Rothman, 2012). Random recall bias occurs when inaccuracies in the reporting of past events are due to chance and are not influenced by any specific factors. If the inaccuracies are equally likely to occur across the groups, then the bias may cancel out, and the estimated treatment effect may be unbiased (Raphael, 1987). In contrast, differential recall bias occurs when the misclassification of exposure information varies according to the value of other study variables. Specifically, under differential recall bias, exposure is under-reported (or over-reported) to a different degree depending on the outcome, which will likely lead to a biased estimate (Rothman, 2012). In retrospective cohort studies or case-control studies, differential recall bias cannot be eliminated even after adjusting for confounders. This paper will focus on methods for addressing differential recall bias in retrospective cohort studies.

In our motivating example, which examines the data on childhood physical abuse and adult anger, it is noted from prior research that significant differential recall bias can occur. This introduces a systematic bias, further compounding the bias already present due to confounders. Adults tend to under-report their exposures to childhood abuse because they are hesitant to disclose their experiences, even in anonymous or confidential surveys, due to feelings of shame, guilt, or fear of retaliation. Also, it is possible that individuals who have experienced childhood abuse and suffer from anger issues may be more likely to report their abuse, as their anger may be related to unresolved trauma or emotional distress stemming from the abuse. However, it is important to note that the relationship between childhood abuse and adult anger is complex, and under-reporting of childhood abuse is a common problem that can vary depending on a range of individual and contextual factors (Fergusson et al., 2000).

There is an extensive literature on measurement error correction and exposure misclassification in epidemiology (Carroll et al., 1995; Rothman et al., 2008). For instance, numerous studies have concentrated on the errors-in-variables model, addressing measurement error in linear regression problems (Lindley, 1953; Lord, 1960; Cochran, 1968; Fuller, 1980; Carroll et al., 1985). The regression calibration algorithm was proposed as a general approach by Carroll and Stefanski (1990) and Gleser (1990). Several measurement error correction methods have also been developed for binary misclassification problems and logistic regression models (Bross, 1954; Armstrong, 1985; Stefanski and Carroll, 1985; Rosner et al., 1989, 1990). However, previous methodologies addressing measurement error have predominantly focused on outcome prediction and targeted the association between variables. In our research, we aim to address this issue while considering confounders and focusing on the causal relationship between variables. To the best of our knowledge, contributions regarding the impact of differential recall bias on measures with causal interpretations are scarce.

Accounting for measurement error in causal inference is important. However, studies on measurement error have typically focused on mismeasured covariates and misclassified outcomes. For example, previous studies have investigated measurement error in covariates (McCaffrey et al., 2013; Lockwood and McCaffrey, 2016) and misclassification of binary outcomes (Gravel and Platt, 2018). Only a limited number of studies have addressed misclassified exposure, or recall bias. Imai and Yamamoto (2010) proposed a nonparametric identification method for estimating the average treatment effect (ATE) with differential treatment measurement error. This method could address both over-reporting and under-reporting measurement errors, but it was based on strict assumptions, such as no misclassification for compliant groups. Furthermore, their bounds for the ATE were derived from the true treatment assignment probability, which may be unknown when recall bias is present. Babanezhad et al. (2010) and Braun et al. (2017) have shown that exposure misclassification could significantly impact causal analysis. Babanezhad et al. (2010) compared several causal estimators for time-varying exposure reclassification cases, and Braun et al. (2017) proposed a likelihood-based method that adjusts for exposure misclassification bias under a non-differential measurement error assumption. In summary, these studies highlight a gap regarding causal inference methodologies that can analytically assess the impact of differential recall bias.

The primary goal of this paper is to introduce a collection of robust estimators for estimating the ATE in the presence of differential recall bias. We emphasize the significance of delineating the impact of differential recall bias on the ATE within the causal inference framework. Additionally, this paper makes three further contributions. First, we derive bounds for the ATE that do not rely on a validation study. The bounds can be refined with additional information about the nature of differential recall bias, which allows researchers to tailor the bounds to the available evidence. Second, we propose multiple estimation methods using two different strategies for estimating the ATE: maximum likelihood estimation (MLE) and stratification. Each of the methods is based on its own assumptions, offering valuable insights into potential model misspecification. Finally, we propose a novel sensitivity analysis approach to assess the impact of differential recall bias on our conclusion. This is a crucial and useful way to quantify the evidence, given that the degree of differential recall bias is typically unknown in practice. We illustrate the application and efficacy of this sensitivity analysis method by applying it to the real data in Section 6.

2. Notation and Recall Bias Model

2.1. Causal Inference Framework and Target Parameters

We start by introducing the potential outcome framework (Rubin, 1974). Assume that there are $N$ individuals in total. We denote by ${\mathbf{X}}_{i}\in\mathcal{X}\subseteq\mathbb{R}^{d}$ the observed covariate vector for the $i$th individual. We let $Z_{i}=1$ indicate that individual $i$ was exposed to a certain binary exposure, and $Z_{i}=0$ otherwise. We define potential outcomes as follows: if $Z_{i}=0$, then individual $i$ exhibits response $Y_{i}(0)$; if $Z_{i}=1$, then individual $i$ exhibits $Y_{i}(1)$. Only one of the two potential outcomes can be observed, depending on the exposure of individual $i$. The response exhibited by individual $i$ is $Y_{i}=Z_{i}Y_{i}(1)+(1-Z_{i})Y_{i}(0)$. In this paper, $Y_{i}(0)$ and $Y_{i}(1)$ are assumed to be binary: depending on the occurrence of the outcome, each potential outcome is equal to either 0 or 1. We consider two assumptions: (1) unconfoundedness and (2) positivity. The unconfoundedness assumption means that the potential outcomes $(Y_{i}(0),Y_{i}(1))$ are conditionally independent of the treatment $Z_{i}$ given ${\mathbf{X}}_{i}$, that is, $(Y_{i}(0),Y_{i}(1))\perp\!\!\!\perp Z_{i}|{\mathbf{X}}_{i}$. The positivity assumption means that the probability $\Pr(Z_{i}=1|{\mathbf{X}}_{i})$ lies in $(0,1)$. These assumptions together are often called strong ignorability (Rosenbaum and Rubin, 1983). We also adopt the Stable Unit Treatment Value Assumption (Rubin, 1980) to identify causal effects; that is, the potential outcomes for each individual are not affected by the treatment status of other individuals.

Binary exposure is frequently investigated retrospectively to find the cause of the outcome in observational studies; thus, exposure to a risk factor is never randomized. A naive comparison of the prevalence of the outcome between the exposed and unexposed groups can be misleading due to confounding bias. The effect caused by treatment on an individual $i$ is defined as the difference $Y_{i}(1)-Y_{i}(0)$. However, it is impossible to observe both $Y_{i}(0)$ and $Y_{i}(1)$ for any individual. Under strong ignorability assumptions, it is possible to identify the ATE. Thus, our parameter of interest is the ATE, $\tau=\mathbb{E}[Y_{i}(1)]-\mathbb{E}[Y_{i}(0)]$. In some instances, we are interested in estimating the conditional average treatment effect (CATE) at a given level of ${\mathbf{X}}_{i}=\mathbf{x}$ for $\mathbf{x}\in\mathcal{X}$, defined as $\tau(\mathbf{x})=\mathbb{E}[Y_{i}(1)|{\mathbf{X}}_{i}=\mathbf{x}]-\mathbb{E}[Y_{i}(0)|{\mathbf{X}}_{i}=\mathbf{x}]$.

2.2. Differential Recall Bias Model

Some observational studies, including retrospective cohort studies, are retrospective in nature. Thus, recall bias may occur when the exposures are self-reported. In this paper, we consider situations with differential recall bias where the exposure is under-reported (or over-reported) differently depending on the outcome. In the presence of differential recall bias, the underlying true exposure $Z_{i}$ is not observed. Instead, we observe the biased exposure $Z_{i}^{*}$ with recall bias. If no recall bias exists, then $Z_{i}=Z_{i}^{*}$. In the childhood abuse example, previous literature has indicated that exposure was mainly under-reported. To address this issue, we assume that recall bias occurs in only one direction, either over-reporting or under-reporting. Since over-reporting can be treated similarly, we focus on under-reporting recall bias in this paper.

Assumption 1 (Differential Recall Bias).

Recall bias occurs independently with probability $\eta_{y}(\mathbf{x})$ for individuals with $Y_{i}=y$ , $Z_{i}=1$ , and ${\mathbf{X}}_{i}=\mathbf{x}$ where $y=0,1$ and $\mathbf{x}\in\mathcal{X}$ .

	$\displaystyle\eta_{0}(\mathbf{x})=\Pr(Z_{i}^{*}=0\mid Y_{i}=0,Z_{i}=1,{\mathbf{X}}_{i}=\mathbf{x}),$
	$\displaystyle\eta_{1}(\mathbf{x})=\Pr(Z_{i}^{*}=0\mid Y_{i}=1,Z_{i}=1,{\mathbf{X}}_{i}=\mathbf{x}).$

Assumption 1 proposes the differential recall bias model that assumes that the occurrence and magnitude of bias depend on the observed outcome and ${\mathbf{X}}_{i}$ . In essence, after stratifying the data based on ${\mathbf{X}}_{i}$ , the $2\times 2$ contingency table of $Y_{i}$ and $Z_{i}$ can be subject to misclassification. The parameters $\eta_{0}(\mathbf{x})$ and $\eta_{1}(\mathbf{x})$ represent the probability of under-reporting depending on outcome variables. In under-reported cases, $Z_{i}=0$ always implies $Z_{i}^{*}=0$ , but $Z_{i}=1$ implies either $Z_{i}^{*}=1$ or $Z_{i}^{*}=0$ . Therefore, recall bias occurs only when $Z_{i}=1$ . Note that if there is no recall bias, then $\eta_{0}(\mathbf{x})=\eta_{1}(\mathbf{x})=0$ .
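The reporting mechanism in Assumption 1 can be illustrated with a small simulation. This is only a sketch: the function name `reported_exposure` and the constant values of $\eta_{0}$ and $\eta_{1}$ (no dependence on $\mathbf{x}$) are assumptions made for illustration and are not taken from the paper.

```python
import random

def reported_exposure(z, y, eta0, eta1, rng):
    """Draw the reported exposure Z* given true exposure z and outcome y.

    Under-reporting only: Z = 0 always gives Z* = 0, while Z = 1 is
    reported as 0 with probability eta_y (constant in x, by assumption).
    """
    if z == 0:
        return 0  # unexposed individuals never report exposure
    eta = eta1 if y == 1 else eta0  # under-reporting depends on the outcome
    return 0 if rng.random() < eta else 1

rng = random.Random(1)
n = 10_000
# Share of truly exposed individuals who still report exposure, by outcome.
# The expected shares are 1 - eta0 = 0.9 and 1 - eta1 = 0.7.
share_y0 = sum(reported_exposure(1, 0, 0.1, 0.3, rng) for _ in range(n)) / n
share_y1 = sum(reported_exposure(1, 1, 0.1, 0.3, rng) for _ in range(n)) / n
print(share_y0, share_y1)
```

Because the two shares differ, the observed $2\times 2$ table is distorted differently in the $Y=0$ and $Y=1$ columns, which is what makes the bias differential.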

3. Identification of Causal Parameters

If recall bias is absent and the exposure is observed correctly, then $\tau$ and $\tau(\mathbf{x})$ can be identified under strong ignorability assumptions. However, in the presence of recall bias, if the inference is made on the basis of the observed $Z_{i}^{*}$ rather than $Z_{i}$, then we obtain a biased estimate because $(Y_{i}(0),Y_{i}(1))\not\perp\!\!\!\perp Z_{i}^{*}|{\mathbf{X}}_{i}$. To describe this, consider the probabilities based on the observed exposure $Z^{\ast}$, $p_{y|z}^{\ast}(\mathbf{x})=\Pr(Y_{i}=y|Z_{i}^{\ast}=z,{\mathbf{X}}_{i}=\mathbf{x})$ for $z=0,1$, $y=0,1$, and $\mathbf{x}\in\mathcal{X}$. Then, in general,

	$\displaystyle p_{y|z}(\mathbf{x})=\frac{p_{yz}(\mathbf{x})}{p_{1z}(\mathbf{x})+p_{0z}(\mathbf{x})}\neq\frac{p_{yz}^{\ast}(\mathbf{x})}{p_{1z}^{\ast}(\mathbf{x})+p_{0z}^{\ast}(\mathbf{x})}=p_{y|z}^{\ast}(\mathbf{x}),$

where $p_{yz}(\mathbf{x})=\Pr(Y_{i}=y,Z_{i}=z|{\mathbf{X}}_{i}=\mathbf{x})$ and $p_{yz}^{\ast}(\mathbf{x})=\Pr(Y_{i}=y,Z_{i}^{\ast}=z|{\mathbf{X}}_{i}=\mathbf{x})$ for $z=0,1$, $y=0,1$, and $\mathbf{x}\in\mathcal{X}$.

The precise recall bias mechanism in real-life scenarios is often unknown. Assuming we lack precise knowledge of the recall bias parameter functions stated in Assumption 1, it becomes infeasible to identify the ATE with certainty. However, if we can establish bounds on the recall bias parameter functions, we can potentially bound the target parameters. Under the recall bias model in Assumption 1, the following relationships hold:

	$\displaystyle p_{11}(\mathbf{x})=\frac{p_{11}^{\ast}(\mathbf{x})}{1-\eta_{1}(\mathbf{x})},\quad p_{10}(\mathbf{x})=p_{10}^{\ast}(\mathbf{x})-\frac{\eta_{1}(\mathbf{x})}{1-\eta_{1}(\mathbf{x})}p_{11}^{\ast}(\mathbf{x}),$
	$\displaystyle p_{01}(\mathbf{x})=\frac{p_{01}^{\ast}(\mathbf{x})}{1-\eta_{0}(\mathbf{x})},\quad p_{00}(\mathbf{x})=p_{00}^{\ast}(\mathbf{x})-\frac{\eta_{0}(\mathbf{x})}{1-\eta_{0}(\mathbf{x})}p_{01}^{\ast}(\mathbf{x}).$	(1)
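As a numerical check of the relationships in (1), the sketch below applies the under-reporting model to a set of true cell probabilities and then inverts it. The function names, the dictionary layout keyed by $(y,z)$, and the numbers are hypothetical choices made for this illustration.

```python
def observed_cells(p, eta0, eta1):
    """Forward model: true cells p[(y, z)] -> observed cells p*[(y, z)]."""
    return {
        (1, 1): p[(1, 1)] * (1 - eta1),            # exposed cases who report
        (1, 0): p[(1, 0)] + eta1 * p[(1, 1)],      # plus exposed cases who do not
        (0, 1): p[(0, 1)] * (1 - eta0),
        (0, 0): p[(0, 0)] + eta0 * p[(0, 1)],
    }

def true_cells(p_star, eta0, eta1):
    """Inverse map given by the relationships in (1)."""
    r0 = eta0 / (1 - eta0)
    r1 = eta1 / (1 - eta1)
    return {
        (1, 1): p_star[(1, 1)] / (1 - eta1),
        (1, 0): p_star[(1, 0)] - r1 * p_star[(1, 1)],
        (0, 1): p_star[(0, 1)] / (1 - eta0),
        (0, 0): p_star[(0, 0)] - r0 * p_star[(0, 1)],
    }

# Illustrative true joint distribution of (Y, Z) at some fixed x.
p_true = {(1, 1): 0.20, (1, 0): 0.10, (0, 1): 0.30, (0, 0): 0.40}
p_obs = observed_cells(p_true, eta0=0.1, eta1=0.2)
p_back = true_cells(p_obs, eta0=0.1, eta1=0.2)
print(p_obs)
print(p_back)
```

The round trip recovers the true cells only when the same $\eta_{0}$ and $\eta_{1}$ are used in both directions, which is why these parameters must be known, bounded, or varied in a sensitivity analysis.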

The following proposition allows partial identification of the causal parameters if the recall bias occurs with the probability of at most $\delta$ .

Proposition 1.

Under Assumption 1, suppose there exists a constant $0\leq\delta<1$ such that $0\leq\eta_{0}(\mathbf{x}),\eta_{1}(\mathbf{x})\leq\delta$ holds for all $\mathbf{x}\in\mathcal{X}$. Then the following inequalities hold for all $\mathbf{x}\in\mathcal{X}$:

	$\displaystyle\frac{p_{11}^{*}(\mathbf{x})}{p_{11}^{*}(\mathbf{x})+\frac{1}{1-\delta}p_{01}^{*}(\mathbf{x})}\leq p_{1|1}(\mathbf{x})\leq\frac{p_{11}^{*}(\mathbf{x})}{p_{11}^{*}(\mathbf{x})+(1-\delta)p_{01}^{*}(\mathbf{x})},$
	$\displaystyle\frac{p_{10}^{*}(\mathbf{x})-\frac{\delta}{1-\delta}p_{11}^{*}(\mathbf{x})}{p_{10}^{*}(\mathbf{x})+p_{00}^{*}(\mathbf{x})-\frac{\delta}{1-\delta}p_{11}^{*}(\mathbf{x})}\leq p_{1|0}(\mathbf{x})\leq\frac{p_{10}^{*}(\mathbf{x})}{p_{10}^{*}(\mathbf{x})+p_{00}^{*}(\mathbf{x})-\frac{\delta}{1-\delta}p_{01}^{*}(\mathbf{x})}.$

This proposition can be used when we can constrain the occurrence probability of recall bias using domain knowledge.
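The bounds in Proposition 1 are simple to evaluate from the four observed cell probabilities. The following is a minimal sketch: the function name `prop1_bounds`, its argument names, and the example numbers are assumptions for illustration. The interval for $\tau(\mathbf{x})$ is obtained by differencing the two pairs of bounds, and $\delta$ is assumed small enough that all numerators and denominators stay positive.

```python
def prop1_bounds(p11, p01, p10, p00, delta):
    """Bounds of Proposition 1 from observed cells p*_{yz} (arguments pYZ).

    p11 = Pr(Y=1, Z*=1 | x), p01 = Pr(Y=0, Z*=1 | x),
    p10 = Pr(Y=1, Z*=0 | x), p00 = Pr(Y=0, Z*=0 | x).
    """
    r = delta / (1.0 - delta)
    lo_11 = p11 / (p11 + p01 / (1.0 - delta))
    hi_11 = p11 / (p11 + (1.0 - delta) * p01)
    lo_10 = (p10 - r * p11) / (p10 + p00 - r * p11)
    hi_10 = p10 / (p10 + p00 - r * p01)
    return {
        "p1|1": (lo_11, hi_11),
        "p1|0": (lo_10, hi_10),
        # Widest possible difference p_{1|1} - p_{1|0}.
        "tau": (lo_11 - hi_10, hi_11 - lo_10),
    }

# Observed cells generated from true cells (0.20, 0.30, 0.10, 0.40)
# with eta0 = 0.1 and eta1 = 0.2, so true p_{1|1} = 0.4 and p_{1|0} = 0.2.
bounds = prop1_bounds(p11=0.16, p01=0.27, p10=0.14, p00=0.43, delta=0.2)
print(bounds)
```

Setting `delta=0` collapses each interval to the naive value computed from the observed table, which matches the remark that $\eta_{0}(\mathbf{x})=\eta_{1}(\mathbf{x})=0$ corresponds to no recall bias.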

Some additional assumptions may be useful to narrow the bounds of the estimands. For example, when studying the potential impact of childhood abuse on mental health issues in adulthood, it is important to consider the possibility that individuals hide or feel shame about their previous experiences. Additionally, those who have mental health problems in adulthood may be more likely to under-report their history of abuse, as they may feel particularly affected by their experiences and may be hesitant to disclose them. This additional information supports the assumption that $\eta_{0}(\mathbf{x})\leq\eta_{1}(\mathbf{x})$.

Proposition 2.

Under Assumption 1,

(a)

Suppose there exists a constant $0\leq\delta<1$ such that $0\leq\eta_{0}(\mathbf{x})\leq\eta_{1}(\mathbf{x})\leq\delta$ holds for all $\mathbf{x}\in\mathcal{X}$. Then, the following inequalities hold for all $\mathbf{x}\in\mathcal{X}$:

	$\displaystyle p^{*}_{1|1}(\mathbf{x})\leq p_{1|1}(\mathbf{x})\leq\frac{p_{11}^{*}(\mathbf{x})}{p_{11}^{*}(\mathbf{x})+p_{01}^{*}(\mathbf{x})(1-\delta)},$
	$\displaystyle\frac{p_{10}^{*}(\mathbf{x})-\frac{\delta}{1-\delta}p_{11}^{*}(\mathbf{x})}{p_{10}^{*}(\mathbf{x})+p_{00}^{*}(\mathbf{x})-\frac{\delta}{1-\delta}p_{11}^{*}(\mathbf{x})}\leq p_{1|0}(\mathbf{x})\leq\max\left\{p^{*}_{1|0}(\mathbf{x}),\frac{p_{10}^{*}(\mathbf{x})-\frac{\delta}{1-\delta}p_{11}^{*}(\mathbf{x})}{p_{10}^{*}(\mathbf{x})+p_{00}^{*}(\mathbf{x})-\frac{\delta}{1-\delta}\{p_{01}^{*}(\mathbf{x})+p_{11}^{*}(\mathbf{x})\}}\right\}.$

(b)

Suppose there exists a constant $0\leq\delta<1$ such that $0\leq\eta_{1}(\mathbf{x})\leq\eta_{0}(\mathbf{x})\leq\delta$ holds for all $\mathbf{x}\in\mathcal{X}$. Then, the following inequalities hold for all $\mathbf{x}\in\mathcal{X}$:

	$\displaystyle\frac{p_{11}^{*}(\mathbf{x})}{p_{11}^{*}(\mathbf{x})+p_{01}^{*}(\mathbf{x})/(1-\delta)}\leq p_{1|1}(\mathbf{x})\leq p^{*}_{1|1}(\mathbf{x}),$
	$\displaystyle\min\left\{p^{*}_{1|0}(\mathbf{x}),\frac{p_{10}^{*}(\mathbf{x})-\frac{\delta}{1-\delta}p_{11}^{*}(\mathbf{x})}{p_{10}^{*}(\mathbf{x})+p_{00}^{*}(\mathbf{x})-\frac{\delta}{1-\delta}\{p_{01}^{*}(\mathbf{x})+p_{11}^{*}(\mathbf{x})\}}\right\}\leq p_{1|0}(\mathbf{x})\leq\frac{p_{10}^{*}(\mathbf{x})}{p_{10}^{*}(\mathbf{x})+p_{00}^{*}(\mathbf{x})-\frac{\delta}{1-\delta}p_{01}^{*}(\mathbf{x})}.$

This proposition implies that by assuming a relationship between two parameters, $\eta_{0}(\mathbf{x})$ and $\eta_{1}(\mathbf{x})$ , we can narrow down either the upper or lower bound of the ATE. We can partially identify the causal parameter when the exact recall bias parameter functions are unknown. We can also point identify the causal treatment effect parameters if the recall bias parameter functions are known.
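To see how the ordering assumption tightens the intervals, the sketch below evaluates both parts of Proposition 2. The function name `prop2_bounds`, the `order` flag, and the example numbers are hypothetical and introduced only for this illustration; $\delta$ is again assumed small enough that all terms stay positive.

```python
def prop2_bounds(p11, p01, p10, p00, delta, order="eta0<=eta1"):
    """Bounds of Proposition 2 from observed cells p*_{yz} (arguments pYZ).

    Returns ((lower, upper) for p_{1|1}, (lower, upper) for p_{1|0}).
    """
    r = delta / (1.0 - delta)
    naive_11 = p11 / (p11 + p01)   # observed p*_{1|1}
    naive_10 = p10 / (p10 + p00)   # observed p*_{1|0}
    # Value of p_{1|0} when eta0 = eta1 = delta.
    both = (p10 - r * p11) / (p10 + p00 - r * (p01 + p11))
    if order == "eta0<=eta1":      # part (a)
        b11 = (naive_11, p11 / (p11 + (1.0 - delta) * p01))
        b10 = ((p10 - r * p11) / (p10 + p00 - r * p11), max(naive_10, both))
    elif order == "eta1<=eta0":    # part (b)
        b11 = (p11 / (p11 + p01 / (1.0 - delta)), naive_11)
        b10 = (min(naive_10, both), p10 / (p10 + p00 - r * p01))
    else:
        raise ValueError("order must be 'eta0<=eta1' or 'eta1<=eta0'")
    return b11, b10

# Part (a): observed cells generated with eta0 = 0.1 <= eta1 = 0.2.
a11, a10 = prop2_bounds(0.16, 0.27, 0.14, 0.43, delta=0.2, order="eta0<=eta1")
# Part (b): observed cells generated with eta1 = 0.1 <= eta0 = 0.2.
b11, b10 = prop2_bounds(0.18, 0.24, 0.12, 0.46, delta=0.2, order="eta1<=eta0")
print(a11, a10)
print(b11, b10)
```

In both examples the true values are $p_{1|1}=0.4$ and $p_{1|0}=0.2$. In part (a) the lower end for $p_{1|1}$ is the naive value $p^{*}_{1|1}$, which is tighter than the corresponding end in Proposition 1.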

Proposition 3.

Under Assumption 1, the following equality holds for all $\mathbf{x}\in\mathcal{X}$:

	$\displaystyle\tau(\mathbf{x})=\frac{\frac{p_{11}^{*}(\mathbf{x})}{1-\eta_{1}(\mathbf{x})}}{\frac{p_{11}^{*}(\mathbf{x})}{1-\eta_{1}(\mathbf{x})}+\frac{p_{01}^{*}(\mathbf{x})}{1-\eta_{0}(\mathbf{x})}}-\frac{p_{10}^{*}(\mathbf{x})-\frac{\eta_{1}(\mathbf{x})}{1-\eta_{1}(\mathbf{x})}p_{11}^{*}(\mathbf{x})}{p_{10}^{*}(\mathbf{x})-\frac{\eta_{1}(\mathbf{x})}{1-\eta_{1}(\mathbf{x})}p_{11}^{*}(\mathbf{x})+p_{00}^{*}(\mathbf{x})-\frac{\eta_{0}(\mathbf{x})}{1-\eta_{0}(\mathbf{x})}p_{01}^{*}(\mathbf{x})}.$

If a validation study is available, $\eta_{0}(\mathbf{x})$ and $\eta_{1}(\mathbf{x})$ can be estimated and then plugged into the above equation. However, validation studies are often unavailable, especially at an early stage of research.
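When values of $\eta_{0}(\mathbf{x})$ and $\eta_{1}(\mathbf{x})$ are available, the plug-in calculation in Proposition 3 is a few lines of arithmetic. The sketch below uses a hypothetical function name `cate_point` and made-up cell probabilities; it is an illustration, not the authors' implementation.

```python
def cate_point(p11, p01, p10, p00, eta0, eta1):
    """Plug-in tau(x) from observed cells p*_{yz} and given eta0, eta1."""
    r0 = eta0 / (1.0 - eta0)
    r1 = eta1 / (1.0 - eta1)
    t11 = p11 / (1.0 - eta1)   # recovered Pr(Y=1, Z=1 | x)
    t01 = p01 / (1.0 - eta0)   # recovered Pr(Y=0, Z=1 | x)
    t10 = p10 - r1 * p11       # recovered Pr(Y=1, Z=0 | x)
    t00 = p00 - r0 * p01       # recovered Pr(Y=0, Z=0 | x)
    return t11 / (t11 + t01) - t10 / (t10 + t00)

# Observed cells generated from true cells (0.20, 0.30, 0.10, 0.40)
# with eta0 = 0.1 and eta1 = 0.2; the true tau(x) is 0.4 - 0.2 = 0.2.
naive = cate_point(0.16, 0.27, 0.14, 0.43, eta0=0.0, eta1=0.0)
adjusted = cate_point(0.16, 0.27, 0.14, 0.43, eta0=0.1, eta1=0.2)
print(naive, adjusted)
```

In this example the naive difference computed from the observed table understates the true effect, while the adjusted value recovers it.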

4. Methods for Recovering the Treatment Effects in the Presence of Recall Bias

In this section, we propose two estimation methods that provide consistent estimates of the ATE in the presence of recall bias and confounding: (1) maximum likelihood estimation and (2) stratification. We suggest three stratification techniques for the stratification method: (1) propensity score stratification, (2) prognostic score stratification, and (3) blocking. Furthermore, we discuss the nearest-neighbor combination method used to address the problems in the stratification method with recall bias.

For given $\eta_{0}(\mathbf{x})$ and $\eta_{1}(\mathbf{x})$, the first, maximum likelihood (ML)-based method requires correct specification of the models for the exposure and the two potential outcomes to obtain a consistent estimate of $\tau$. The stratification-based method requires fewer model assumptions. Stratification can be implemented on the basis of either propensity scores or prognostic scores (Hansen, 2008). The propensity score stratification method requires a correctly specified exposure model, while the prognostic score stratification method needs a correctly specified outcome model. Finally, the blocking method suggested by Karmakar et al. (2021) does not need any model assumption. In the following subsections, we discuss these estimation methods in more detail.

4.1. Maximum Likelihood Estimation

Consider the outcome models $m_{z}(\mathbf{x})=\Pr(Y_{i}=1|Z_{i}=z,{\mathbf{X}}_{i}=\mathbf{x})$ for $z=0,1$, which are the models for the two probabilities $p_{1|1}(\mathbf{x})$ and $p_{1|0}(\mathbf{x})$, and the propensity score model $e(\mathbf{x})$ for $\Pr(Z_{i}=1|{\mathbf{X}}_{i}=\mathbf{x})$. The probability $p_{z}(\mathbf{x})=\mathbb{E}[Y_{i}(z)|{\mathbf{X}}_{i}=\mathbf{x}]$ can be identified as $p_{1|z}(\mathbf{x})$, which can thus be estimated by $m_{z}(\mathbf{x})$. It is well known that, in the absence of recall bias, either $m_{z}(\mathbf{x})$ or $e(\mathbf{x})$ is required to be correctly specified to obtain a consistent estimate. However, in the presence of recall bias, neither $m_{z}(\mathbf{x})$ nor $e(\mathbf{x})$ can be estimated directly from the observable data set because the true $Z_{i}$ is not observed. We can instead estimate the ATE as a function of the tuning parameters of the recall bias model.

The first method presented in this subsection uses maximum likelihood estimation. Both $m_{z}(\mathbf{x})$ and $e(\mathbf{x})$ must be specified to construct the likelihood function and obtain an estimate for given $\eta_{0}(\mathbf{x})$ and $\eta_{1}(\mathbf{x})$. Under Assumption 1, the joint probability $\Pr(Y_{i},Z_{i}^{\ast}|{\mathbf{X}}_{i})$ of the observable variables can be represented as a function of $m_{0}(\mathbf{x})$, $m_{1}(\mathbf{x})$, and $e(\mathbf{x})$. We assume models $m_{z}(\mathbf{x};\bm{\gamma}_{z}),z=0,1$ and $e(\mathbf{x};\bm{\beta})$ with parameters $\bm{\gamma}_{z}$ and $\bm{\beta}$, respectively. For instance, logistic regressions can be used, such as $m(Z,\mathbf{X};\bm{\gamma})=\exp(\gamma_{z}Z+\bm{\gamma}_{\mathbf{X}}^{T}\mathbf{X})/\{1+\exp(\gamma_{z}Z+\bm{\gamma}_{\mathbf{X}}^{T}\mathbf{X})\}$ with $m_{1}(\mathbf{X})=m(1,\mathbf{X};\bm{\gamma})$ and $m_{0}(\mathbf{X})=m(0,\mathbf{X};\bm{\gamma})$, and $e(\mathbf{X})=\exp({\bm{\beta}}^{T}\mathbf{X})/\{1+\exp(\bm{\beta}^{T}\mathbf{X})\}$. These model parameters can be estimated by solving the following maximization problem:

	$\displaystyle\widehat{\bm{\theta}}=(\widehat{\bm{\beta}},\widehat{\bm{\gamma}}_{0},\widehat{\bm{\gamma}}_{1})=\operatorname*{argmax}_{\bm{\beta},\bm{\gamma}_{0},\bm{\gamma}_{1}}\sum_{i=1}^{N}\log\Pr(Y_{i}=y_{i},Z_{i}^{*}=z_{i}|\mathbf{X}_{i}=\mathbf{x}_{i}).$

Once we obtain the estimate $\widehat{\bm{\theta}}$, we can compute $\widehat{m}_{z}(\mathbf{x})=m_{z}(\mathbf{x};\widehat{\bm{\gamma}}_{z})$ and $\widehat{e}(\mathbf{x})=e(\mathbf{x};\widehat{\bm{\beta}})$. The marginal probabilities ${p}_{z}$ are then estimated by taking sample averages of $\widehat{m}_{z}(\mathbf{x})$ as $\widehat{p}_{1}^{ML}=\frac{1}{N}\sum_{i=1}^{N}\widehat{m}_{1}({\mathbf{X}}_{i})$ and $\widehat{p}_{0}^{ML}=\frac{1}{N}\sum_{i=1}^{N}\widehat{m}_{0}({\mathbf{X}}_{i})$. Then, the ATE estimate is $\widehat{\tau}^{ML}=\widehat{p}_{1}^{ML}-\widehat{p}_{0}^{ML}$. This estimate is consistent if $m_{z}(\mathbf{x};\bm{\gamma}_{z})$ and $e(\mathbf{x};\bm{\beta})$ are correctly specified for fixed $\eta_{0}(\mathbf{x}),\eta_{1}(\mathbf{x})$.

A key challenge with this method is the requirement for researchers to specify $\eta_{0}(\mathbf{x})$ and $\eta_{1}(\mathbf{x})$ , which are typically unknown in practice. If these can be estimated from external sources, such estimates can be incorporated into the likelihood function. In the absence of information about $\eta_{0}(\mathbf{x})$ and $\eta_{1}(\mathbf{x})$ , an additional assumption might be made that $\eta_{0}(\mathbf{x})=\eta_{0}$ and $\eta_{1}(\mathbf{x})=\eta_{1}$ , implying constant values across all $\mathbf{x}$ . To address this uncertainty, a sensitivity analysis could be employed, utilizing a plausible range for $\eta_{0}$ and $\eta_{1}$ informed by prior research. This involves testing every possible combination of $(\eta_{0},\eta_{1})$ within the specified range and assessing the impact on the ATE estimate’s variation.
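The observed-data likelihood that such an analysis maximizes for each fixed $(\eta_{0},\eta_{1})$ can be written out explicitly. The sketch below is an illustration under stated assumptions: logistic models with a shared coefficient vector as in the example above, constant $\eta_{0}$ and $\eta_{1}$, covariate vectors that already include an intercept term, and hypothetical function names, parameter values, and toy data. It only evaluates the log-likelihood; at each grid point the maximization over the model parameters would be handed to a numerical optimizer of the reader's choice.

```python
import itertools
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def cell_probs(x, beta, gamma_z, gamma_x, eta0, eta1):
    """Observed Pr(Y=y, Z*=z | X=x), keyed by (y, z), under Assumption 1."""
    e = expit(sum(b * xi for b, xi in zip(beta, x)))   # Pr(Z=1 | x)
    lin = sum(g * xi for g, xi in zip(gamma_x, x))
    m1 = expit(gamma_z + lin)                          # Pr(Y=1 | Z=1, x)
    m0 = expit(lin)                                    # Pr(Y=1 | Z=0, x)
    eta = {0: eta0, 1: eta1}
    probs = {}
    for y in (0, 1):
        py1 = m1 if y == 1 else 1.0 - m1               # Pr(Y=y | Z=1, x)
        py0 = m0 if y == 1 else 1.0 - m0               # Pr(Y=y | Z=0, x)
        probs[(y, 1)] = e * py1 * (1.0 - eta[y])       # exposed and reported
        probs[(y, 0)] = (1.0 - e) * py0 + e * py1 * eta[y]
    return probs

def log_lik(data, beta, gamma_z, gamma_x, eta0, eta1):
    """Sum of log Pr(Y_i, Z*_i | X_i) over records (x, y, z_star)."""
    return sum(
        math.log(cell_probs(x, beta, gamma_z, gamma_x, eta0, eta1)[(y, z)])
        for x, y, z in data
    )

# Toy records (x, y, z_star); the first entry of x is the intercept.
data = [((1.0, 0.5), 1, 1), ((1.0, -0.2), 0, 0), ((1.0, 1.3), 1, 0), ((1.0, 0.1), 0, 1)]
# A hypothetical plausible range for the sensitivity parameters.
grid = list(itertools.product([0.0, 0.1, 0.2], repeat=2))
values = {
    (e0, e1): log_lik(data, beta=(-0.5, 0.8), gamma_z=0.7, gamma_x=(-1.0, 0.4), eta0=e0, eta1=e1)
    for e0, e1 in grid
}
```

For every pair on the grid, the four cell probabilities sum to one, and setting both parameters to zero recovers the usual likelihood that factors into the exposure model and the outcome model.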

4.2. Stratification

Stratification can alternatively be used to estimate $\tau$ by balancing the covariate distributions between the exposed and unexposed groups. Compared with the ML method, stratification requires fewer assumptions in general. If strata can be successfully created while adjusting for confounders, the estimation of $\tau$ is straightforward. Assume that there are $I$ strata. Each stratum $i$ contains $n_{i}$ individuals, so there are $N=\sum_{i=1}^{I}n_{i}$ individuals in total. Denote by $ij$ the $j$th individual in stratum $i$ for $j=1,\dots,n_{i}$. If we assume that $(Y_{ij}(1),Y_{ij}(0))\perp\!\!\!\perp Z_{ij}$ holds within each stratum $i$, then the stratum-specific probabilities $p_{1i}=\mathbb{E}_{\mathbf{X}|\text{stratum }i}[p_{1}(\mathbf{X})]$ and $p_{0i}=\mathbb{E}_{\mathbf{X}|\text{stratum }i}[p_{0}(\mathbf{X})]$ can be identified from the $2\times 2$ table generated by stratum $i$. However, $Z_{ij}^{\ast}$ is observed instead of $Z_{ij}$ due to recall bias. Therefore, the recall bias adjustment using (3) is required. For stratum $i$, assume that Table 1 with $a_{i}^{*},b_{i}^{*},c_{i}^{*},d_{i}^{*}$ is observed.

Table 1. The observed $2\times 2$ contingency table for the $i$th stratum.

	$Y=1$	$Y=0$	Total
Exposed $(Z^{*}=1)$	$a_{i}^{*}$	$b_{i}^{*}$	$a_{i}^{*}+b_{i}^{*}$
Not exposed $(Z^{*}=0)$	$c_{i}^{*}$	$d_{i}^{*}$	$c_{i}^{*}+d_{i}^{*}$
Total	$a_{i}^{*}+c_{i}^{*}$	$b_{i}^{*}+d_{i}^{*}$	$n_{i}^{*}$

Proposition 4.

Suppose there are $2\times 2$ contingency tables for $I$ strata on $Y$ and $Z^{*}$ with $a_{i}^{*},b_{i}^{*},c_{i}^{*},d_{i}^{*}$ as in Table 1. The stratum-specific ATE $\tau_{i}=\mathbb{E}_{\mathbf{X}|\text{stratum }i}[\tau(\mathbf{X})]$ can be estimated for known $\eta_{0}(\mathbf{x}),\eta_{1}(\mathbf{x})$ as

	$\displaystyle\widehat{\tau}_{i}=\frac{\sum_{j=1}^{n_{i}}\frac{Z_{ij}^{*}Y_{ij}}{1-\eta_{1}(\mathbf{x}_{ij})}}{\sum_{j=1}^{n_{i}}\frac{Z_{ij}^{*}Y_{ij}}{1-\eta_{1}(\mathbf{x}_{ij})}+\sum_{j=1}^{n_{i}}\frac{Z_{ij}^{*}(1-Y_{ij})}{1-\eta_{0}(\mathbf{x}_{ij})}}$
	$\displaystyle-\frac{\sum_{j=1}^{n_{i}}(1-Z_{ij}^{*})Y_{ij}-\sum_{j=1}^{n_{i}}Z_{ij}^{*}Y_{ij}\frac{\eta_{1}(\mathbf{x}_{ij})}{1-\eta_{1}(\mathbf{x}_{ij})}}{\sum_{j=1}^{n_{i}}(1-Z_{ij}^{*})Y_{ij}-\sum_{j=1}^{n_{i}}Z_{ij}^{*}Y_{ij}\frac{\eta_{1}(\mathbf{x}_{ij})}{1-\eta_{1}(\mathbf{x}_{ij})}+\sum_{j=1}^{n_{i}}(1-Z_{ij}^{*})(1-Y_{ij})-\sum_{j=1}^{n_{i}}Z_{ij}^{*}(1-Y_{ij})\frac{\eta_{0}(\mathbf{x}_{ij})}{1-\eta_{0}(\mathbf{x}_{ij})}}.$

Also, for $0\leq\eta_{0}(\mathbf{x}),\eta_{1}(\mathbf{x})\leq\delta_{i}$ , the bound can be estimated as

	$\displaystyle\frac{a_{i}^{*}}{a_{i}^{*}+b_{i}^{*}\frac{1}{1-\delta_{i}}}-\frac{c_{i}^{*}}{c_{i}^{*}+d_{i}^{*}-b_{i}^{*}\frac{\delta_{i}}{1-\delta_{i}}}\leq\widehat{\tau}_{i}\leq\frac{a_{i}^{*}}{a_{i}^{*}+b_{i}^{*}(1-\delta_{i})}-\frac{c_{i}^{*}-a_{i}^{*}\frac{\delta_{i}}{1-\delta_{i}}}{c_{i}^{*}+d_{i}^{*}-a_{i}^{*}\frac{\delta_{i}}{1-\delta_{i}}}.$

This proposition is directly obtained from Propositions 1 and 3. If $\eta_{0}(\mathbf{x})=\eta_{0}$ and $\eta_{1}(\mathbf{x})=\eta_{1}$, the estimate $\widehat{\tau}_{i}$ simplifies to $\frac{a_{i}^{*}/(1-\eta_{1})}{a_{i}^{*}/(1-\eta_{1})+b_{i}^{*}/(1-\eta_{0})}-\frac{c_{i}^{*}-a_{i}^{*}(\eta_{1}/(1-\eta_{1}))}{c_{i}^{*}-a_{i}^{*}(\eta_{1}/(1-\eta_{1}))+d_{i}^{*}-b_{i}^{*}(\eta_{0}/(1-\eta_{0}))}$. The overall ATE is then estimated by the weighted average of the stratum-specific estimates $\widehat{\tau}_{i}$ with weights $s_{i}=n_{i}/N$, that is, $\widehat{\tau}^{S}=\sum_{i=1}^{I}s_{i}\widehat{\tau}_{i}$. The bound can be similarly obtained.
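With constant $\eta_{0}$ and $\eta_{1}$, the stratified estimator needs only the four counts of Table 1 in each stratum. The sketch below is illustrative: the function names, the tuple layout `(a, b, c, d)`, and the counts are assumptions, and the strata are combined with weights $n_{i}/N$.

```python
def stratum_tau(a, b, c, d, eta0, eta1):
    """Adjusted stratum-specific estimate from observed counts a*, b*, c*, d*."""
    r0 = eta0 / (1.0 - eta0)
    r1 = eta1 / (1.0 - eta1)
    p1 = (a / (1.0 - eta1)) / (a / (1.0 - eta1) + b / (1.0 - eta0))
    p0 = (c - a * r1) / (c - a * r1 + d - b * r0)
    return p1 - p0

def stratum_bounds(a, b, c, d, delta):
    """Bounds for the stratum-specific effect when 0 <= eta0, eta1 <= delta."""
    r = delta / (1.0 - delta)
    lower = a / (a + b / (1.0 - delta)) - c / (c + d - b * r)
    upper = a / (a + b * (1.0 - delta)) - (c - a * r) / (c + d - a * r)
    return lower, upper

def stratified_ate(tables, eta0, eta1):
    """Weighted average of stratum estimates with weights n_i / N."""
    total = sum(a + b + c + d for a, b, c, d in tables)
    return sum(
        (a + b + c + d) / total * stratum_tau(a, b, c, d, eta0, eta1)
        for a, b, c, d in tables
    )

# Two illustrative strata of sizes 1000 and 500, given as (a*, b*, c*, d*).
tables = [(160, 270, 140, 430), (40, 90, 110, 260)]
ate = stratified_ate(tables, eta0=0.1, eta1=0.2)
lo, hi = stratum_bounds(*tables[0], delta=0.2)
print(ate, (lo, hi))
```

The counts were constructed so that the adjusted stratum effects are $0.2$ and $1/3-2/7$, which makes the weighted average easy to verify by hand.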

4.2.1. Propensity Score Stratification

Among stratification-based methods, stratification based on the propensity score is the most common approach (Rosenbaum and Rubin, 1983). The propensity score is the conditional probability of treatment assignment given the observed covariates, $e(\mathbf{x})=\Pr(Z_{i}=1|\mathbf{X}_{i}=\mathbf{x})$. We only have to assume a treatment model to create strata using propensity scores. However, similar to many stratification-based methods, this method relies on the assumption that stratification achieves covariate balance at least approximately. Furthermore, strata are formed on the basis of the biased propensity score $e^{*}(\mathbf{x})=\Pr(Z_{i}^{*}=1|\mathbf{X}_{i}=\mathbf{x})$, which is estimated using $Z^{*}$ instead of the unobservable $Z$. Because $Z$ is not observed, it is not feasible to compare the covariate distributions between the truly exposed and unexposed groups. Thus, constructing strata based on the propensity score can be problematic if $\eta_{0}(\mathbf{x})$ and $\eta_{1}(\mathbf{x})$ are significantly different from $0$. However, if $\eta_{0}(\mathbf{x})=\eta_{1}(\mathbf{x})=\eta$ for all $\mathbf{x}\in\mathcal{X}$, then the covariate balance between the $Z^{*}=1$ and $Z^{*}=0$ groups is asymptotically the same as that between the $Z=1$ and $Z=0$ groups since $e^{\ast}(\mathbf{x})=(1-\eta)e(\mathbf{x})$. In other words, if the recall bias occurs with the same probability across the $Y=0$ and $Y=1$ groups (i.e., recall bias is not differential), then $e^{\ast}(\mathbf{x})$ is also a balancing score. Thus, we can create valid strata using the biased propensity score obtained from observable variables.
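The relation between the observed and true propensity scores follows from Assumption 1 in one step. As a short worked derivation, using only the quantities $p_{yz}(\mathbf{x})$ and $\eta_{y}(\mathbf{x})$ defined earlier and summing over the two outcome values among the truly exposed:

	$\displaystyle e^{\ast}(\mathbf{x})=\Pr(Z_{i}^{\ast}=1|{\mathbf{X}}_{i}=\mathbf{x})=\{1-\eta_{1}(\mathbf{x})\}p_{11}(\mathbf{x})+\{1-\eta_{0}(\mathbf{x})\}p_{01}(\mathbf{x}).$

When $\eta_{0}(\mathbf{x})=\eta_{1}(\mathbf{x})=\eta$, the right-hand side equals $(1-\eta)\{p_{11}(\mathbf{x})+p_{01}(\mathbf{x})\}=(1-\eta)e(\mathbf{x})$, a fixed multiple of the true propensity score, so ordering individuals by $e^{\ast}$ is the same as ordering them by $e$. When $\eta_{0}(\mathbf{x})\neq\eta_{1}(\mathbf{x})$, the observed score also depends on how the outcome is distributed among the exposed, which is why strata built on it may fail to balance covariates.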

4.2.2. Prognostic Score Stratification

Instead of using the propensity score, the prognostic score can be utilized to construct strata (Hansen, 2008). If there exists a function $\Psi(\mathbf{X}_{ij})$ such that $Y_{ij}(0)\perp\!\!\!\perp\mathbf{X}_{ij}|\Psi(\mathbf{X}_{ij})$, then we call $\Psi(\cdot)$ the prognostic score. Similar to propensity score stratification, prognostic score stratification permits the estimation of exposure effects within the exposed group. If $(Y_{ij}(0),Y_{ij}(1))\perp\!\!\!\perp\mathbf{X}_{ij}|\Psi(\mathbf{X}_{ij})$ is further assumed, then prognostic score stratification is valid for estimating overall exposure effects. For instance, if $m(Z_{ij},\mathbf{X}_{ij};\bm{\gamma})=\exp(\gamma_{z}Z_{ij}+\bm{\gamma}_{\mathbf{X}}^{T}\mathbf{X}_{ij})/\{1+\exp(\gamma_{z}Z_{ij}+\bm{\gamma}_{\mathbf{X}}^{T}\mathbf{X}_{ij})\}$ is assumed, $\Psi(\mathbf{X}_{ij})=\bm{\gamma}_{\mathbf{X}}^{T}\mathbf{X}_{ij}$ is the prognostic score.

Like propensity score stratification, stratification on the prognostic score leads to a desirable and balanced structure. Since we do not know $\Psi(\mathbf{X}_{ij})$ a priori, it has to be estimated from the data. As mentioned before, if $\eta_{0}(\mathbf{x})=\eta_{1}(\mathbf{x})=\eta$, then the probabilities of recall bias occurrence in the $Y=1$ and $Y=0$ groups are the same. In this case, the prognostic score can be used in stratification while estimating the treatment effect. Since the exposure is under-reported, we know that $Z_{ij}^{*}=1$ always implies $Z_{ij}=1$. We first estimate $\bm{\gamma}_{\mathbf{X}}$ by using the data of the $Z_{ij}^{*}=1$ group. Assuming that the recall bias occurs randomly, we then calculate the prognostic scores $\Psi(\mathbf{X}_{ij})=\bm{\gamma}_{\mathbf{X}}^{T}\mathbf{X}_{ij}$ for all individuals. The outcome models should be correctly specified for prognostic score stratification. Even though $\widehat{\tau}^{Prog}$ needs fewer modeling assumptions than $\widehat{\tau}^{ML}$, a modeling assumption is still required. Moreover, score-based stratifications need the further assumption that $\eta_{0}(\mathbf{x})=\eta_{1}(\mathbf{x})=\eta$ to be justified.
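Once a score has been estimated, whether a propensity score or a prognostic score, strata are typically formed by cutting the score at its quantiles. The text does not fix the number of strata or the cutting rule, so the sketch below assumes equal-sized quantile strata (five by default); the function name and the example scores are hypothetical.

```python
def quantile_strata(scores, n_strata=5):
    """Assign each individual to one of n_strata equal-sized score strata.

    Individuals are ranked by score; stratum 0 holds the lowest scores.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = rank * n_strata // n
    return labels

# Hypothetical estimated prognostic scores for ten individuals.
scores = [0.9, -1.2, 0.3, 2.1, -0.4, 0.0, 1.5, -2.0, 0.7, -0.8]
labels = quantile_strata(scores, n_strata=5)
print(labels)
```

Each stratum label then indexes one observed $2\times 2$ table of $Y$ against $Z^{*}$, to which the stratum-specific adjustment is applied.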

4.2.3. Blocking

In Sections 4.2.1 and 4.2.2, proper scores based on modeling assumptions are required to create valid strata. Score-based stratification can also be problematic if $\eta_{0}(\mathbf{x})$ and $\eta_{1}(\mathbf{x})$ differ significantly. Stratification based on the propensity score requires a correctly specified treatment model, and stratification based on the prognostic score requires a correctly specified outcome model. In contrast, the blocking method does not require any model assumption. Our goal is to make the covariates $\mathbf{X}_{ij}$ within each block $i$ similar. If the covariates in each block are almost the same, then we assume that $(Y_{ij}(0),Y_{ij}(1))\perp\!\!\!\perp Z_{ij}$ holds in each block $i$. Karmakar et al. (2021) used the blockingChallenge package in R to build blocks.

Suppose that there are $N=Ik$ individuals. To make $I$ blocks of size $k$, $I$ individuals are first randomly chosen as template individuals, one for each block. The remaining $I(k-1)$ individuals are then matched to the template individuals using optimal matching at a ratio of $(k-1):1$. After this first blocking, we select from each block the individual who is the most distant from the remaining $k-1$ individuals. Setting these $I$ individuals as the new template individuals, optimal matching is used again to build $I$ blocks. Repeating this process moves the stratification toward a minimum within-block distance. We repeat it until no changes occur, which yields $I$ blocks that serve as strata of size $k$.
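The iteration described above can be sketched as follows. This is a minimal illustration rather than the blockingChallenge implementation: it replaces optimal matching with a simple greedy assignment, uses Euclidean distance, and caps the number of iterations; the function names and the toy covariates are hypothetical.

```python
import random

def dist(u, v):
    """Euclidean distance between two covariate vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def assign_to_templates(x, templates, k):
    """Greedy stand-in for optimal (k-1):1 matching to the templates."""
    blocks = {t: [t] for t in templates}
    rest = [i for i in range(len(x)) if i not in blocks]
    # Visit (individual, template) pairs from closest to farthest.
    pairs = sorted((dist(x[i], x[t]), i, t) for i in rest for t in templates)
    placed = set()
    for _, i, t in pairs:
        if i not in placed and len(blocks[t]) < k:
            blocks[t].append(i)
            placed.add(i)
    return list(blocks.values())

def most_distant(block, x):
    """Block member with the largest total distance to the other members."""
    return max(block, key=lambda i: sum(dist(x[i], x[j]) for j in block))

def build_blocks(x, k, seed=0, max_iter=20):
    """Alternate template selection and re-matching until blocks stop changing."""
    assert len(x) % k == 0, "sketch assumes N = I * k"
    rng = random.Random(seed)
    templates = rng.sample(range(len(x)), len(x) // k)
    blocks = assign_to_templates(x, templates, k)
    for _ in range(max_iter):
        templates = [most_distant(b, x) for b in blocks]
        new_blocks = assign_to_templates(x, templates, k)
        if sorted(sorted(b) for b in new_blocks) == sorted(sorted(b) for b in blocks):
            break
        blocks = new_blocks
    return blocks

# Hypothetical covariates: 12 individuals with 2 covariates each.
data_rng = random.Random(1)
x = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(12)]
blocks = build_blocks(x, k=3)
```

Because the greedy step is not an optimal matching, this sketch does not guarantee that the within-block distance decreases at every iteration; it only illustrates the template-and-rematch structure.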

The blocking method does not require any model assumption. However, the covariates $\mathbf{X}_{ij}$ in each block $i$ need to be similar. When covariate balance is difficult to achieve or overlap is weak, such blocks cannot be obtained. If covariate balance within blocks can be easily achieved, the blocking method is likely to provide a reliable estimator. Unlike $\widehat{\tau}^{ML}$, $\widehat{\tau}^{Prop}$, and $\widehat{\tau}^{Prog}$, $\widehat{\tau}^{B}$ has the advantage that no modeling assumptions are necessary, so this stratification technique remains robust under model misspecification. We will examine the performance of these estimators in Section 5.

5. Simulation Studies

We conduct simulation studies to compare the performance of the proposed methods: (1) ML, (2) propensity score stratification, (3) prognostic score stratification, and (4) blocking. We consider various model specification scenarios to examine how they can successfully recover the true treatment effect under different model misspecification cases. In addition, we include Naïve estimators based on inverse probability weighting (IPW) and outcome regression (OR), assuming no misclassification error.

We consider four independent covariates, $\mathbf{X}_{i}=(X_{i1},X_{i2},X_{i3},X_{i4})$ . $X_{i1}$ and $X_{i2}$ are binary covariates, whereas $X_{i3}$ and $X_{i4}$ are continuous covariates. We also consider four simulation scenarios where the exposure and outcome models are correctly specified or misspecified: (i) (cor, cor), (ii) (cor, mis), (iii) (mis, cor), and (iv) (mis, mis). For example, (mis, cor) means the exposure model is misspecified, but the outcome model is correctly specified. We randomly generate exposure $Z_{i}$ and potential outcomes $(Y_{i}(0),Y_{i}(1))$ of each individual depending on the model specification scenario. However, due to recall bias, we cannot observe the true exposure $Z_{i}$ ; we observe the biased exposure $Z_{i}^{\ast}$ instead. We assume that the exposure is under-reported for this simulation study. We generate $Z_{i}^{\ast}$ based on the observed outcome $Y_{i}=Y_{i}(1)Z_{i}+Y_{i}(0)(1-Z_{i})$ (See Web Appendix for the detailed simulation settings).
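The under-reporting mechanism for $Z_{i}^{\ast}$ can be sketched as follows. This is only an illustration of the mechanism, not the simulation design in the Web Appendix: it assumes that $\eta_{y}$ is the probability that a truly exposed individual with outcome $Y_{i}=y$ reports no exposure, and the exposure and outcome probabilities below are hypothetical.

```python
import random

def misreport_exposure(z, y, eta0, eta1, rng):
    """Return the reported exposure Z* under under-reporting.

    A truly exposed unit (z == 1) reports Z* = 0 with probability eta1
    when y == 1 and with probability eta0 when y == 0. Unexposed units
    always report 0, so Z* = 1 implies Z = 1.
    """
    if z == 0:
        return 0
    eta = eta1 if y == 1 else eta0
    return 0 if rng.random() < eta else 1

rng = random.Random(2024)
n = 20000
# Hypothetical data-generating values, chosen only for illustration.
z = [1 if rng.random() < 0.3 else 0 for _ in range(n)]
y = [1 if rng.random() < (0.5 if zi == 1 else 0.3) else 0 for zi in z]
z_star = [misreport_exposure(zi, yi, 0.1, 0.2, rng) for zi, yi in zip(z, y)]

# Empirical under-reporting rate among truly exposed units with Y = 1.
exposed_y1 = [zs for zi, yi, zs in zip(z, y, z_star) if zi == 1 and yi == 1]
rate_y1 = 1 - sum(exposed_y1) / len(exposed_y1)
```

With $(\eta_{0},\eta_{1})=(0.1,0.2)$, the empirical rate `rate_y1` should be close to 0.2, and no unexposed individual is ever recorded as exposed.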

We compare the methods by how well they recover the true ATE under the different model misspecification scenarios. In addition to model misspecification, we consider two sample sizes ($N=1000$ or $2000$) and two constant recall bias parameter functions ($(\eta_{0}(\mathbf{x}),\eta_{1}(\mathbf{x}))=(0.1,0.1)$ or $(0.1,0.2)$) throughout this simulation. We fix the stratum size to 50 in the stratification methods, together with the nearest-neighbor combination method. Table 2 shows the simulation results obtained from 1000 simulated datasets.

Naïve estimators that assume no misclassification error perform poorly across the model misspecification scenarios, particularly under differential recall bias. This highlights the need to adjust for the bias rather than overlook the potential for exposure misclassification. Among the proposed estimators, if both the exposure and potential outcome models are correctly specified, then $\widehat{\tau}^{ML}$ is the best estimator. $\widehat{\tau}^{Prop}$ and $\widehat{\tau}^{Prog}$ show similar performance in each scenario. In particular, even in the treatment model misspecification scenario, stratification based on the propensity score shows slightly better results than stratification based on the prognostic score. Score-based stratification performs reasonably well even when $\eta_{0}$ and $\eta_{1}$ differ. $\widehat{\tau}^{B}$ provides the least biased estimate when both models are misspecified. In contrast, $\widehat{\tau}^{ML}$ shows the worst performance in the (mis, mis) scenario. As expected, the model dependency is the smallest for the blocking method and the largest for the ML method. This explains the good performance of the blocking estimator and the poor performance of the ML estimator in the worst model misspecification scenario. Even though it requires only weak assumptions, the blocking estimator performs well in every model misspecification scenario. If the models are misspecified, $\widehat{\tau}^{ML}$, $\widehat{\tau}^{Prop}$, and $\widehat{\tau}^{Prog}$ are no longer consistent estimators of $\tau$.

Table 2. Performance of the estimation methods for recovering the average treatment effect. Six methods are compared: (1) Naïve inverse probability weighting (IPW), (2) Naïve outcome regression (OR), (3) maximum likelihood (ML), (4) stratification based on propensity scores (Prop), (5) stratification based on prognostic scores (Prog), and (6) blocking (Block). Absolute bias is reported with root mean square error (RMSE) in parentheses; all values are multiplied by 100.

			Method
$(\eta_{0},\eta_{1})$	Scenario	$N$	Naïve IPW	Naïve OR	ML	Prop	Prog	Block
(0.1, 0.1)	(cor, cor)	1000	0.553 (4.802)	0.584 (4.629)	0.040 (3.106)	0.065 (3.338)	0.399 (3.381)	0.993 (3.391)
		2000	0.579 (3.449)	0.568 (3.313)	0.004 (2.117)	0.016 (2.233)	0.268 (2.291)	0.570 (2.349)
	(cor, mis)	1000	0.310 (4.842)	0.228 (4.532)	0.238 (3.097)	0.056 (3.289)	0.143 (3.246)	0.044 (3.226)
		2000	0.529 (3.560)	0.266 (3.336)	0.201 (2.174)	0.008 (2.284)	0.157 (2.280)	0.049 (2.289)
	(mis, cor)	1000	0.352 (4.044)	0.383 (4.025)	0.031 (2.733)	0.063 (2.879)	0.075 (2.834)	0.150 (2.913)
		2000	0.610 (2.900)	0.636 (2.892)	0.015 (1.926)	0.025 (1.954)	0.046 (1.977)	0.106 (2.027)
	(mis, mis)	1000	2.050 (4.557)	2.014 (4.531)	2.952 (4.023)	2.540 (3.837)	2.658 (3.896)	1.860 (3.432)
		2000	2.273 (3.649)	2.235 (3.613)	3.106 (3.659)	2.594 (3.318)	2.782 (3.444)	1.384 (2.457)
(0.1, 0.2)	(cor, cor)	1000	4.843 (7.066)	4.910 (6.922)	0.061 (3.274)	0.050 (3.567)	0.321 (3.517)	0.877 (3.565)
		2000	4.523 (5.717)	4.654 (5.754)	0.051 (2.245)	0.012 (2.334)	0.207 (2.404)	0.371 (2.431)
	(cor, mis)	1000	4.608 (6.993)	4.659 (6.733)	0.271 (3.262)	0.337 (3.541)	0.465 (3.484)	0.281 (3.397)
		2000	4.881 (6.076)	4.765 (5.884)	0.026 (2.302)	0.004 (2.390)	0.233 (2.381)	0.108 (2.445)
	(mis, cor)	1000	5.155 (6.475)	5.148 (6.447)	0.062 (2.881)	0.137 (2.998)	0.180 (2.978)	0.062 (3.073)
		2000	4.922 (5.817)	4.960 (5.846)	0.086 (2.081)	0.027 (2.139)	0.021 (2.099)	0.034 (2.208)
	(mis, mis)	1000	2.334 (4.869)	2.397 (4.891)	3.370 (4.443)	2.916 (4.198)	3.006 (4.296)	2.164 (3.743)
		2000	2.255 (3.724)	2.288 (3.733)	3.129 (3.725)	2.663 (3.407)	2.778 (3.490)	1.403 (2.557)
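As a reading aid for Table 2, the two reported summaries can be computed from replicated estimates as sketched below. The sketch assumes that absolute bias means the absolute difference between the average estimate and the true ATE; the function name and the four example estimates are hypothetical.

```python
import math

def abs_bias_and_rmse(estimates, truth, scale=100.0):
    """Absolute bias and RMSE of replicated estimates, multiplied by `scale`."""
    n = len(estimates)
    mean_est = sum(estimates) / n
    abs_bias = abs(mean_est - truth)
    rmse = math.sqrt(sum((e - truth) ** 2 for e in estimates) / n)
    return scale * abs_bias, scale * rmse

# Four hypothetical estimates of a true ATE of 0.10.
bias, rmse = abs_bias_and_rmse([0.10, 0.12, 0.08, 0.14], truth=0.10)
```

In the simulation study the list would contain 1000 estimates, one per simulated dataset.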

6. Data Example: Child Abuse and Adult Anger

In this section, we apply the causal inference framework to the motivating example of our research, which examines the causal relationship between childhood abuse and adult anger. We consider a retrospective cohort study to examine the question, “Does child abuse by either parent increase the likelihood of adult anger?” This study uses the publicly available 1993-1994 sibling survey of the Wisconsin Longitudinal Study (WLS). The treatment is defined as the presence or absence of childhood abuse by either the father or mother, and the outcome is a binary indicator of whether the respondent exhibits a high anger score. See Springer et al. (2007) and Small et al. (2013) for additional details regarding the WLS data.

Springer et al. (2007) indicated that the results might be affected by a tendency to under-report abuse: adults may not report childhood abuse even when it occurred. With this information, we applied (1) ML, (2) propensity score stratification, (3) prognostic score stratification, and (4) blocking to estimate the ATE. The logistic outcome regression with the seven covariates without interaction terms is considered for the ML method. The same exposure model is used for propensity score stratification, whereas prognostic score stratification is based on the same outcome model. Ten strata are constructed by using the quantile values of the estimated score. A block size of 20 is used for the blocking method. Fergusson et al. (2000) reported a substantial rate of false negative responses (approximately 50%) in the reporting of childhood abuse, whereas false positive responses were absent. Based on this study, we consider $0\leq\eta_{0}(\mathbf{x}),\eta_{1}(\mathbf{x})\leq\delta$ with $\delta\leq 0.5$ and compute the bounds as $\delta$ increases to 0.5.
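The construction of ten strata from the quantiles of an estimated score can be sketched as follows. This is a minimal rank-based version; the function name and the toy scores are hypothetical, and ties are broken by position.

```python
def quantile_strata(scores, n_strata=10):
    """Assign each unit to one of n_strata groups by the rank of its score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    strata = [0] * len(scores)
    for rank, i in enumerate(order):
        # Integer arithmetic keeps the strata sizes as equal as possible.
        strata[i] = rank * n_strata // len(scores)
    return strata

# Hypothetical estimated scores for 100 units (a shuffled grid on [0, 1)).
scores = [((37 * i) % 100) / 100 for i in range(100)]
strata = quantile_strata(scores, n_strata=10)
sizes = [strata.count(s) for s in range(10)]
```

Here each of the ten strata receives ten units, and units with larger scores fall into higher-numbered strata.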

Figure 1. (a) Bounds of the average treatment effect (ATE) with $0\leq\eta_{0},\eta_{1}\leq\delta$, (b) bounds of the ATE with $0\leq\eta_{0}\leq\eta_{1}\leq\delta$, (c) point estimates of the ATE across the line of $\eta_{0}=\eta_{1}$, and (d) point estimates of the ATE from maximum likelihood (ML) and blocking methods with 95% bootstrapped confidence intervals.

As shown in Figure 1(a), the bounds become wider as $\delta$ increases. The bounds from all four methods stay above zero until $\delta=0.22$. The four methods show a similar pattern, but the blocking method is the least sensitive to $\delta$. Since the results of prognostic score stratification are similar to those of blocking, this figure may indicate that the propensity score model is misspecified. Moreover, Deblinger and Runyon (2005) stated that individuals who have high anger scores are more likely to experience recall bias when reporting childhood abuse experiences, that is, $\eta_{0}(\mathbf{x})\leq\eta_{1}(\mathbf{x})$. This allows us to tighten the lower bounds of the ATE using Proposition 2, as presented in Figure 1(b). Hence, based on this assumption and without knowing $\eta_{0}(\mathbf{x})$ and $\eta_{1}(\mathbf{x})$, we can conclude that childhood abuse has a causal effect on high anger scores in individuals, even in the presence of differential recall bias.

Table 3. The effects of recall bias for six values of $\eta_{0}=\eta_{1}$. The estimates and 95% bootstrap confidence intervals are displayed for the maximum likelihood (ML) and stratification methods.

	Method
$(\eta_{0},\eta_{1})$	ML	Prop	Prog	Block
(0.0,0.0)	0.067 ( $\pm$ 0.052)	0.066 ( $\pm$ 0.035)	0.088 ( $\pm$ 0.055)	0.090 ( $\pm$ 0.081)
(0.1,0.1)	0.068 ( $\pm$ 0.053)	0.068 ( $\pm$ 0.035)	0.089 ( $\pm$ 0.055)	0.091 ( $\pm$ 0.082)
(0.2,0.2)	0.070 ( $\pm$ 0.055)	0.069 ( $\pm$ 0.035)	0.091 ( $\pm$ 0.056)	0.093 ( $\pm$ 0.084)
(0.3,0.3)	0.073 ( $\pm$ 0.057)	0.072 ( $\pm$ 0.034)	0.093 ( $\pm$ 0.058)	0.096 ( $\pm$ 0.086)
(0.4,0.4)	0.076 ( $\pm$ 0.061)	0.075 ( $\pm$ 0.033)	0.097 ( $\pm$ 0.059)	0.100 ( $\pm$ 0.089)
(0.5,0.5)	0.081 ( $\pm$ 0.066)	0.082 ( $\pm$ 0.033)	0.103 ( $\pm$ 0.062)	0.106 ( $\pm$ 0.093)

We may further narrow down the bounds of the ATE if we make a stronger assumption. Fergusson et al. (2000) suggested that the probabilities of recall bias may not differ significantly based on an adult’s anger score. This suggests that recall bias may not be strongly related to an individual’s level of anger, which allows us to assume $\eta_{0}(\mathbf{x})=\eta_{1}(\mathbf{x})$. Robins et al. (1985) pointed out that reporters’ demographic characteristics, such as sex, age, and social class, have a minimal impact on recall bias, which further allows us to assume $\eta_{0}(\mathbf{x})=\eta_{1}(\mathbf{x})=\eta$. The range $0\leq\eta_{0}=\eta_{1}\leq 0.5$ requires the strongest assumption, but it lets us summarize the results succinctly. The estimates for various values of $\eta_{0}$ and $\eta_{1}$ are shown in Table 3. In this case, the variance can also be estimated, so we can provide confidence intervals for various $\delta$ values. Figure 1(c) shows the estimates of the ATE across the line of $\eta_{0}=\eta_{1}$. All the estimates increase as $\eta_{0}=\eta_{1}$ increases. Furthermore, the 95% CIs of all methods do not contain 0. In Figure 1(d), we particularly focus on the results of the ML and blocking estimators when $\eta_{0}=\eta_{1}$. Even though the confidence interval of the blocking estimator is wider than that of the ML estimator, possibly because it relies on weaker assumptions, the confidence interval still stays above 0. These results imply that the under-reporting issue does not alter the initial conclusion; on the contrary, it strengthens the conclusion that there is significant evidence that child abuse increases the adult anger score.

We also conduct a sensitivity analysis of recall bias with parameters $0\leq\eta_{0},\eta_{1}\leq 0.5$ . Figure 2 shows contour plots of the estimated ATEs for the values of $\eta_{0}$ and $\eta_{1}$ in this region. This figure reveals that most of the estimates are above 0. Especially for the blocking method, estimates are below 0 only for a small region of $\eta_{0}\geq\frac{1}{2}\eta_{1}+0.4$ for $0\leq\eta_{1}\leq 0.2$ .
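The logic of this sensitivity analysis can be illustrated with a single 2x2 table. The sketch below is not the stratified estimator itself: it applies the under-reporting correction to one table of hypothetical counts, where `a` and `b` are the numbers with $Z^{*}=1$ and $Y=1$ or $Y=0$, and `c` and `d` are the numbers with $Z^{*}=0$ and $Y=1$ or $Y=0$ (this cell labelling is an assumption made for the example).

```python
def corrected_risk_difference(a, b, c, d, eta0, eta1):
    """Risk difference after undoing under-reporting in one 2x2 table."""
    t1 = a / (1 - eta1)               # corrected count: exposed, Y = 1
    t0 = b / (1 - eta0)               # corrected count: exposed, Y = 0
    u1 = c - a * eta1 / (1 - eta1)    # corrected count: unexposed, Y = 1
    u0 = d - b * eta0 / (1 - eta0)    # corrected count: unexposed, Y = 0
    if u1 < 0 or u0 < 0 or u1 + u0 <= 0:
        return None                   # (eta0, eta1) incompatible with the table
    return t1 / (t1 + t0) - u1 / (u1 + u0)

# Hypothetical counts; the grid covers 0 <= eta0, eta1 <= 0.5 in steps of 0.1.
a, b, c, d = 60, 140, 200, 600
grid = {
    (i / 10, j / 10): corrected_risk_difference(a, b, c, d, i / 10, j / 10)
    for i in range(6)
    for j in range(6)
}
```

In this toy table the corrected estimate rises as $\eta_{1}$ grows and falls as $\eta_{0}$ grows, so negative values appear only where $\eta_{0}$ is large relative to $\eta_{1}$.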

7. Discussion

In this paper, we derived the ATE bounds and incorporated knowledge from prior research in order to narrow the bounds. We also proposed several approaches to estimate the bounds. Most of the discussion focused on the ATE, but the same argument can be applied to other measures such as the average treatment effect on the treated (ATT), risk ratio, or odds ratio. In the Web Appendix, we include more detailed discussions about these measures.

Another difficulty in applying the proposed stratification methods arises in computation. For instance, in some cases, both $a_{i}^{*}$ and $b_{i}^{*}$ are zero. To avoid such computational issues, we also propose a nearest-neighbor combination method to improve the stability of the estimates. This method is discussed in the Web Appendix.

Finally, one limitation is that we cannot assess the covariate balance before adjustment since the exposure variable is misclassified. We can consider an indirect approach to check the covariate balance under the constant recall bias assumption. If stratification is successful, we may assume that the covariate distributions of the treated and control groups are equal, at least asymptotically. Also, if the magnitude of recall bias is independent of the covariates, we are able to assess the balance with the biased treatment $Z^{*}$. First, we calculate the bias-corrected numbers of treated and control units as $a_{i}^{*}/(1-\eta_{1})+b_{i}^{*}/(1-\eta_{0})$ and $c_{i}^{*}+d_{i}^{*}-a_{i}^{*}(\eta_{1}/(1-\eta_{1}))-b_{i}^{*}(\eta_{0}/(1-\eta_{0}))$, respectively. Second, under the assumption that the covariate vectors within each stratum are similar, we can compute the average covariate vector for each stratum. Every individual in the same stratum shares the same average vector, which is not restrictive since we ultimately compare weighted means. Finally, we compare the absolute standardized mean difference between the treated and control groups using corrected weights based on this assumption. This new technique is also discussed in the Web Appendix.
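The three steps of this indirect balance check can be sketched for a single covariate as follows. The sketch makes several assumptions for illustration: each stratum is summarized by its four observed cell counts and its average covariate value, the standardizing standard deviation `sd` is supplied by the user (the choice of denominator is not specified here), and all names and numbers are hypothetical.

```python
def corrected_counts(a, b, c, d, eta0, eta1):
    """Bias-corrected numbers of treated and control units in one stratum."""
    treated = a / (1 - eta1) + b / (1 - eta0)
    control = c + d - a * eta1 / (1 - eta1) - b * eta0 / (1 - eta0)
    return treated, control

def weighted_smd(strata, eta0, eta1, sd):
    """Absolute standardized mean difference using corrected counts as weights."""
    n1 = n0 = s1 = s0 = 0.0
    for s in strata:
        t, u = corrected_counts(s["a"], s["b"], s["c"], s["d"], eta0, eta1)
        n1 += t
        n0 += u
        s1 += t * s["xbar"]   # every unit in the stratum carries the stratum mean
        s0 += u * s["xbar"]
    return abs(s1 / n1 - s0 / n0) / sd

# Two hypothetical strata with stratum means 1.0 and 3.0 for one covariate.
balanced = [
    {"a": 10, "b": 20, "c": 30, "d": 40, "xbar": 1.0},
    {"a": 10, "b": 20, "c": 30, "d": 40, "xbar": 3.0},
]
unbalanced = [
    {"a": 5, "b": 10, "c": 40, "d": 45, "xbar": 1.0},
    {"a": 20, "b": 30, "c": 20, "d": 30, "xbar": 3.0},
]
smd_balanced = weighted_smd(balanced, eta0=0.1, eta1=0.1, sd=1.0)
smd_unbalanced = weighted_smd(unbalanced, eta0=0.1, eta1=0.1, sd=1.0)
```

The corrected treated and control counts always add up to the stratum size, so the correction only reallocates units between the two groups.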

Acknowledgments

This work was supported by NIH grants (R01ES026217, R01MD012769, R01ES028033, 1R01ES030616, 1R01AG066793-01R01, 1R01ES029950, R01ES028033-S1), Alfred P. Sloan Foundation (G-2020-13946), Vice Provost for Research at Harvard University (Climate Change Solutions Fund), the New Faculty Startup Fund from Seoul National University, the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (2021R1C1C1012750), and the Global-LAMP Program of the National Research Foundation of Korea (NRF) grant funded by the Ministry of Education (No. RS-2023-00301976).

Supplementary Materials

Supplementary materials containing the Web Appendices are available online. The R code is available at https://github.com/suhwanbong121/recall_bias_observational_study.

References

Armstrong, B. (1985). Measurement error in the generalised linear model. Communications in Statistics-Simulation and Computation, 14(3):529–544.
Babanezhad, M., Vansteelandt, S., and Goetghebeur, E. (2010). Comparison of causal effect estimators under exposure misclassification. Journal of Statistical Planning and Inference, 140(5):1306–1319.
Braun, D., Gorfine, M., Parmigiani, G., Arvold, N. D., Dominici, F., and Zigler, C. (2017). Propensity scores with misclassified treatment assignment: a likelihood-based adjustment. Biostatistics, 18(4):695–710.
Bross, I. (1954). Misclassification in 2 x 2 tables. Biometrics, 10(4):478–486.
Carroll, R. J., Gallo, P., and Gleser, L. J. (1985). Comparison of least squares and errors-in-variables regression, with special reference to randomized analysis of covariance. Journal of the American Statistical Association, 80(392):929–932.
Carroll, R. J., Ruppert, D., and Stefanski, L. A. (1995). Measurement error in nonlinear models. New York: Chapman and Hall.
Carroll, R. J. and Stefanski, L. A. (1990). Approximate quasi-likelihood estimation in models with surrogate predictors. Journal of the American Statistical Association, 85(411):652–663.
Cochran, W. G. (1968). Errors of measurement in statistics. Technometrics, 10(4):637–666.
Deblinger, E. and Runyon, M. K. (2005). Understanding and treating feelings of shame in children who have experienced maltreatment. Child Maltreatment, 10(4):364–376.
Fergusson, D. M., Horwood, L. J., and Woodward, L. J. (2000). The stability of child abuse reports: a longitudinal study of the reporting behaviour of young adults. Psychological Medicine, 30(3):529–544.
Fuller, W. A. (1980). Properties of some estimators for the errors-in-variables model. The Annals of Statistics, 8(2):407–422.
Gleser, L. J. (1990). Improvements of the naive approach to estimation in nonlinear errors-in-variables regression models. Contemporary Mathematics, 112:99–114.
Gravel, C. A. and Platt, R. W. (2018). Weighted estimation for confounded binary outcomes subject to misclassification. Statistics in Medicine, 37(3):425–436.
Hansen, B. B. (2008). The prognostic analogue of the propensity score. Biometrika, 95(2):481–488.
Imai, K. and Yamamoto, T. (2010). Causal inference with differential measurement error: Nonparametric identification and sensitivity analysis. American Journal of Political Science, 54(2):543–560.
Karmakar, B., Small, D. S., and Rosenbaum, P. R. (2021). Reinforced designs: Multiple instruments plus control groups as evidence factors in an observational study of the effectiveness of catholic schools. Journal of the American Statistical Association, 116(533):82–92.
Lindley, D. (1953). Estimation of a functional relationship. Biometrika, 40(1/2):47–49.
Lockwood, J. R. and McCaffrey, D. F. (2016). Matching and weighting with functions of error-prone covariates for causal inference. Journal of the American Statistical Association, 111(516):1831–1839.
Lord, F. M. (1960). Large-sample covariance analysis when the control variable is fallible. Journal of the American Statistical Association, 55(290):307–321.
McCaffrey, D. F., Lockwood, J. R., and Setodji, C. M. (2013). Inverse probability weighting with error-prone covariates. Biometrika, 100(3):671–680.
Raphael, K. (1987). Recall bias: a proposal for assessment and control. International Journal of Epidemiology, 16(2):167–170.
Robins, L. N., Schoenberg, S. P., Holmes, S. J., Ratcliff, K. S., Benham, A., and Works, J. (1985). Early home environment and retrospective recall: A test for concordance between siblings with and without psychiatric disorders. American Journal of Orthopsychiatry, 55(1):27–41.
Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.
Rosner, B., Spiegelman, D., and Willett, W. C. (1990). Correction of logistic regression relative risk estimates and confidence intervals for measurement error: the case of multiple covariates measured with error. American Journal of Epidemiology, 132(4):734–745.
Rosner, B., Willett, W., and Spiegelman, D. (1989). Correction of logistic regression relative risk estimates and confidence intervals for systematic within-person measurement error. Statistics in Medicine, 8(9):1051–1069.
Rothman, K. J. (2012). Epidemiology: an introduction. New York: Oxford University Press.
Rothman, K. J., Greenland, S., Lash, T. L., et al. (2008). Modern epidemiology, volume 3. Wolters Kluwer Health/Lippincott Williams & Wilkins Philadelphia.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688–701.
Rubin, D. B. (1980). Randomization analysis of experimental data: The fisher randomization test comment. Journal of the American Statistical Association, 75(371):591–593.
Small, D. S., Cheng, J., Halloran, M. E., and Rosenbaum, P. R. (2013). Case definition and design sensitivity. Journal of the American Statistical Association, 108(504):1457–1468.
Springer, K. W., Sheridan, J., Kuo, D., and Carnes, M. (2007). Long-term physical and mental health consequences of childhood physical abuse: Results from a large population-based sample of men and women. Child Abuse & Neglect, 31(5):517–530.
Stefanski, L. A. and Carroll, R. J. (1985). Covariate measurement error in logistic regression. The Annals of Statistics, 14(3):1335–1351.

$\displaystyle\frac{p_{11}^{*}(\mathbf{x})}{p_{11}^{*}(\mathbf{x})+\frac{1}{1-\delta}p_{01}^{*}(\mathbf{x})}\leq p_{1|1}(\mathbf{x})\leq\frac{p_{11}^{*}(\mathbf{x})}{p_{11}^{*}(\mathbf{x})+(1-\delta)p_{01}^{*}(\mathbf{x})},$
$\displaystyle\frac{p_{10}^{*}(\mathbf{x})-\frac{\delta}{1-\delta}p_{11}^{*}(\mathbf{x})}{p_{10}^{*}(\mathbf{x})+p_{00}^{*}(\mathbf{x})-\frac{\delta}{1-\delta}p_{11}^{*}(\mathbf{x})}\leq p_{1|0}(\mathbf{x})\leq\frac{p_{10}^{*}(\mathbf{x})}{p_{10}^{*}(\mathbf{x})+p_{00}^{*}(\mathbf{x})-\frac{\delta}{1-\delta}p_{01}^{*}(\mathbf{x})}.$
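The two bounds displayed above can be evaluated numerically as in the following sketch. It assumes that $p_{yz}^{*}(\mathbf{x})$ is the observed joint probability of $Y=y$ and $Z^{*}=z$ at a fixed covariate value; the function names and the example probabilities are hypothetical. The last function combines the component bounds by simple differencing, which is offered here only as an illustration of how bounds on the effect at $\mathbf{x}$ could follow.

```python
def bounds_p11(p11, p01, delta):
    """Bounds on p_{1|1}(x) when 0 <= eta0, eta1 <= delta."""
    lower = p11 / (p11 + p01 / (1 - delta))
    upper = p11 / (p11 + (1 - delta) * p01)
    return lower, upper

def bounds_p10(p11, p01, p10, p00, delta):
    """Bounds on p_{1|0}(x) when 0 <= eta0, eta1 <= delta."""
    r = delta / (1 - delta)
    lower = (p10 - r * p11) / (p10 + p00 - r * p11)
    upper = p10 / (p10 + p00 - r * p01)
    return lower, upper

def bounds_difference(p11, p01, p10, p00, delta):
    """Bounds on p_{1|1}(x) - p_{1|0}(x) obtained by differencing."""
    l1, u1 = bounds_p11(p11, p01, delta)
    l0, u0 = bounds_p10(p11, p01, p10, p00, delta)
    return l1 - u0, u1 - l0

# Hypothetical observed probabilities p*_{11}, p*_{01}, p*_{10}, p*_{00}.
obs = dict(p11=0.06, p01=0.14, p10=0.20, p00=0.60)
no_bias = bounds_difference(delta=0.0, **obs)
with_bias = bounds_difference(delta=0.2, **obs)
```

With $\delta=0$ both bounds collapse to the uncorrected difference, and they widen as $\delta$ increases, in line with the pattern described for Figure 1(a).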