Derivation of outcome-dependent dietary patterns for low-income women obtained from survey data using a Supervised Weighted Overfitted Latent Class Analysis

Stephanie M. Wu ^1,∗, Matthew R. Williams ^2,∗∗, Terrance D. Savitsky ^3,∗∗∗,
Briana J.K. Stephenson ^{1,∗∗∗∗}

¹Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, U.S.A
²RTI International, Research Triangle Park, North Carolina, U.S.A
³Office of Survey Methods Research, U.S. Bureau of Labor Statistics, Washington, DC, U.S.A
*email: [email protected]
*email: [email protected]
*email: [email protected]
*email: [email protected]

Abstract

Poor diet quality is a key modifiable risk factor for hypertension and disproportionately impacts low-income women. Analyzing diet-driven hypertensive outcomes in this demographic is challenging due to the complexity of dietary data and selection bias when the data come from surveys, a main data source for understanding diet-disease relationships in understudied populations. Supervised Bayesian model-based clustering methods summarize dietary data into latent patterns that holistically capture relationships among foods and a known health outcome but do not sufficiently account for complex survey design. This leads to biased estimation and inference and lack of generalizability of the patterns. To address this, we propose a supervised weighted overfitted latent class analysis (SWOLCA) based on a Bayesian pseudo-likelihood approach that integrates sampling weights into an exposure-outcome model for discrete data. Our model adjusts for stratification, clustering, and informative sampling, and handles modifying effects via interaction terms within a Markov chain Monte Carlo Gibbs sampling algorithm. Simulation studies confirm that the SWOLCA model exhibits good performance in terms of bias, precision, and coverage. Using data from the National Health and Nutrition Examination Survey (2015-2018), we demonstrate the utility of our model by characterizing dietary patterns associated with hypertensive outcomes among low-income women in the United States.

Keywords— Bayesian clustering; Dietary pattern analysis; Latent class analysis; NHANES; Survey design

1 Introduction

Low-income women are understudied in cardiometabolic research despite being disproportionately burdened by poor diet quality and its negative health impacts (Zhang et al., 2018). Hypertension, a pervasive and major risk factor for cardiovascular disease, illustrates this gap (Whelton et al., 2018). While considerable research explores diet-hypertension links, few cohort studies focus on low-income women. Therefore, analyzing diet-hypertension patterns in this key demographic requires greater reliance on data from surveys, which allow targeted inclusion of hard-to-reach communities through techniques such as oversampling and stratification. When using survey data, analyses must properly account for complex survey design elements to avoid biased estimation and variance underestimation and to generalize beyond the sample (Pfeffermann, 1996; Parker et al., 2022; Williams and Savitsky, 2021).

Dietary scores, such as the Dietary Approaches to Stop Hypertension (DASH) score (Sacks et al., 2001), have been used to evaluate intake of key food groups according to prescriptive guidance. These metrics are standardized across populations but can lack flexibility in grou** foods in ways more reflective of population-specific dietary behavior. Alternatively, latent class analysis (LCA) is a clustering method that achieves this level of flexibility through data-driven derivation of underlying dietary consumption patterns (Lazarsfeld and Henry, 1968; Sotres-Alvarez et al., 2010). This enables additional insight into the behaviors of targeted populations and the creation of policy tailored to their dietary needs. Exploration of diet-disease relationships using LCA typically entails a two- or three-step approach. First, LCA is used to identify patterns; then, an association is measured via regression analysis using the LCA-derived pattern as a covariate, with possible bias adjustments to account for measurement error (Fung et al., 2001; Bray et al., 2015). This is useful when testing a single exposure across many outcomes; however, when interest lies in obtaining a targeted understanding of how dietary patterns influence a specific health outcome, such as hypertension, a one-step supervised approach offers advantages in identifying outcome-informed patterns and smaller diet-outcome effects while correctly propagating classification uncertainty (Stephenson et al., 2022; Molitor et al., 2010; Elliott et al., 2020).

Extensions of LCA-based approaches to account for survey design have been met with challenges. Under a frequentist setting, high-dimensionality and sparseness of diet data lead to issues with parameter stability and matrix inversion (Asparouhov, 2005; Patterson et al., 2002). Under a Bayesian setting, models lack proper variance estimation (Stephenson and Willett, 2023; Stephenson et al., 2024), inhibit classification (Gunawan et al., 2020), or ignore the survey design entirely. Without proper incorporation of design features such as informative sampling and clustering, dietary patterns can be misidentified, and characterization of the diet-hypertension relationship can be biased with incorrect posterior intervals.

This paper aims to improve analysis of diet-hypertension patterns in low-income women using survey data. We propose a supervised weighted overfitted latent class analysis (SWOLCA) that uses a Bayesian pseudo-likelihood approach to account for complex survey design and produce accurate estimation and uncertainty quantification for a multivariate categorical exposure and a binary outcome. We also introduce a mixture reference coding scheme to allow interactions between dietary patterns and other covariates. Our model enables us to: 1) uncover the prevalence and profile of dietary patterns dependent on hypertensive status amongst low-income adult women; 2) efficiently measure the association between diet and hypertension while accounting for interactions with covariates; and 3) integrate survey sampling weights to produce accurate point and interval estimation for our target population.

The remaining sections of this paper are organized as follows. In Section 2, we describe our proposed SWOLCA model along with a brief background. In Section 3, we discuss implementation considerations for parameter estimation. In Section 4, we conduct a simulation study comparing SWOLCA with existing methods. In Section 5, we apply the model to data from the National Health and Nutrition Examination Survey (NHANES) to describe dietary pattern association with hypertension among low-income women in the US. Finally, in Section 6, we provide concluding remarks and discussion.

2 Model

2.1 Supervised Overfitted Latent Class Analysis

Supervised overfitted latent class analysis (SOLCA) is a Bayesian nonparametric mixture model that jointly estimates latent dietary patterns through an overfitted latent class model and their associations to a binary hypertension outcome through a probit regression model. In this way, the latent patterns are informed by both the multivariate categorical diet exposure, $\bm{x}_{i\cdot}=(x_{i1},\ldots,x_{iJ})^{T}$ for $J$ food items, and the binary hypertension outcome, $y_{i}$ , which can increase precision and reduce bias. Each sampled individual $i\in\{1,\ldots,n\}$ is assigned to a dietary pattern $c_{i}\in\{1,\ldots,K\}$ , where $K$ is the number of patterns. Model parameters include: the dietary pattern prevalences, $\bm{\pi}=(\pi_{1},\ldots,\pi_{K})^{T}$ , where $\sum_{k=1}^{K}\pi_{k}=1$ ; the food item consumption probabilities characterizing the pattern compositions, $(\theta_{jc_{i}1},\ldots,\theta_{jc_{i}R_{j}})$ , where $\sum_{r=1}^{R_{j}}\theta_{jc_{i}r}=1$ for food item $j$ with $R_{j}$ consumption levels; and the probit regression coefficients, $\bm{\xi}_{c_{i}\cdot}=(\xi_{c_{i}1},\ldots,\xi_{c_{i}q})^{T}$ , corresponding to $q$ regression covariates, $\bm{v_{i}}$ , given assignment to dietary pattern $c_{i}$ , which is an unknown latent random variable simultaneously determined by the model. We use the formulation of the probit regression model introduced by Albert and Chib (1993), where $z_{i}$ is a latent Gaussian variable such that $z_{i}\sim N(\bm{v}_{i}^{T}\bm{\xi}_{c_{i}\cdot},1)$ , and is truncated depending on binary outcome $y_{i}$ so that $z_{i}>0$ when $y_{i}=1$ and $z_{i}\leq 0$ otherwise. Using this formulation, the SOLCA complete data joint distribution of ( $\bm{x}_{i},c_{i},y_{i},z_{i}$ ) is:

$\displaystyle p(\bm{x}_{i\cdot},c_{i},y_{i},z_{i}$	$\displaystyle\|\bm{v}_{i},\bm{\pi},\bm{\theta},\bm{\xi})$
	$\displaystyle=p(c_{i}\|\bm{\pi})p(\bm{x}_{i}\|c_{i},\bm{\theta})p(y_{i},z_{i}\|c_% {i},\bm{v}_{i},\bm{\xi})$
	$\displaystyle=\pi_{c_{i}}\prod_{j=1}^{J}\prod_{r=1}^{R_{j}}\theta_{jc_{i}r}^{I% (x_{ij}=r)}\frac{e^{-\frac{1}{2}(z_{i}-\bm{v}_{i}^{\intercal}\bm{\xi}_{c_{i}% \cdot})^{2}}}{\sqrt{2\pi}}\Big{\{}y_{i}I(z_{i}>0)+(1-y_{i})I(z_{i}\leq 0)\Big{% \}},$	(1)

where $\bm{\xi}=(\bm{\xi}_{1\cdot}^{T},\ldots,\bm{\xi}_{K\cdot}^{T})^{T}$ is a $K\times q$ matrix of regression coefficients, $I(A)$ is the indicator function equal to 1 if $A$ is true and 0 otherwise, and $\bm{\theta}$ is a $J\times K\times R$ array with cells $\theta_{jkr}$ , where $R=\max_{j}R_{j}$ . SOLCA assumes items are independent conditional on dietary pattern assignment and individuals with the same pattern share behaviors for all food items.

The overfitted formulation of SOLCA enables a data-driven approach to select the number of patterns, $K$ , without need for post-hoc testing (Van Havre et al., 2015). $K$ is set to a conservatively high number to allow empty patterns to drop out via a sparsity-inducing Dirichlet prior: $(\pi_{1},\ldots,\pi_{K})\sim\text{Dir}(\alpha_{1},\ldots,\alpha_{K})$ , where hyperparameters $\alpha_{k}$ moderate the rate of growth for nonempty patterns and reduce the dependence of $K$ on the sample size and data structure. Smaller values of $\alpha_{k}$ yield a slower growth rate and more sparsity.

2.2 Supervised Weighted Overfitted Latent Class Analysis for Survey Data

Supervised weighted overfitted latent class analysis (SWOLCA) is an extension of SOLCA to the survey setting. To obtain unbiased estimation of our target population, survey sampling weights for all sampled individuals are necessary. Stratification and clustering information of the survey are also needed for accurate variance estimation. We follow a weighted pseudo-likelihood approach as described in Savitsky and Toth (2016) and Kunihama et al. (2016). Survey weights are used to up-weight individual likelihood contributions proportional to the number of individuals represented in the target population. This forms a weighted pseudo-likelihood that is used in place of the likelihood in the posterior update. Estimation and inference proceed using the posterior density of model parameters. Let $w_{i}$ denote the survey weight of individual $i$ , $i\in\{1,\ldots,n\}$ . We use a normalization constant $\kappa=\sum_{i=1}^{n}w_{i}/n$ so the weights sum to $n$ to reflect sampling variability. This provides a coarse adjustment for the posterior uncertainty that can be further refined in a post-processing step described below. Denote all parameters and complete data of the unweighted SOLCA model with $\bm{\Theta}$ and $\bm{D}$ , respectively. Then, the posterior density for the weighted SWOLCA approach is

\widetilde{p}(\bm{\Theta}|\bm{D})\propto p(\bm{\Theta})\prod_{i=1}^{n}p(\bm{D}% _{i}|\bm{\Theta})^{\frac{w_{i}}{\kappa}}.

(2)

Under certain regularity conditions, the posterior is consistent under unequal probability sampling (Savitsky and Toth, 2016) and complex multi-stage sampling (Williams and Savitsky, 2021, 2020). However, posterior credible intervals will exhibit undercoverage due to clustering and population generation uncertainty that are not accounted for (León-Novelo and Savitsky, 2019; Gunawan et al., 2020). To address this, we extend the post-processing adjustment proposed in Williams and Savitsky (2021) to accommodate a mixture model setting with constrained parameters. Posterior samples are rescaled to recover the correct “sandwich” form of the asymptotic variance based on pseudo-MLE theory. Let $\widehat{\bm{\Theta}}_{m}$ denote the posterior estimates for Markov chain Monte Carlo (MCMC) sample $m$ , with mean $\overline{\bm{\Theta}}$ across all samples. The rescaled estimates are

\widehat{\bm{\Theta}}_{m}^{a}=\Big{(}\widehat{\bm{\Theta}}_{m}-\overline{\bm{% \Theta}}\Big{)}\bm{R_{2}}^{-1}\bm{R_{1}}+\overline{\bm{\Theta}},

(3)

where $\bm{R_{1}}^{T}\bm{R_{1}}$ is the correct asymptotic “sandwich” covariance of the pseudo-MLE and $\bm{R_{2}}^{T}\bm{R_{2}}$ is the asymptotic covariance of the posterior. $\bm{R_{1}}$ can be obtained using a mix of resampling and computing the posterior Hessian matrix. For the post-processing adjustment, using the normalization constant $\kappa$ for the sampling weights is not strictly necessary for correct uncertainty coverage but can improve numerical stability when computing $\bm{R_{1}}$ . Some alternative variations of $\kappa$ may lead to smaller post-processing adjustments being needed, for example using an effective sample size based on variation of the weights (Spencer, 2000).

3 Parameter Estimation

3.1 MCMC Computation

For SWOLCA parameter estimation, we implement a MCMC Gibbs sampling algorithm. We follow Moran et al. (2021) and implement sampling in a two-stages: (1) an adaptive sampler estimates the appropriate number of dietary patterns, and (2) a fixed sampler generates model estimates based on the estimated number of patterns. Derivations of the full conditionals are provided in the Supplementary Materials. Proper mixing is encouraged via a random permutation sampler that is incorporated in the MCMC sampling algorithm (Frühwirth-Schnatter, 2001). Stan (Carpenter et al., 2017) is used in the calculation of the post-processing variance adjustment, as it offers automatic differentiation capabilities to compute the posterior gradient and Hessian. Due to issues with handling discrete latent variables in mixture model settings, Stan was not implemented for parameter sampling.

3.2 Mixture Reference Coding of Parameters

A common concern with mixture models is label switching (Stephens, 2000). Under a reference cell coding scheme, label switching turns the intercept and slope coefficients into noise due to switches in the reference pattern. Alternative coding schemes that have been used do not consider pattern-by-covariate interactions and lead to restrictions of the parameter space that are difficult to interpret under a probit link function (Molitor et al., 2010; Stephenson et al., 2022). We resolve this by introducing a combination of factor variable (Buis, 2012) and reference cell coding, hereafter referred to as “mixture reference coding.” In mixture reference coding, the different dietary patterns are expressed in factor variable form, while the levels of any additional covariates are expressed in reference cell form. For example, suppose $c_{i}\in\{1,2,3\}$ is individual $i$ ’s dietary pattern assignment and $v_{i}\in\{0,1\}$ is a binary covariate. Mixture reference coding for the probit regression model is given by:

$\displaystyle\mathbb{E}(y_{i}\|c_{i},v_{i})$	$\displaystyle=\Phi\Big{\{}\xi_{11}I(c_{i}=1)+\xi_{12}I(c_{i}=1)v_{i}$
	$\displaystyle\qquad+\xi_{21}I(c_{i}=2)+\xi_{22}I(c_{i}=2)v_{i}$
	$\displaystyle\qquad+\xi_{31}I(c_{i}=3)+\xi_{32}I(c_{i}=3)v_{i}\Big{\}}.$	(4)

Essentially, each dietary pattern has its own reference parameter and corresponding regression model. All interactions between dietary pattern and additional covariates are captured, and additional interactions between covariates can be specified if desired. This balances flexibility, by allowing for interactions, with parsimony, by not forcing inclusion of all interactions between variables. It also allows any dietary pattern to be set as the reference level post-hoc. Using mixture reference coding, label switching can be resolved by adapting a post-processing hierarchical clustering relabeling approach (Krebs, 1999; Medvedovic and Sivaganesan, 2002; Stephenson et al., 2022).

4 Simulation Study

4.1 Simulation Design

We conduct a simulation study to assess whether the proposed SWOLCA is able to produce valid estimation and inference of a target population sampled under complex survey designs. Our parameters of interest are the number of dietary patterns, $K$ , their estimated prevalences, $\bm{\pi}$ , the composition of each pattern, $\bm{\theta}$ , and the associations between the patterns and the observed outcome, $\bm{\xi}$ . We compare our method to two alternatives: 1) an unweighted SOLCA that ignores survey design, and 2) a two-step approach where the first step fits an unsupervised weighted overfitted latent class analysis (WOLCA) to derive the dietary patterns (Stephenson et al., 2024), and the second step treats the pattern assignments as fixed and includes them as covariates in a survey-weighted regression model using R survey package version 4.1.1 (Lumley, 2004). All models are implemented in R version 4.2.0 (R Core Team, 2023) with C++ interface using the Rcpp package version 1.0.10 (Eddelbuettel and François, 2011). We run a Gibbs sampler for 20,000 iterations with 10,000 burn-in and thinning every 5 iterations.

Data are generated for a finite population of size $N=80,000$ with a total of $K=3$ dietary patterns that are also associated with a binary outcome. Each pattern consists of $J=30$ categorical food items, consumed at one of $R=4$ levels. Survey features in the data include clustered outcomes and two unequal-sized strata in the population, with stratum membership influencing dietary pattern membership and the probability of the outcome. Full details of the data generation process are provided in the Supplementary Materials.

Model performance is evaluated for the sampling and data generating scenarios provided in Table 1. We examine three survey designs: simple random sampling (SRS); stratified sampling with unequal sampling probabilities; or stratified cluster sampling with unequal sampling probabilities and correlated outcomes. We focus on two associations of interest: a conditional outcome model with stratum included as a covariate; or a marginal outcome model that does not condition on selection or adjust for selection bias. And we compare three different sample sizes: 1% (n = 800), 5% (n = 4000), or 10% of the population (n = 8000). Bold text indicates deviation from the default setting (scenario 2) of stratified sampling with a conditional model and sample size 4000. Model robustness is also evaluated in cases where a) additional confounders are included, b) latent patterns are defined with weak identifiability, and c) weakly separated patterns are defined with a few differing exposure variables driving the true association to the outcome. Descriptions and results for these additional scenarios are not shown here but are detailed in the Supplementary Materials.

100 simulated datasets are generated for each scenario. Models are initialized with $K=30$ and Dirichlet hyperparameter $\alpha=1/K$ for all $k\in\{1\ldots,K\}$ to encourage sparsity and moderate growth of new pattern formation (Van Havre et al., 2015). A noninformative flat Dir(1) prior is used for $\bm{\theta}_{jc_{i}\cdot}$ , and weakly informative priors are used for the regression parameters $\bm{\xi}$ . To compare model performance for parameter estimation, we examine mean absolute bias (mean absolute distance between estimated and true parameter values), variability (full width of the 95% credible interval (CI), averaged over dietary patterns), and coverage (proportion of 95% CIs that cover the true population parameter values, averaged over dietary patterns).

4.2 Simulation Results

Table 1: Absolute bias, 95% credible interval width, and coverage for the unweighted SOLCA, two-step WOLCA, and proposed SWOLCA, based on posterior MCMC samples and averaged across 100 independent draws from the population. Strat = stratified sampling, Strat Cl = stratified cluster sampling, Cond = conditional model, Marg = marginal model. Notable issues of bias, imprecision, and undercoverage are underlined to improve readability.

		Absolute Bias				CI Width			Coverage
Scenario	Model	$\bm{K}$	$\bm{\pi}$	$\bm{\theta}$	$\bm{\xi}$	$\bm{\pi}$	$\bm{\theta}$	$\bm{\xi}$	$\bm{\pi}$	$\bm{\theta}$	$\bm{\xi}$
(1) SRS, Cond, n=4000	SOLCA	0.00	0.006	0.006	0.063	0.027	0.042	0.367	0.957	0.958	0.965
	WOLCA	0.00	0.006	0.006	0.063	0.036	0.044	0.762	0.950	0.958	0.992
	SWOLCA	0.00	0.006	0.006	0.063	0.027	0.042	0.419	0.947	0.953	0.983
(2) Strat, Cond, n=4000	SOLCA	0.00	0.081	0.006	0.047	0.069	0.045	0.374	0.190	0.962	0.972
	WOLCA	0.00	0.006	0.007	0.043	0.031	0.045	0.672	0.957	0.933	0.998
	SWOLCA	0.00	0.006	0.006	0.044	0.036	0.049	0.414	0.977	0.952	0.990
(3) Strat Cl, Cond, n=4000	SOLCA	0.00	0.082	0.006	0.132	0.074	0.046	0.390	0.223	0.966	0.592
	WOLCA	0.00	0.006	0.006	0.127	0.037	0.044	1.210	0.963	0.942	0.990
	SWOLCA	0.00	0.006	0.006	0.126	0.031	0.047	0.816	0.950	0.942	0.963
(4) Strat, Marg, n=4000	SOLCA	0.00	0.008	0.006	0.203	0.062	0.043	0.162	0.963	0.958	0.063
	WOLCA	0.00	0.016	0.007	0.031	0.107	0.049	0.348	0.947	0.939	0.993
	SWOLCA	0.00	0.011	0.007	0.033	0.097	0.063	0.278	0.967	0.965	0.987
(5) Strat, Cond, n=8000	SOLCA	0.00	0.080	0.005	0.049	0.076	0.042	0.367	0.227	0.972	0.980
	WOLCA	0.06	0.010	0.011	0.038	0.044	0.044	0.519	0.920	0.908	0.960
	SWOLCA	0.00	0.004	0.005	0.030	0.029	0.038	0.373	0.967	0.953	0.997
(6) Strat, Cond, n=800	SOLCA	0.00	0.084	0.013	0.098	0.064	0.088	0.701	0.027	0.938	0.945
	WOLCA	0.00	0.013	0.014	0.099	0.060	0.095	1.371	0.933	0.919	0.983
	SWOLCA	0.00	0.013	0.014	0.097	0.062	0.099	0.687	0.947	0.922	0.947

For all models and scenarios, investigation of traceplots and autocorrelation plots showed good mixing and convergence of all model parameters. Table 1 displays a summary of simulation results for the scenarios described. As expected, under the control SRS scenario, all three models exhibit good estimation and coverage properties. For other scenarios with a variety of complex survey design and data-generating features, the proposed SWOLCA outperforms the two alternative models and is able to obtain accurate and precise estimation, as well as approximately nominal coverage, for all parameters.

The unweighted SOLCA model gives highly biased estimates of the pattern membership probabilities $\bm{\pi}$ when there is stratified sampling. When there is cluster sampling, credible intervals for the regression coefficients $\bm{\xi}$ exhibit severe undercoverage, which can result in overconfident estimation of associational effects. When the selection is associated with hypertension and a marginal model is fit, SOLCA yields biased estimates for $\bm{\xi}$ .

The two-step WOLCA model produces estimates of regression parameters $\bm{\xi}$ that have wide credible intervals and are less precise than the SWOLCA model at similar coverage levels. This inefficiency is especially true for small sample sizes and cluster sampling designs because the two-step process ignores uncertainty in the first step. This inflates interval widths and makes inference on the true associational effects difficult. WOLCA is also the most prone to undercoverage of $\theta$ due to failure to account for variability in the plug-in survey weights in the first step, and it runs into issues with estimating the number of dietary patterns, $K$ .

SWOLCA yields estimates with minimal bias and approximately nominal interval coverage for all parameters for stratified and cluster sampling designs. It is also able to use the survey weights to account for bias from selection variables that are unavailable for analysis, enabling correct marginal estimation of $\xi$ and producing outcome probability estimates that accommodate informative designs without greatly inflating uncertainty (Web Figure 2). In the cluster sampling and 1% sample size scenarios, there is slight undercoverage of $\bm{\theta}$ . This is expected given the increased variability of the data and is also seen in the SOLCA and WOLCA comparison models. These conclusions were consistent in settings with weaker patterns (mode 55%), overlap** patterns where consumption of many foods is the same for two patterns, different sample sizes, and additional regression covariates.

5 Application to NHANES Low-Income Women

5.1 Data Description and Model Setup

The National Health and Nutrition Examination Survey (NHANES) is a cross-sectional, nationally-representative survey that assesses the health and nutritional status of the non-institutionalized civilian US population (National Center for Health Statistics, 2023). The survey employs a stratified, clustered, four-stage sampling design with oversampling to increase inclusion of various age, sex, income, and racial and ethnic groups. Data are publicly available alongside survey sampling weights that take into account unequal sampling probabilities, stratification, clustering, non-response, weight trimming, and calibration (Chen et al., 2020). Data are pooled from two survey cycles, 2015-2016 and 2017-2018, in accordance with protocols outlined in the NHANES analytic guidelines (National Center for Health Statistics, 2018). We focus on dietary patterns associated with hypertension among adult females aged 20 or over who are classified as low-income (reported household income at or below 185% of the federal poverty level, consistent with eligibility requirements for federal assistance program participation (Oliveira and Frazão, 2015)). Pregnant or breastfeeding women are excluded ( $n=179$ ), resulting in a total sample size of $n=2003$ .

Dietary exposure variables are defined as 28 food item groups collected from two 24-hour dietary recalls and summarized into food pattern equivalents from the Food and Nutrition Database for Dietary Studies (Dietary Guidelines Advisory Committee, 2015; Bowman et al., 2020). Each food item is categorized as none, low, medium, or high, based on relative tertiles of positive consumption (Sotres-Alvarez et al., 2013; Stephenson and Willett, 2023). The binary observed outcome, hypertension, is defined as a composite measure of having an elevated blood pressure (BP) reading (systolic BP $>130$ or diastolic BP $>80$ ), self-reported diagnosis, or use of hypertension-controlling medication. Age, race and ethnicity, current smoking status, and physical activity are included as potential confounders in our hypertensive outcome regression model. Web Table 3 displays summaries of these demographic characteristics by hypertension in the sample.

We compare the proposed SWOLCA model and the unweighted SOLCA model in assessing diet-driven hypertension using survey data. Both models are initialized with the same priors used in the simulation study. Full details of posterior computation are provided in the Supplementary Materials. Estimation is obtained by fitting a Gibbs sampler of 20,000 iterations with 10,000 burn-in and thinning every 5 iterations, then summarized using posterior median estimates and 95% credible intervals. Both adaptive and fixed sampler are run for 20,0000 iterations each using an Apple M1 Pro computer with 8 cores, with a computation time of roughly 75 minutes. If $\widehat{K}$ is set a priori and only the fixed sampler is run, the computation time is approximately 15 minutes, assuming the same computing power.

5.2 Dietary Pattern Results

Both SWOLCA and SOLCA identify $\widehat{K}=5$ diet-hypertension patterns among low-income women in the US, displayed in Figure 1 with characterization differences indicated by black dots. For all patterns, consumption behavior is colored by the modal (i.e., highest posterior probability) consumption level (none, low, medium, or high) for each of the 28 food items. We can see differences in modal food consumption for all patterns except Pattern 2 (Healthy American), illustrating the influence of survey weights on pattern composition.

We continue by focusing on the patterns identified by SWOLCA (Figure 1(a)). We characterize the five diet-hypertension patterns as follows: 1) Multicultural, 2) Healthy American, 3) Western, 4) Restricted Vegetarian, and 5) Restricted American. The Multicultural pattern favors high consumption of organ meat, starchy and dark green vegetables, oils, and fruits. It is referred to as the Multicultural pattern given its large prevalence among those identifying as NH Asian and groups other than NH White (Table 2). The Healthy American pattern favors a higher consumption of healthier foods such as fruits, vegetables, whole grains, organ meat, and nuts, but still includes high consumption of oils, fats, and sugars prevalent in many American diets. The Western pattern favors a high consumption of refined grains, poultry, cheese, oils, solid fats, added sugars, and other vegetables. Examining Figure 2, which provides detailed consumption level probabilities by pattern for each food item, we see that individuals assigned to the Western pattern are also the most likely to consume meat and cured meats, though this consumption is heterogeneous. The Restricted Vegetarian pattern favors no consumption of many foods including meat and seafood. There is moderate consumption of refined grains and low consumption of whole grains, poultry, eggs, milk, oils, solid fats, and added sugar. This population may face significant food access issues such as residence in or near a food desert or food swamp. Finally, the Restricted American pattern favors low consumption of many foods but to a lesser extent than the Restricted Vegetarian diet and with some intake of organ meats and poultry. This diet is similar to the Healthy American diet but with relatively lower consumption, especially for legumes and vegetables.

Examining the size and distribution of the diet-hypertension patterns across demographic variables (Table 2), we see that the Western diet is most prevalent (26.7% of the population) and the Multicultural diet is least prevalent (10.7%). Those who follow the Multicultural diet tend to be younger, NH Asian, not a current smoker, and physically active. Conversely, those who follow the Restricted American diet tend to be older, NH White, and a current smoker. Those who follow the Restricted Vegetarian diet also tend to be older, current smokers, and physically inactive.

Table 2: Size and demographic distribution of dietary patterns for low-income women in the US. Estimates for

N

use posterior samples of parameter

\pi

. Column-wise mean and percentage estimates of demographic variables are calculated using sampling weights.

Variable	Level	Multicultural	Healthy	Western	Restrict Veg	Restrict	Overall
$N$ : % (posterior SD %)		10.7 (4.2)	24.4 (3.8)	26.7 (3.9)	19.9 (4.3)	18.3 (5.4)
$n$ : %		11.5	23.6	25.8	20.7	18.4
Age Group: %	[20,40)	51.7	47.1	44.8	33.2	35.3	41.9
	[40,60)	28.6	29.1	33.4	37.1	33.1	32.6
	$\geq 60$	19.6	23.9	21.9	29.7	31.6	25.5
Race and Ethnicity: %	NH White	36.9	50.1	49.0	49.7	55.0	49.3
	NH Black	18.2	14.7	17.7	15.6	16.3	16.3
	NH Asian	20.2	2.2	3.4	7.1	3.4	5.6
	Hispanic/Latino	23.6	27.2	23.8	20.4	22.2	23.6
	Other/Mixed	1.2	5.8	6.1	7.2	3.1	5.2
Smoking Status: %	Non-Smoker	85.3	72.7	75.3	69.9	69.3	73.5
	Smoker	14.7	27.3	24.7	30.1	30.7	26.5
Physical Activity: %	Inactive	41.0	47.0	45.5	47.3	41.7	45.1
	Active	59.0	53.0	54.5	52.7	58.3	54.9

5.3 Dietary Patterns and Hypertension Risk

Table 3 displays the posterior $\bm{\xi}$ estimates for the main dietary pattern effects on hypertension for SWOLCA and SOLCA. The full tables of regression estimates are provided in the Supplementary Materials. Estimates for the unweighted SOLCA differ greatly from those of SWOLCA, indicating presence of selection variables that influence the outcome but were not included as covariates. This results in inaccurate contributions of individuals to the estimation of hypertension probabilities because sampling weights have not been considered in the regression estimation process. The unweighted model produces fewer interaction effects and reduced ability to distinguish effects between the patterns. Credible intervals are also much tighter when the survey design is not incorporated. This leads to incorrect inference, as illustrated by the undercoverage shown in the simulation study.

Table 3: Main effect probit regression parameter estimates for the proposed SWOLCA and the unweighted SOLCA, adjusting for demographic confounders. Reference group: Multicultural diet, age [20,40), NH White, non-smoker, inactive.

Covariate	Estimate	95% CI	P( $\xi>0$ )	Estimate	95% CI	P( $\xi>0$ )
	SWOLCA			SOLCA
(Intercept)	-1.66	(-2.92, -0.41)	$<$ 0.01	-1.01	(-1.61, -0.41)	$<$ 0.01
Multicultural	1 (reference)	-	-	-	-	-
Healthy Amer	0.45	(-1.12, 1.96)	0.70	0.05	(-0.68, 0.81)	0.56
Western	0.49	(-0.86, 1.83)	0.75	0.02	(-0.71, 0.70)	0.53
Restricted Veg	0.67	(-0.88, 2.17)	0.79	-0.11	(-0.89, 0.64)	0.39
Restricted Amer	1.01	(-0.21, 2.15)	0.96	0.14	(-0.65, 0.91)	0.63

Focusing on the SWOLCA results, all diets appear to be associated with increased probability of hypertension compared to the Multicultural diet, with the Restricted diets showing the strongest increase (Table 3). Figure 3 displays hypertension probabilities of the diet-hypertension patterns as well as interactions effects between the patterns and socio-demographic variables. The estimated hypertension probabilities for the patterns were: 4.8% (Multicultural), 11.1% (Healthy American), 12.2% (Western), 15.9% (Restricted Vegetarian), and 25.2% (Restricted American), among those aged 20 to 39, identifying as NH White, not currently smoking, and physically inactive. Age has a strong, positive association with elevated probability of hypertension, and the differentials among patterns is most pronounced in the 40 to 60 age group, with the Multicultural diet outcome probability remaining low (13%) in the 40 to 60 age group, whereas the Restricted American diet sees a large increase in the 40 to 60 age group (74%). Among different racial and ethnic groups, probability of hypertension is higher among those identifying as NH Black compared to NH White for all patterns, and the Restricted American pattern remains at high risk across all groups. The Healthy American pattern sees the largest increase in outcome probability among smokers, and the Restricted American pattern sees the largest decrease among those who are active.

6 Discussion

In this work, we develop the supervised weighted overfitted latent class analysis (SWOLCA), which is a Bayesian joint mixture model that can be used to: 1) elicit dietary patterns informed by both a set of categorical dietary intake exposures and a binary hypertension outcome, with a data-driven approach to determine the number of patterns; 2) capture small diet-hypertension effects and interactions with covariates via mixture reference coding; and 3) obtain unbiased estimation and inference for the population by adjusting for complex survey design features such as stratification, clustering, and informative sampling. Although our method is designed for categorical exposures and a binary outcome, extensions to other outcome data types can be made by replacing the probit likelihood with a regression likelihood that can accommodate different outcome data types (e.g., multinomial, ordinal, continuous). This remains an area of active research.

Simulation studies confirmed that SWOLCA improved accuracy, precision, and coverage of parameter estimation compared to models that did not include sampling weights or relied on the two-step approach. Incorporation of survey design features was important for accurate estimation of the pattern prevalence, pattern identification, exposure-outcome association, and variance. Incorporation of the outcome into clustering improved precision of regression estimates and led to better identification of patterns. Implementation of SWOLCA to NHANES 2015-2018 data identified five diet-hypertension patterns among low-income US women. Differences between the SWOLCA and SOLCA results illustrated the importance of accounting for survey design. Failure to include survey sampling weights changed the pattern compositions because individual contributions to the consumption level probabilities during pattern formation reflected the sample composition and were not necessarily representative of the population. Our model identified strong age effects and captured substantial heterogeneity among different racial and ethnic subgroups via interaction terms.

Our work suggests several areas for further improvement. Firstly, our model does not adjust for data reliability typically encountered in dietary data, such as measurement error, recall bias, and item non-response missingness. Secondly, diet consumption heterogeneity may be better captured by adapting methods that allow demographic or behavior driven deviations of foods from the overall diet-disease pattern, and additionally incorporating survey design elements. Thirdly, our model is based on cross-sectional data and is not able to evaluate the impact of exposure changes over time or a time to event analysis. Lastly, our model relies on a probit regression component that can be limited by computational stability. Computation may be improved by incorporating hierarchical priors or by exploring other distributions, such as a unified skew normal conjugate model (Anceschi et al., 2023). We leave these opportunities for extensions to future research.

Acknowledgements

The authors are grateful to Walter Willett for helpful comments on earlier versions of this work, and to the Co-Editor, Associate Editor, and two referees for insightful comments that greatly improved the paper. This research was supported by the National Institute of Allergy and Infectious Diseases (NIAID: T32 AI007358) and the National Heart, Lung, and Blood Institute (NHLBI: R25 HL105400 awarded to Victor G. Davila-Roman and DC Rao).

Supplementary Materials

Web Appendices, Tables, and Figures, and data and code referenced in Sections 2, 4, and 5 are available with this paper at the Biometrics website on Oxford Academic. Code for replicating the simulations and data analyses in this paper is also available on GitHub at https://github.com/smwu/SWOLCA and is currently being developed into an R package.

Data Availability Statement

Data used in this paper to illustrate our findings are publicly available at https://github.com/smwu/SWOLCA/ and were derived from the following resources available in the public domain: https://wwwn.cdc.gov/nchs/nhanes/.

References

Albert and Chib (1993) Albert, J. H. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association 88, 669–679.
Anceschi et al. (2023) Anceschi, N., Fasano, A., Durante, D., and Zanella, G. (2023). Bayesian conjugacy in probit, tobit, multinomial probit and extensions: A review and new results. Journal of the American Statistical Association 118, 1451–1469.
Asparouhov (2005) Asparouhov, T. (2005). Sampling weights in latent variable modeling. Structural Equation Modeling 12, 411–434.
Bowman et al. (2020) Bowman, S., Clemens, J., Friday, J., and Moshfegh, A. (2020). Food patterns equivalents database 2017–2018: methodology and user guide. Food Surveys Research Group: Beltsville, MD .
Bray et al. (2015) Bray, B. C., Lanza, S. T., and Tan, X. (2015). Eliminating bias in classify-analyze approaches for latent class analysis. Structural Equation Modeling 22, 1–11.
Buis (2012) Buis, M. L. (2012). Stata tip 106: With or without reference. The Stata Journal 12, 162–164.
Carpenter et al. (2017) Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., et al. (2017). Stan: A probabilistic programming language. Journal of Statistical Software 76, 1.
Chen et al. (2020) Chen, T. C., Clark, J., Riddles, M. K., Mohadjer, L. K., and Fakhouri, T. H. (2020). National health and nutrition examination survey, 2015-2018: Sample design and estimation procedures. Vital and Health Statistics. Series 2, Data Evaluation and Methods Research pages 1–35.
Dietary Guidelines Advisory Committee (2015) Dietary Guidelines Advisory Committee (2015). Dietary Guidelines for Americans 2015-2020. Government Printing Office.
Eddelbuettel and François (2011) Eddelbuettel, D. and François, R. (2011). Rcpp: Seamless R and C++ integration. Journal of Statistical Software 40, 1–18.
Elliott et al. (2020) Elliott, M. R., Zhao, Z., Mukherjee, B., Kanaya, A., and Needham, B. L. (2020). Methods to account for uncertainty in latent class assignments when using latent classes as predictors in regression models, with application to acculturation strategy measures. Epidemiology 31, 194–204.
Frühwirth-Schnatter (2001) Frühwirth-Schnatter, S. (2001). Markov chain monte carlo estimation of classical and dynamic switching and mixture models. Journal of the American Statistical Association 96, 194–209.
Fung et al. (2001) Fung, T. T., Willett, W. C., Stampfer, M. J., Manson, J. E., and Hu, F. B. (2001). Dietary patterns and the risk of coronary heart disease in women. Archives of Internal Medicine 161, 1857–1862.
Gunawan et al. (2020) Gunawan, D., Panagiotelis, A., Griffiths, W., and Chotikapanich, D. (2020). Bayesian weighted inference from surveys. Australian & New Zealand Journal of Statistics 62, 71–94.
Krebs (1999) Krebs, C. J. (1999). Ecological Methodology. Benjamin/Cummings, 2nd edition.
Kunihama et al. (2016) Kunihama, T., Herring, A., Halpern, C., and Dunson, D. (2016). Nonparametric bayes modeling with sample survey weights. Statistics & Probability Letters 113, 41–48.
Lazarsfeld and Henry (1968) Lazarsfeld, P. F. and Henry, N. (1968). Latent Structure Analysis. Houghton, Mifflin.
León-Novelo and Savitsky (2019) León-Novelo, L. G. and Savitsky, T. D. (2019). Fully bayesian estimation under informative sampling. Electronic Journal of Statistics 13, 1608–1645.
Lumley (2004) Lumley, T. (2004). Analysis of complex survey samples. Journal of Statistical Software 9, 1–19. R package version 2.2.
Medvedovic and Sivaganesan (2002) Medvedovic, M. and Sivaganesan, S. (2002). Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics 18, 1194–1206.
Molitor et al. (2010) Molitor, J., Papathomas, M., Jerrett, M., and Richardson, S. (2010). Bayesian profile regression with an application to the national survey of children’s health. Biostatistics 11, 484–498.
Moran et al. (2021) Moran, K. R., Dunson, D., Wheeler, M. W., and Herring, A. H. (2021). Bayesian joint modeling of chemical structure and dose response curves. The Annals of Applied Statistics 15, 1405–1430.
National Center for Health Statistics (2018) National Center for Health Statistics (2018). National health and nutrition examination survey: Analytic guidelines, 2011–2014 and 2015–2016. Technical report, Centers for Disease Control and Prevention.
National Center for Health Statistics (2023) National Center for Health Statistics (2023). National health and nutrition examination survey home page.
Oliveira and Frazão (2015) Oliveira, V. and Frazão, E. (2015). The wic program: Background, trends, and economic issues, 2015 edition. economic information bulletin number 134. Technical report, US Department of Agriculture, Economic Research Service.
Parker et al. (2022) Parker, P. A., Holan, S. H., and Janicki, R. (2022). Computationally efficient bayesian unit-level models for non-gaussian data under informative sampling with application to estimation of health insurance coverage. The Annals of Applied Statistics 16, 887–904.
Patterson et al. (2002) Patterson, B. H., Dayton, C. M., and Graubard, B. I. (2002). Latent class analysis of complex sample survey data: application to dietary data. Journal of the American Statistical Association 97, 721–741.
Pfeffermann (1996) Pfeffermann, D. (1996). The use of sampling weights for survey data analysis. Statistical Methods in Medical Research 5, 239–261.
R Core Team (2023) R Core Team (2023). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Sacks et al. (2001) Sacks, F. M., Svetkey, L. P., Vollmer, W. M., Appel, L. J., Bray, G. A., Harsha, D., et al. (2001). Effects on blood pressure of reduced dietary sodium and the dietary approaches to stop hypertension (dash) diet. New England journal of medicine 344, 3–10.
Savitsky and Toth (2016) Savitsky, T. D. and Toth, D. (2016). Bayesian estimation under informative sampling. Electronic Journal of Statistics 10, 1677–1708.
Sotres-Alvarez et al. (2010) Sotres-Alvarez, D., Herring, A. H., and Siega-Riz, A. M. (2010). Latent class analysis is useful to classify pregnant women into dietary patterns. The Journal of Nutrition 140, 2253–2259.
Sotres-Alvarez et al. (2013) Sotres-Alvarez, D., Siega-Riz, A. M., Herring, A. H., Carmichael, S. L., Feldkamp, M. L., Hobbs, C. A., et al. (2013). Maternal dietary patterns are associated with risk of neural tube and congenital heart defects. American Journal of Epidemiology page kws349.
Spencer (2000) Spencer, B. D. (2000). An approximate design effect for unequal weighting when measurements may correlate with selection probabilities. Survey Methodology 26, 137–138.
Stephens (2000) Stephens, M. (2000). Dealing with label switching in mixture models. Journal of the Royal Statistical Society Series B: Statistical Methodology 62, 795–809.
Stephenson et al. (2022) Stephenson, B. J., Herring, A. H., and Olshan, A. F. (2022). Derivation of maternal dietary patterns accounting for regional heterogeneity. Journal of the Royal Statistical Society Series C: Applied Statistics 71, 1957–1977.
Stephenson et al. (2024) Stephenson, B. J., Wu, S. M., and Dominici, F. (2024). Identifying dietary consumption patterns from survey data: a bayesian nonparametric latent class model. Journal of the Royal Statistical Society Series A: Statistics in Society 187, 496–512.
Stephenson and Willett (2023) Stephenson, B. J. K. and Willett, W. C. (2023). Racial and ethnic heterogeneity in diets of low-income adult females in the united states: results from national health and nutrition examination surveys from 2011 to 2018. The American Journal of Clinical Nutrition 117, 625–634.
Van Havre et al. (2015) Van Havre, Z., White, N., Rousseau, J., and Mengersen, K. (2015). Overfitting bayesian mixture models with an unknown number of components. PloS One 10,.
Whelton et al. (2018) Whelton, P. K., Carey, R. M., Aronow, W. S., Casey, D. E., Collins, K. J., Dennison Himmelfarb, C., et al. (2018). 2017 acc/aha/aapa/abc/acpm/ags/apha/ash/aspc/nma/pcna guideline for the prevention, detection, evaluation, and management of high blood pressure in adults: a report of the american college of cardiology/american heart association task force on clinical practice guidelines. Journal of the American College of Cardiology 71, e127–e248.
Williams and Savitsky (2020) Williams, M. R. and Savitsky, T. D. (2020). Bayesian estimation under informative sampling with unattenuated dependence. Bayesian Analysis 15, 57–77.
Williams and Savitsky (2021) Williams, M. R. and Savitsky, T. D. (2021). Uncertainty estimation for pseudo-bayesian inference under complex sampling. International Statistical Review 89, 72–107.
Zhang et al. (2018) Zhang, F. F., Liu, J., Rehm, C. D., Wilde, P., Mande, J. R., and Mozaffarian, D. (2018). Trends and disparities in diet quality among us adults by supplemental nutrition assistance program participation status. JAMA Network Open 1, e180237–e180237.