Derivation of outcome-dependent dietary patterns for low-income women obtained from survey data using a Supervised Weighted Overfitted Latent Class Analysis

Stephanie M. Wu 1,∗, Matthew R. Williams 2,∗∗, Terrance D. Savitsky 3,∗∗∗,
Briana J.K. Stephenson 1,∗∗∗∗

1Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, U.S.A
2RTI International, Research Triangle Park, North Carolina, U.S.A
3Office of Survey Methods Research, U.S. Bureau of Labor Statistics, Washington, DC, U.S.A
*email: [email protected]
Abstract

Poor diet quality is a key modifiable risk factor for hypertension and disproportionately impacts low-income women. Analyzing diet-driven hypertensive outcomes in this demographic is challenging due to the complexity of dietary data and selection bias when the data come from surveys, a main data source for understanding diet-disease relationships in understudied populations. Supervised Bayesian model-based clustering methods summarize dietary data into latent patterns that holistically capture relationships among foods and a known health outcome but do not sufficiently account for complex survey design. This leads to biased estimation and inference and lack of generalizability of the patterns. To address this, we propose a supervised weighted overfitted latent class analysis (SWOLCA) based on a Bayesian pseudo-likelihood approach that integrates sampling weights into an exposure-outcome model for discrete data. Our model adjusts for stratification, clustering, and informative sampling, and handles modifying effects via interaction terms within a Markov chain Monte Carlo Gibbs sampling algorithm. Simulation studies confirm that the SWOLCA model exhibits good performance in terms of bias, precision, and coverage. Using data from the National Health and Nutrition Examination Survey (2015-2018), we demonstrate the utility of our model by characterizing dietary patterns associated with hypertensive outcomes among low-income women in the United States.

Keywords— Bayesian clustering; Dietary pattern analysis; Latent class analysis; NHANES; Survey design

1 Introduction

Low-income women are understudied in cardiometabolic research despite being disproportionately burdened by poor diet quality and its negative health impacts (Zhang et al., 2018). Hypertension, a pervasive and major risk factor for cardiovascular disease, illustrates this gap (Whelton et al., 2018). While considerable research explores diet-hypertension links, few cohort studies focus on low-income women. Therefore, analyzing diet-hypertension patterns in this key demographic requires greater reliance on data from surveys, which allow targeted inclusion of hard-to-reach communities through techniques such as oversampling and stratification. When using survey data, analyses must properly account for complex survey design elements to avoid biased estimation and variance underestimation and to generalize beyond the sample (Pfeffermann, 1996; Parker et al., 2022; Williams and Savitsky, 2021).

Dietary scores, such as the Dietary Approaches to Stop Hypertension (DASH) score (Sacks et al., 2001), have been used to evaluate intake of key food groups according to prescriptive guidance. These metrics are standardized across populations but can lack flexibility in grouping foods in ways more reflective of population-specific dietary behavior. Alternatively, latent class analysis (LCA) is a clustering method that achieves this level of flexibility through data-driven derivation of underlying dietary consumption patterns (Lazarsfeld and Henry, 1968; Sotres-Alvarez et al., 2010). This enables additional insight into the behaviors of targeted populations and the creation of policy tailored to their dietary needs. Exploration of diet-disease relationships using LCA typically entails a two- or three-step approach. First, LCA is used to identify patterns; then, an association is measured via regression analysis using the LCA-derived pattern as a covariate, with possible bias adjustments to account for measurement error (Fung et al., 2001; Bray et al., 2015). This is useful when testing a single exposure across many outcomes; however, when interest lies in obtaining a targeted understanding of how dietary patterns influence a specific health outcome, such as hypertension, a one-step supervised approach offers advantages in identifying outcome-informed patterns and smaller diet-outcome effects while correctly propagating classification uncertainty (Stephenson et al., 2022; Molitor et al., 2010; Elliott et al., 2020).

Extensions of LCA-based approaches to account for survey design have been met with challenges. Under a frequentist setting, high-dimensionality and sparseness of diet data lead to issues with parameter stability and matrix inversion (Asparouhov, 2005; Patterson et al., 2002). Under a Bayesian setting, models lack proper variance estimation (Stephenson and Willett, 2023; Stephenson et al., 2024), inhibit classification (Gunawan et al., 2020), or ignore the survey design entirely. Without proper incorporation of design features such as informative sampling and clustering, dietary patterns can be misidentified, and characterization of the diet-hypertension relationship can be biased with incorrect posterior intervals.

This paper aims to improve analysis of diet-hypertension patterns in low-income women using survey data. We propose a supervised weighted overfitted latent class analysis (SWOLCA) that uses a Bayesian pseudo-likelihood approach to account for complex survey design and produce accurate estimation and uncertainty quantification for a multivariate categorical exposure and a binary outcome. We also introduce a mixture reference coding scheme to allow interactions between dietary patterns and other covariates. Our model enables us to: 1) uncover the prevalence and profile of dietary patterns dependent on hypertensive status amongst low-income adult women; 2) efficiently measure the association between diet and hypertension while accounting for interactions with covariates; and 3) integrate survey sampling weights to produce accurate point and interval estimation for our target population.

The remaining sections of this paper are organized as follows. In Section 2, we describe our proposed SWOLCA model along with a brief background. In Section 3, we discuss implementation considerations for parameter estimation. In Section 4, we conduct a simulation study comparing SWOLCA with existing methods. In Section 5, we apply the model to data from the National Health and Nutrition Examination Survey (NHANES) to describe dietary pattern association with hypertension among low-income women in the US. Finally, in Section 6, we provide concluding remarks and discussion.

2 Model

2.1 Supervised Overfitted Latent Class Analysis

Supervised overfitted latent class analysis (SOLCA) is a Bayesian nonparametric mixture model that jointly estimates latent dietary patterns through an overfitted latent class model and their associations to a binary hypertension outcome through a probit regression model. In this way, the latent patterns are informed by both the multivariate categorical diet exposure, $\bm{x}_{i\cdot}=(x_{i1},\ldots,x_{iJ})^{T}$ for $J$ food items, and the binary hypertension outcome, $y_{i}$, which can increase precision and reduce bias. Each sampled individual $i\in\{1,\ldots,n\}$ is assigned to a dietary pattern $c_{i}\in\{1,\ldots,K\}$, where $K$ is the number of patterns.
Model parameters include: the dietary pattern prevalences, $\bm{\pi}=(\pi_{1},\ldots,\pi_{K})^{T}$, where $\sum_{k=1}^{K}\pi_{k}=1$; the food item consumption probabilities characterizing the pattern compositions, $(\theta_{jc_{i}1},\ldots,\theta_{jc_{i}R_{j}})$, where $\sum_{r=1}^{R_{j}}\theta_{jc_{i}r}=1$ for food item $j$ with $R_{j}$ consumption levels; and the probit regression coefficients, $\bm{\xi}_{c_{i}\cdot}=(\xi_{c_{i}1},\ldots,\xi_{c_{i}q})^{T}$, corresponding to $q$ regression covariates, $\bm{v}_{i}$, given assignment to dietary pattern $c_{i}$, which is an unknown latent random variable simultaneously determined by the model. We use the formulation of the probit regression model introduced by Albert and Chib (1993), where $z_{i}$ is a latent Gaussian variable such that $z_{i}\sim N(\bm{v}_{i}^{T}\bm{\xi}_{c_{i}\cdot},1)$, and is truncated depending on binary outcome $y_{i}$ so that $z_{i}>0$ when $y_{i}=1$ and $z_{i}\leq 0$ otherwise. Using this formulation, the SOLCA complete data joint distribution of $(\bm{x}_{i\cdot},c_{i},y_{i},z_{i})$ is:

\begin{equation}
\begin{aligned}
p(\bm{x}_{i\cdot},c_{i},y_{i},z_{i}\,|\,\bm{v}_{i},\bm{\pi},\bm{\theta},\bm{\xi})
&=p(c_{i}|\bm{\pi})\,p(\bm{x}_{i\cdot}|c_{i},\bm{\theta})\,p(y_{i},z_{i}|c_{i},\bm{v}_{i},\bm{\xi})\\
&=\pi_{c_{i}}\prod_{j=1}^{J}\prod_{r=1}^{R_{j}}\theta_{jc_{i}r}^{I(x_{ij}=r)}\,\frac{e^{-\frac{1}{2}(z_{i}-\bm{v}_{i}^{T}\bm{\xi}_{c_{i}\cdot})^{2}}}{\sqrt{2\pi}}\Big\{y_{i}I(z_{i}>0)+(1-y_{i})I(z_{i}\leq 0)\Big\},
\end{aligned}
\tag{1}
\end{equation}

where $\bm{\xi}=(\bm{\xi}_{1\cdot}^{T},\ldots,\bm{\xi}_{K\cdot}^{T})^{T}$ is a $K\times q$ matrix of regression coefficients, $I(A)$ is the indicator function equal to 1 if $A$ is true and 0 otherwise, and $\bm{\theta}$ is a $J\times K\times R$ array with cells $\theta_{jkr}$, where $R=\max_{j}R_{j}$. SOLCA assumes items are independent conditional on dietary pattern assignment and individuals with the same pattern share behaviors for all food items.
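
As an illustration of equation (1), the following minimal Python sketch evaluates the log of the complete-data joint density for a single individual. The function name `solca_log_joint`, the nested-list layout `theta[j][k][r]`, the zero-based coding of levels and patterns, and the toy parameter values are all assumptions made for this example; they are not taken from the authors' implementation.

```python
import math

def solca_log_joint(x, c, y, z, v, pi, theta, xi):
    """Log of equation (1) for one individual (illustrative sketch).

    x     : list of J observed levels, coded 0..R_j-1 (zero-based, unlike the text)
    c     : pattern index, coded 0..K-1
    y, z  : binary outcome and latent Gaussian variable
    v     : list of q covariate values
    pi    : list of K pattern prevalences
    theta : theta[j][k][r], level probabilities for item j under pattern k
    xi    : xi[k], list of q probit coefficients for pattern k
    """
    # Truncation indicator: z > 0 when y = 1, z <= 0 when y = 0.
    consistent = (z > 0) if y == 1 else (z <= 0)
    if not consistent:
        return -math.inf
    log_p = math.log(pi[c])                      # p(c | pi)
    for j, level in enumerate(x):                # p(x | c, theta)
        log_p += math.log(theta[j][c][level])
    mean = sum(vq * xq for vq, xq in zip(v, xi[c]))
    log_p += -0.5 * (z - mean) ** 2 - 0.5 * math.log(2.0 * math.pi)
    return log_p

# Hypothetical toy values: K = 2 patterns, J = 2 items, R = 2 levels, q = 1.
pi = [0.6, 0.4]
theta = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.5, 0.5]]]
xi = [[0.5], [-0.5]]
value = solca_log_joint([0, 1], 0, 1, 0.5, [1.0], pi, theta, xi)
```

Working on the log scale avoids underflow when $J$ is large; the sketch assumes all probabilities involved are strictly positive.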

The overfitted formulation of SOLCA enables a data-driven approach to select the number of patterns, $K$, without need for post-hoc testing (Van Havre et al., 2015). $K$ is set to a conservatively high number to allow empty patterns to drop out via a sparsity-inducing Dirichlet prior: $(\pi_{1},\ldots,\pi_{K})\sim\text{Dir}(\alpha_{1},\ldots,\alpha_{K})$, where hyperparameters $\alpha_{k}$ moderate the rate of growth for nonempty patterns and reduce the dependence of $K$ on the sample size and data structure. Smaller values of $\alpha_{k}$ yield a slower growth rate and more sparsity.
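
The effect of the Dirichlet hyperparameters on sparsity can be explored with a small sketch that draws prevalence vectors under a small and a large common $\alpha_{k}$. The helper names, the seed, the choice $K=30$, and the 0.05 threshold for a "non-negligible" pattern are assumptions made purely for illustration.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from Dir(alphas) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def count_nonnegligible(pi, threshold=0.05):
    """Number of patterns whose prevalence exceeds a (hypothetical) threshold."""
    return sum(1 for p in pi if p > threshold)

rng = random.Random(2024)                        # fixed seed for reproducibility
K = 30
sparse = sample_dirichlet([1.0 / K] * K, rng)    # small alpha_k: mass on few patterns
flat = sample_dirichlet([1.0] * K, rng)          # alpha_k = 1: mass spread more evenly

print(count_nonnegligible(sparse), count_nonnegligible(flat))
```

Repeating the draws many times gives a sense of how strongly a small $\alpha_{k}$ concentrates prior mass on a handful of patterns before any data are observed.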

2.2 Supervised Weighted Overfitted Latent Class Analysis for Survey Data

Supervised weighted overfitted latent class analysis (SWOLCA) is an extension of SOLCA to the survey setting. To obtain unbiased estimation of our target population, survey sampling weights for all sampled individuals are necessary. Stratification and clustering information of the survey are also needed for accurate variance estimation. We follow a weighted pseudo-likelihood approach as described in Savitsky and Toth (2016) and Kunihama et al. (2016). Survey weights are used to up-weight individual likelihood contributions proportional to the number of individuals represented in the target population. This forms a weighted pseudo-likelihood that is used in place of the likelihood in the posterior update. Estimation and inference proceed using the posterior density of model parameters. Let $w_{i}$ denote the survey weight of individual $i$, $i\in\{1,\ldots,n\}$. We use a normalization constant $\kappa=\sum_{i=1}^{n}w_{i}/n$ so that the normalized weights $w_{i}/\kappa$ sum to $n$ to reflect sampling variability. This provides a coarse adjustment for the posterior uncertainty that can be further refined in a post-processing step described below. Denote all parameters and complete data of the unweighted SOLCA model with $\bm{\Theta}$ and $\bm{D}$, respectively. Then, the posterior density for the weighted SWOLCA approach is

\begin{equation}
\widetilde{p}(\bm{\Theta}|\bm{D})\propto p(\bm{\Theta})\prod_{i=1}^{n}p(\bm{D}_{i}|\bm{\Theta})^{\frac{w_{i}}{\kappa}}.
\tag{2}
\end{equation}
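
On the log scale, the pseudo-likelihood in equation (2) is a weighted sum of individual log-likelihood contributions. The sketch below shows the weight normalization and the weighted sum; the function names and the four survey weights are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def normalize_weights(weights):
    """Return (kappa, w_i / kappa) with kappa = sum(w) / n, so the result sums to n."""
    n = len(weights)
    kappa = sum(weights) / n
    return kappa, [w / kappa for w in weights]

def pseudo_log_likelihood(log_liks, weights):
    """Weighted pseudo-log-likelihood: sum_i (w_i / kappa) * log p(D_i | Theta)."""
    _, w_norm = normalize_weights(weights)
    return sum(w * ll for w, ll in zip(w_norm, log_liks))

weights = [100.0, 300.0, 200.0, 400.0]           # hypothetical survey weights
kappa, w_norm = normalize_weights(weights)       # kappa = 250; w_norm sums to n = 4
```

With equal weights every exponent equals 1, and the pseudo-log-likelihood reduces to the ordinary log-likelihood.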

Under certain regularity conditions, the posterior is consistent under unequal probability sampling (Savitsky and Toth, 2016) and complex multi-stage sampling (Williams and Savitsky, 2021, 2020). However, posterior credible intervals will exhibit undercoverage due to clustering and population generation uncertainty that are not accounted for (León-Novelo and Savitsky, 2019; Gunawan et al., 2020). To address this, we extend the post-processing adjustment proposed in Williams and Savitsky (2021) to accommodate a mixture model setting with constrained parameters. Posterior samples are rescaled to recover the correct “sandwich” form of the asymptotic variance based on pseudo-MLE theory. Let $\widehat{\bm{\Theta}}_{m}$ denote the posterior estimates for Markov chain Monte Carlo (MCMC) sample $m$, with mean $\overline{\bm{\Theta}}$ across all samples. The rescaled estimates are

\begin{equation}
\widehat{\bm{\Theta}}_{m}^{a}=\Big(\widehat{\bm{\Theta}}_{m}-\overline{\bm{\Theta}}\Big)\bm{R}_{2}^{-1}\bm{R}_{1}+\overline{\bm{\Theta}},
\tag{3}
\end{equation}

where $\bm{R}_{1}^{T}\bm{R}_{1}$ is the correct asymptotic “sandwich” covariance of the pseudo-MLE and $\bm{R}_{2}^{T}\bm{R}_{2}$ is the asymptotic covariance of the posterior. $\bm{R}_{1}$ can be obtained using a mix of resampling and computing the posterior Hessian matrix. For the post-processing adjustment, using the normalization constant $\kappa$ for the sampling weights is not strictly necessary for correct uncertainty coverage but can improve numerical stability when computing $\bm{R}_{1}$. Some alternative variations of $\kappa$ may lead to smaller post-processing adjustments being needed, for example using an effective sample size based on variation of the weights (Spencer, 2000).
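
A short calculation shows why the rescaling in equation (3) recovers the target covariance. It is a sketch under two assumptions: each draw $\widehat{\bm{\Theta}}_{m}$ is treated as a row vector, and the covariance of the unadjusted draws is taken to be exactly $\bm{R}_{2}^{T}\bm{R}_{2}$ with $\bm{R}_{2}$ invertible.

```latex
\begin{aligned}
\operatorname{Cov}\big(\widehat{\bm{\Theta}}_{m}^{a}\big)
&= \big(\bm{R}_{2}^{-1}\bm{R}_{1}\big)^{T}\,
   \operatorname{Cov}\big(\widehat{\bm{\Theta}}_{m}\big)\,
   \big(\bm{R}_{2}^{-1}\bm{R}_{1}\big) \\
&= \bm{R}_{1}^{T}\bm{R}_{2}^{-T}\big(\bm{R}_{2}^{T}\bm{R}_{2}\big)\bm{R}_{2}^{-1}\bm{R}_{1} \\
&= \bm{R}_{1}^{T}\bm{R}_{1}.
\end{aligned}
```

Because the draws are centered at $\overline{\bm{\Theta}}$ before rescaling and $\overline{\bm{\Theta}}$ is added back afterwards, the adjustment changes the spread of the samples but leaves their mean unchanged.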

3 Parameter Estimation

3.1 MCMC Computation

For SWOLCA parameter estimation, we implement an MCMC Gibbs sampling algorithm. We follow Moran et al. (2021) and implement sampling in two stages: (1) an adaptive sampler estimates the appropriate number of dietary patterns, and (2) a fixed sampler generates model estimates based on the estimated number of patterns. Derivations of the full conditionals are provided in the Supplementary Materials. Proper mixing is encouraged via a random permutation sampler that is incorporated in the MCMC sampling algorithm (Frühwirth-Schnatter, 2001). Stan (Carpenter et al., 2017) is used in the calculation of the post-processing variance adjustment, as it offers automatic differentiation capabilities to compute the posterior gradient and Hessian. Due to issues with handling discrete latent variables in mixture model settings, Stan was not implemented for parameter sampling.
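
The idea behind a random permutation step can be sketched generically: at the end of an iteration the pattern labels are shuffled, and every label-indexed quantity is relabeled consistently. The function below is an illustrative assumption about how such a step might look, not the authors' code; the name `permute_labels`, the class-first layout of `theta_by_class`, and the zero-based labels are all hypothetical.

```python
import random

def permute_labels(pi, theta_by_class, xi, c, rng):
    """Randomly relabel the K patterns, keeping parameters and assignments aligned."""
    K = len(pi)
    perm = list(range(K))
    rng.shuffle(perm)                            # perm[new_label] = old_label
    new_of_old = [0] * K
    for new, old in enumerate(perm):
        new_of_old[old] = new                    # inverse map: old label -> new label
    pi_new = [pi[old] for old in perm]
    theta_new = [theta_by_class[old] for old in perm]
    xi_new = [xi[old] for old in perm]
    c_new = [new_of_old[ci] for ci in c]
    return pi_new, theta_new, xi_new, c_new

# Hypothetical state with K = 3 patterns and five individuals.
rng = random.Random(1)
pi = [0.5, 0.3, 0.2]
theta_by_class = ["theta_A", "theta_B", "theta_C"]   # placeholders for per-pattern arrays
xi = [[-0.5, 0.3], [0.2, -0.1], [0.8, 0.0]]
c = [0, 2, 1, 0, 2]
pi_p, theta_p, xi_p, c_p = permute_labels(pi, theta_by_class, xi, c, rng)
```

Each individual keeps the same parameters after relabeling, so the (pseudo-)likelihood is unchanged and only the labels move.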

3.2 Mixture Reference Coding of Parameters

A common concern with mixture models is label switching (Stephens, 2000). Under a reference cell coding scheme, label switching turns the intercept and slope coefficients into noise due to switches in the reference pattern. Alternative coding schemes that have been used do not consider pattern-by-covariate interactions and lead to restrictions of the parameter space that are difficult to interpret under a probit link function (Molitor et al., 2010; Stephenson et al., 2022). We resolve this by introducing a combination of factor variable (Buis, 2012) and reference cell coding, hereafter referred to as “mixture reference coding.” In mixture reference coding, the different dietary patterns are expressed in factor variable form, while the levels of any additional covariates are expressed in reference cell form. For example, suppose $c_{i}\in\{1,2,3\}$ is individual $i$’s dietary pattern assignment and $v_{i}\in\{0,1\}$ is a binary covariate. Mixture reference coding for the probit regression model is given by:

\begin{equation}
\begin{aligned}
\mathbb{E}(y_{i}|c_{i},v_{i})
&=\Phi\Big\{\xi_{11}I(c_{i}=1)+\xi_{12}I(c_{i}=1)v_{i}\\
&\qquad+\xi_{21}I(c_{i}=2)+\xi_{22}I(c_{i}=2)v_{i}\\
&\qquad+\xi_{31}I(c_{i}=3)+\xi_{32}I(c_{i}=3)v_{i}\Big\}.
\end{aligned}
\tag{4}
\end{equation}

Essentially, each dietary pattern has its own reference parameter and corresponding regression model. All interactions between dietary pattern and additional covariates are captured, and additional interactions between covariates can be specified if desired. This balances flexibility, by allowing for interactions, with parsimony, by not forcing inclusion of all interactions between variables. It also allows any dietary pattern to be set as the reference level post-hoc. Using mixture reference coding, label switching can be resolved by adapting a post-processing hierarchical clustering relabeling approach (Krebs, 1999; Medvedovic and Sivaganesan, 2002; Stephenson et al., 2022).
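
The coding in equation (4) can be made concrete with a small sketch that builds the design row for a given pattern and binary covariate, evaluates the probit probability, and re-expresses the coefficients relative to a chosen reference pattern. The function names and the coefficient values are hypothetical; only the structure follows equation (4).

```python
import math

def design_row(c, v, K):
    """Mixture reference coding row for pattern c in 1..K and covariate v.

    Columns are ordered (I(c=1), I(c=1)v, I(c=2), I(c=2)v, ...), matching
    (xi_11, xi_12, xi_21, xi_22, ...) in equation (4).
    """
    row = []
    for k in range(1, K + 1):
        indicator = 1.0 if c == k else 0.0
        row.extend([indicator, indicator * v])
    return row

def probit_probability(c, v, xi):
    """E(y | c, v) = Phi(design_row . xi), with xi[k-1] = [intercept, slope] for pattern k."""
    flat = [coef for pair in xi for coef in pair]
    eta = sum(d * b for d, b in zip(design_row(c, v, len(xi)), flat))
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))   # standard normal CDF

def to_reference_cell(xi, ref=1):
    """Post-hoc contrasts of each pattern's (intercept, slope) against pattern `ref`."""
    base = xi[ref - 1]
    return {k: (pair[0] - base[0], pair[1] - base[1])
            for k, pair in enumerate(xi, start=1) if k != ref}

xi = [[-0.5, 0.3], [0.2, -0.1], [0.8, 0.0]]      # hypothetical coefficients
row = design_row(2, 1, 3)                        # [0, 0, 1, 1, 0, 0]
prob = probit_probability(2, 1, xi)              # Phi(0.2 - 0.1)
contrasts = to_reference_cell(xi, ref=1)
```

Because each pattern carries its own intercept and slope, a relabeling of patterns only reorders pairs of coefficients, and contrasts against any reference pattern can be formed after sampling.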

4 Simulation Study

4.1 Simulation Design

We conduct a simulation study to assess whether the proposed SWOLCA is able to produce valid estimation and inference of a target population sampled under complex survey designs. Our parameters of interest are the number of dietary patterns, $K$, their estimated prevalences, $\bm{\pi}$, the composition of each pattern, $\bm{\theta}$, and the associations between the patterns and the observed outcome, $\bm{\xi}$. We compare our method to two alternatives: 1) an unweighted SOLCA that ignores survey design, and 2) a two-step approach where the first step fits an unsupervised weighted overfitted latent class analysis (WOLCA) to derive the dietary patterns (Stephenson et al., 2024), and the second step treats the pattern assignments as fixed and includes them as covariates in a survey-weighted regression model using R survey package version 4.1.1 (Lumley, 2004). All models are implemented in R version 4.2.0 (R Core Team, 2023) with C++ interface using the Rcpp package version 1.0.10 (Eddelbuettel and François, 2011). We run a Gibbs sampler for 20,000 iterations with 10,000 burn-in and thinning every 5 iterations.
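
As a quick check of what these sampler settings imply, the sketch below counts the draws retained after burn-in and thinning. Exactly which iterations are kept is an assumption here (zero-based indexing, keeping the first post-burn-in iteration and every fifth one thereafter).

```python
n_iter, burn_in, thin = 20_000, 10_000, 5

# Zero-based indices of the iterations retained after burn-in and thinning.
kept = list(range(burn_in, n_iter, thin))
n_kept = len(kept)                               # (20000 - 10000) / 5 retained draws
```

Under these settings each chain contributes 2,000 posterior draws to the summaries.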

Data are generated for a finite population of size $N=80{,}000$ with a total of $K=3$ dietary patterns that are also associated with a binary outcome. Each pattern consists of $J=30$ categorical food items, consumed at one of $R=4$ levels. Survey features in the data include clustered outcomes and two unequal-sized strata in the population, with stratum membership influencing dietary pattern membership and the probability of the outcome. Full details of the data generation process are provided in the Supplementary Materials.

Model performance is evaluated for the sampling and data generating scenarios provided in Table 1. We examine three survey designs: simple random sampling (SRS); stratified sampling with unequal sampling probabilities; or stratified cluster sampling with unequal sampling probabilities and correlated outcomes. We focus on two associations of interest: a conditional outcome model with stratum included as a covariate; or a marginal outcome model that does not condition on selection or adjust for selection bias. We also compare three sample sizes: 1% (n = 800), 5% (n = 4000), or 10% of the population (n = 8000). Bold text indicates deviation from the default setting (scenario 2) of stratified sampling with a conditional model and sample size 4000. Model robustness is also evaluated in cases where a) additional confounders are included, b) latent patterns are defined with weak identifiability, and c) weakly separated patterns are defined with a few differing exposure variables driving the true association to the outcome. Descriptions and results for these additional scenarios are not shown here but are detailed in the Supplementary Materials.

100 simulated datasets are generated for each scenario. Models are initialized with $K=30$ and Dirichlet hyperparameter $\alpha_{k}=1/K$ for all $k\in\{1,\ldots,K\}$ to encourage sparsity and moderate growth of new pattern formation (Van Havre et al., 2015). A noninformative flat Dir(1) prior is used for $\bm{\theta}_{jc_{i}\cdot}$, and weakly informative priors are used for the regression parameters $\bm{\xi}$. To compare model performance for parameter estimation, we examine mean absolute bias (mean absolute distance between estimated and true parameter values), variability (full width of the 95% credible interval (CI), averaged over dietary patterns), and coverage (proportion of 95% CIs that cover the true population parameter values, averaged over dietary patterns).
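
The three performance metrics can be sketched for a single simulated dataset as follows; averaging the returned values over the simulated datasets would then give table-style summaries. The use of the posterior mean as the point estimate, equal-tailed quantile intervals, and the names `quantile` and `summarize` are assumptions made for this example.

```python
def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def summarize(samples, truth):
    """samples[m][p] is posterior draw m of parameter p; truth[p] is its true value."""
    n_par = len(truth)
    bias = width = cover = 0.0
    for p in range(n_par):
        draws = sorted(s[p] for s in samples)
        post_mean = sum(draws) / len(draws)      # assumed point estimate
        lower, upper = quantile(draws, 0.025), quantile(draws, 0.975)
        bias += abs(post_mean - truth[p])
        width += upper - lower
        cover += 1.0 if lower <= truth[p] <= upper else 0.0
    return {"abs_bias": bias / n_par,
            "ci_width": width / n_par,
            "coverage": cover / n_par}

# Toy check: 101 evenly spaced draws on [0, 1] for one parameter with truth 0.5.
samples = [[m / 100.0] for m in range(101)]
result = summarize(samples, [0.5])
```

For one dataset the coverage entry is the fraction of parameters whose interval contains the truth; nominal 95% coverage is only meaningful after averaging over repeated samples from the population.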

4.2 Simulation Results

Table 1: Absolute bias, 95% credible interval width, and coverage for the unweighted SOLCA, two-step WOLCA, and proposed SWOLCA, based on posterior MCMC samples and averaged across 100 independent draws from the population. Strat = stratified sampling, Strat Cl = stratified cluster sampling, Cond = conditional model, Marg = marginal model. Notable issues of bias, imprecision, and undercoverage are underlined to improve readability.
| Scenario | Model | Abs. bias $K$ | Abs. bias $\bm{\pi}$ | Abs. bias $\bm{\theta}$ | Abs. bias $\bm{\xi}$ | CI width $\bm{\pi}$ | CI width $\bm{\theta}$ | CI width $\bm{\xi}$ | Coverage $\bm{\pi}$ | Coverage $\bm{\theta}$ | Coverage $\bm{\xi}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| (1) SRS, Cond, n=4000 | SOLCA | 0.00 | 0.006 | 0.006 | 0.063 | 0.027 | 0.042 | 0.367 | 0.957 | 0.958 | 0.965 |
| | WOLCA | 0.00 | 0.006 | 0.006 | 0.063 | 0.036 | 0.044 | 0.762 | 0.950 | 0.958 | 0.992 |
| | SWOLCA | 0.00 | 0.006 | 0.006 | 0.063 | 0.027 | 0.042 | 0.419 | 0.947 | 0.953 | 0.983 |
| (2) Strat, Cond, n=4000 | SOLCA | 0.00 | 0.081 | 0.006 | 0.047 | 0.069 | 0.045 | 0.374 | 0.190 | 0.962 | 0.972 |
| | WOLCA | 0.00 | 0.006 | 0.007 | 0.043 | 0.031 | 0.045 | 0.672 | 0.957 | 0.933 | 0.998 |
| | SWOLCA | 0.00 | 0.006 | 0.006 | 0.044 | 0.036 | 0.049 | 0.414 | 0.977 | 0.952 | 0.990 |
| (3) Strat Cl, Cond, n=4000 | SOLCA | 0.00 | 0.082 | 0.006 | 0.132 | 0.074 | 0.046 | 0.390 | 0.223 | 0.966 | 0.592 |
| | WOLCA | 0.00 | 0.006 | 0.006 | 0.127 | 0.037 | 0.044 | 1.210 | 0.963 | 0.942 | 0.990 |
| | SWOLCA | 0.00 | 0.006 | 0.006 | 0.126 | 0.031 | 0.047 | 0.816 | 0.950 | 0.942 | 0.963 |
| (4) Strat, Marg, n=4000 | SOLCA | 0.00 | 0.008 | 0.006 | 0.203 | 0.062 | 0.043 | 0.162 | 0.963 | 0.958 | 0.063 |
| | WOLCA | 0.00 | 0.016 | 0.007 | 0.031 | 0.107 | 0.049 | 0.348 | 0.947 | 0.939 | 0.993 |
| | SWOLCA | 0.00 | 0.011 | 0.007 | 0.033 | 0.097 | 0.063 | 0.278 | 0.967 | 0.965 | 0.987 |
| (5) Strat, Cond, n=8000 | SOLCA | 0.00 | 0.080 | 0.005 | 0.049 | 0.076 | 0.042 | 0.367 | 0.227 | 0.972 | 0.980 |
| | WOLCA | 0.06 | 0.010 | 0.011 | 0.038 | 0.044 | 0.044 | 0.519 | 0.920 | 0.908 | 0.960 |
| | SWOLCA | 0.00 | 0.004 | 0.005 | 0.030 | 0.029 | 0.038 | 0.373 | 0.967 | 0.953 | 0.997 |
| (6) Strat, Cond, n=800 | SOLCA | 0.00 | 0.084 | 0.013 | 0.098 | 0.064 | 0.088 | 0.701 | 0.027 | 0.938 | 0.945 |
| | WOLCA | 0.00 | 0.013 | 0.014 | 0.099 | 0.060 | 0.095 | 1.371 | 0.933 | 0.919 | 0.983 |
| | SWOLCA | 0.00 | 0.013 | 0.014 | 0.097 | 0.062 | 0.099 | 0.687 | 0.947 | 0.922 | 0.947 |

For all models and scenarios, investigation of traceplots and autocorrelation plots showed good mixing and convergence of all model parameters. Table 1 displays a summary of simulation results for the scenarios described. As expected, under the control SRS scenario, all three models exhibit good estimation and coverage properties. For other scenarios with a variety of complex survey design and data-generating features, the proposed SWOLCA outperforms the two alternative models and is able to obtain accurate and precise estimation, as well as approximately nominal coverage, for all parameters.

The unweighted SOLCA model gives highly biased estimates of the pattern membership probabilities $\bm{\pi}$ when there is stratified sampling. When there is cluster sampling, credible intervals for the regression coefficients $\bm{\xi}$ exhibit severe undercoverage, which can result in overconfident estimation of associational effects. When the selection is associated with hypertension and a marginal model is fit, SOLCA yields biased estimates for $\bm{\xi}$.

The two-step WOLCA model produces estimates of regression parameters $\bm{\xi}$ that have wide credible intervals and are less precise than the SWOLCA model at similar coverage levels. This inefficiency is especially true for small sample sizes and cluster sampling designs because the two-step process ignores uncertainty in the first step. This inflates interval widths and makes inference on the true associational effects difficult. WOLCA is also the most prone to undercoverage of $\theta$ due to failure to account for variability in the plug-in survey weights in the first step, and it runs into issues with estimating the number of dietary patterns, $K$.

SWOLCA yields estimates with minimal bias and approximately nominal interval coverage for all parameters for stratified and cluster sampling designs. It is also able to use the survey weights to account for bias from selection variables that are unavailable for analysis, enabling correct marginal estimation of $\xi$ and producing outcome probability estimates that accommodate informative designs without greatly inflating uncertainty (Web Figure 2). In the cluster sampling and 1% sample size scenarios, there is slight undercoverage of $\bm{\theta}$. This is expected given the increased variability of the data and is also seen in the SOLCA and WOLCA comparison models. These conclusions were consistent in settings with weaker patterns (mode 55%), overlapping patterns where consumption of many foods is the same for two patterns, different sample sizes, and additional regression covariates.

5 Application to NHANES Low-Income Women

5.1 Data Description and Model Setup

The National Health and Nutrition Examination Survey (NHANES) is a cross-sectional, nationally-representative survey that assesses the health and nutritional status of the non-institutionalized civilian US population (National Center for Health Statistics, 2023). The survey employs a stratified, clustered, four-stage sampling design with oversampling to increase inclusion of various age, sex, income, and racial and ethnic groups. Data are publicly available alongside survey sampling weights that take into account unequal sampling probabilities, stratification, clustering, non-response, weight trimming, and calibration (Chen et al., 2020). Data are pooled from two survey cycles, 2015-2016 and 2017-2018, in accordance with protocols outlined in the NHANES analytic guidelines (National Center for Health Statistics, 2018). We focus on dietary patterns associated with hypertension among adult females aged 20 or over who are classified as low-income (reported household income at or below 185% of the federal poverty level, consistent with eligibility requirements for federal assistance program participation (Oliveira and Frazão, 2015)). Pregnant or breastfeeding women are excluded ($n=179$), resulting in a total sample size of $n=2003$.

Dietary exposure variables are defined as 28 food item groups collected from two 24-hour dietary recalls and summarized into food pattern equivalents from the Food and Nutrition Database for Dietary Studies (Dietary Guidelines Advisory Committee, 2015; Bowman et al., 2020). Each food item is categorized as none, low, medium, or high, based on relative tertiles of positive consumption (Sotres-Alvarez et al., 2013; Stephenson and Willett, 2023). The binary observed outcome, hypertension, is defined as a composite measure of having an elevated blood pressure (BP) reading (systolic BP $>130$ mm Hg or diastolic BP $>80$ mm Hg), self-reported diagnosis, or use of hypertension-controlling medication. Age, race and ethnicity, current smoking status, and physical activity are included as potential confounders in our hypertensive outcome regression model. Web Table 3 displays summaries of these demographic characteristics by hypertension in the sample.
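
The exposure and outcome coding described above can be sketched as follows. This is an illustrative reading rather than the authors' preprocessing code: it assumes one already-summarized intake amount per person for a given food item, uses the default tertile cut points from Python's `statistics.quantiles`, and the function names are hypothetical.

```python
import statistics


def consumption_levels(amounts):
    """Code intakes as 0 = none, 1 = low, 2 = medium, 3 = high.

    Tertile cut points are computed among positive intakes only.
    """
    positive = [a for a in amounts if a > 0]
    if len(positive) < 2:
        # Too few consumers to form tertiles; treat positive intake as low.
        return [1 if a > 0 else 0 for a in amounts]
    low_cut, high_cut = statistics.quantiles(positive, n=3)
    levels = []
    for a in amounts:
        if a <= 0:
            levels.append(0)
        elif a <= low_cut:
            levels.append(1)
        elif a <= high_cut:
            levels.append(2)
        else:
            levels.append(3)
    return levels


def is_hypertensive(systolic, diastolic, diagnosed, on_medication):
    """Composite outcome: elevated BP reading, diagnosis, or medication."""
    return systolic > 130 or diastolic > 80 or diagnosed or on_medication


levels = consumption_levels([0, 1, 2, 3, 4, 5, 6])
```

How ties at the cut points and the two recall days are handled in the actual analysis is not specified here, so those details should be treated as open choices.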

We compare the proposed SWOLCA model and the unweighted SOLCA model in assessing diet-driven hypertension using survey data. Both models are initialized with the same priors used in the simulation study. Full details of posterior computation are provided in the Supplementary Materials. Estimates are obtained by running a Gibbs sampler for 20,000 iterations with 10,000 burn-in iterations and thinning every 5 iterations, then summarized using posterior medians and 95% credible intervals. Both the adaptive and fixed samplers are run for 20,000 iterations each on an Apple M1 Pro computer with 8 cores, with a computation time of roughly 75 minutes. If $\widehat{K}$ is set a priori and only the fixed sampler is run, the computation time is approximately 15 minutes, assuming the same computing power.
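
To make the sampler settings concrete, the sketch below counts the draws retained after burn-in and thinning and summarizes a stand-in chain by its median. The exact offset at which thinning starts is an assumption, and the synthetic `chain` is a placeholder, not output from the model.

```python
import statistics


def retained_indices(n_iter=20000, burn_in=10000, thin=5):
    """Indices of iterations kept after discarding burn-in and thinning."""
    return list(range(burn_in, n_iter, thin))


kept = retained_indices()
print(len(kept))

# Placeholder chain standing in for MCMC output of one scalar parameter.
chain = [0.001 * t for t in range(20000)]
posterior_draws = [chain[t] for t in kept]
posterior_median = statistics.median(posterior_draws)
```

With these settings, (20,000 - 10,000) / 5 = 2,000 draws remain for posterior summaries.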

5.2 Dietary Pattern Results

Both SWOLCA and SOLCA identify $\widehat{K}=5$ diet-hypertension patterns among low-income women in the US, displayed in Figure 1 with characterization differences indicated by black dots. For all patterns, consumption behavior is colored by the modal (i.e., highest posterior probability) consumption level (none, low, medium, or high) for each of the 28 food items. We can see differences in modal food consumption for all patterns except Pattern 2 (Healthy American), illustrating the influence of survey weights on pattern composition.

Figure 1: Diet-hypertension patterns identified by the weighted SWOLCA (panel a) and unweighted SOLCA (panel b) models among low-income women in the US. Differences in modal consumption are indicated with black dots. Consumption levels are categorized as none, low, medium, and high. For each pattern, consumption of each food component is colored according to the modal consumption level (i.e., $\text{argmax}_{r}\,\theta_{jkr}$ for $r=1,\ldots,4$, $j=1,\ldots,28$, $k=1,\ldots,5$).
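
The modal consumption level used to color the figure is an argmax over the four level probabilities for each food item and pattern. A minimal sketch, using small made-up probability arrays (two items, two patterns) rather than fitted values, and hypothetical helper names:

```python
LEVELS = ["none", "low", "medium", "high"]


def modal_levels(theta):
    """theta[j][k] lists the 4 level probabilities for item j in pattern k.

    Returns labels[j][k], the level with the highest probability.
    """
    return [
        [LEVELS[max(range(len(probs)), key=probs.__getitem__)] for probs in item]
        for item in theta
    ]


def differing_cells(modes_a, modes_b):
    """(item, pattern) pairs where two models disagree on the modal level."""
    return [
        (j, k)
        for j, (row_a, row_b) in enumerate(zip(modes_a, modes_b))
        for k, (a, b) in enumerate(zip(row_a, row_b))
        if a != b
    ]


# Made-up probabilities for illustration only.
theta_weighted = [
    [[0.1, 0.2, 0.3, 0.4], [0.7, 0.1, 0.1, 0.1]],
    [[0.25, 0.45, 0.2, 0.1], [0.1, 0.1, 0.6, 0.2]],
]
theta_unweighted = [
    [[0.1, 0.2, 0.4, 0.3], [0.7, 0.1, 0.1, 0.1]],
    [[0.25, 0.45, 0.2, 0.1], [0.1, 0.1, 0.6, 0.2]],
]
modes_w = modal_levels(theta_weighted)
modes_u = modal_levels(theta_unweighted)
dots = differing_cells(modes_w, modes_u)
```

The `dots` list plays the role of the black dots in the figure, marking cells where the weighted and unweighted models assign different modal levels.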

We continue by focusing on the patterns identified by SWOLCA (Figure 1(a)). We characterize the five diet-hypertension patterns as follows: 1) Multicultural, 2) Healthy American, 3) Western, 4) Restricted Vegetarian, and 5) Restricted American. The Multicultural pattern favors high consumption of organ meat, starchy and dark green vegetables, oils, and fruits. It is referred to as the Multicultural pattern given its large prevalence among those identifying as NH Asian and groups other than NH White (Table 2). The Healthy American pattern favors a higher consumption of healthier foods such as fruits, vegetables, whole grains, organ meat, and nuts, but still includes high consumption of oils, fats, and sugars prevalent in many American diets. The Western pattern favors a high consumption of refined grains, poultry, cheese, oils, solid fats, added sugars, and other vegetables. Examining Figure 2, which provides detailed consumption level probabilities by pattern for each food item, we see that individuals assigned to the Western pattern are also the most likely to consume meat and cured meats, though this consumption is heterogeneous. The Restricted Vegetarian pattern favors no consumption of many foods including meat and seafood. There is moderate consumption of refined grains and low consumption of whole grains, poultry, eggs, milk, oils, solid fats, and added sugar. This population may face significant food access issues such as residence in or near a food desert or food swamp. Finally, the Restricted American pattern favors low consumption of many foods but to a lesser extent than the Restricted Vegetarian diet and with some intake of organ meats and poultry. This diet is similar to the Healthy American diet but with relatively lower consumption, especially for legumes and vegetables.

Figure 2: Detailed breakdown of consumption level probabilities by diet-hypertension pattern for each food component for the five diet-hypertension patterns identified by the SWOLCA model among low-income women in the US. Pattern names: 1) Multicultural, 2) Healthy American, 3) Western, 4) Restricted Vegetarian, and 5) Restricted American.

Examining the size and distribution of the diet-hypertension patterns across demographic variables (Table 2), we see that the Western diet is most prevalent (26.7% of the population) and the Multicultural diet is least prevalent (10.7%). Those who follow the Multicultural diet tend to be younger, NH Asian, non-smokers, and physically active. Conversely, those who follow the Restricted American diet tend to be older, NH White, and current smokers. Those who follow the Restricted Vegetarian diet also tend to be older, current smokers, and physically inactive.

Table 2: Size and demographic distribution of dietary patterns for low-income women in the US. Estimates for $N$ use posterior samples of parameter $\pi$. Column-wise mean and percentage estimates of demographic variables are calculated using sampling weights.

| Variable | Level | Multicultural | Healthy American | Western | Restricted Vegetarian | Restricted American | Overall |
|---|---|---|---|---|---|---|---|
| $N$: % (posterior SD %) | | 10.7 (4.2) | 24.4 (3.8) | 26.7 (3.9) | 19.9 (4.3) | 18.3 (5.4) | |
| $n$: % | | 11.5 | 23.6 | 25.8 | 20.7 | 18.4 | |
| Age Group: % | [20,40) | 51.7 | 47.1 | 44.8 | 33.2 | 35.3 | 41.9 |
| | [40,60) | 28.6 | 29.1 | 33.4 | 37.1 | 33.1 | 32.6 |
| | $\geq 60$ | 19.6 | 23.9 | 21.9 | 29.7 | 31.6 | 25.5 |
| Race and Ethnicity: % | NH White | 36.9 | 50.1 | 49.0 | 49.7 | 55.0 | 49.3 |
| | NH Black | 18.2 | 14.7 | 17.7 | 15.6 | 16.3 | 16.3 |
| | NH Asian | 20.2 | 2.2 | 3.4 | 7.1 | 3.4 | 5.6 |
| | Hispanic/Latino | 23.6 | 27.2 | 23.8 | 20.4 | 22.2 | 23.6 |
| | Other/Mixed | 1.2 | 5.8 | 6.1 | 7.2 | 3.1 | 5.2 |
| Smoking Status: % | Non-Smoker | 85.3 | 72.7 | 75.3 | 69.9 | 69.3 | 73.5 |
| | Smoker | 14.7 | 27.3 | 24.7 | 30.1 | 30.7 | 26.5 |
| Physical Activity: % | Inactive | 41.0 | 47.0 | 45.5 | 47.3 | 41.7 | 45.1 |
| | Active | 59.0 | 53.0 | 54.5 | 52.7 | 58.3 | 54.9 |
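
The weighted percentages in Table 2 follow the usual survey-weighted proportion: each person contributes their sampling weight rather than a count of one. A minimal sketch with made-up levels and weights (the function name is hypothetical, and this is not the authors' code):

```python
def weighted_percentages(levels, weights):
    """Percentage of total sampling weight falling in each level."""
    totals = {}
    for level, w in zip(levels, weights):
        totals[level] = totals.get(level, 0.0) + w
    grand_total = sum(totals.values())
    return {level: 100.0 * t / grand_total for level, t in totals.items()}


# Made-up example: three respondents assigned to one pattern.
smoking = ["Smoker", "Non-Smoker", "Non-Smoker"]
sampling_weights = [100.0, 300.0, 600.0]
pct = weighted_percentages(smoking, sampling_weights)
```

In this toy example the unweighted share of smokers is one in three, while the weighted share is 10%, which shows how the two summaries can diverge when weights vary.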

5.3 Dietary Patterns and Hypertension Risk

Table 3 displays the posterior $\bm{\xi}$ estimates for the main dietary pattern effects on hypertension for SWOLCA and SOLCA. The full tables of regression estimates are provided in the Supplementary Materials. Estimates for the unweighted SOLCA differ greatly from those of SWOLCA, indicating the presence of selection variables that influence the outcome but were not included as covariates. This results in inaccurate contributions of individuals to the estimation of hypertension probabilities because sampling weights have not been considered in the regression estimation process. The unweighted model produces fewer interaction effects and reduced ability to distinguish effects between the patterns. Credible intervals are also much tighter when the survey design is not incorporated. This leads to incorrect inference, as illustrated by the undercoverage shown in the simulation study.

Table 3: Main effect probit regression parameter estimates for the proposed SWOLCA and the unweighted SOLCA, adjusting for demographic confounders. Reference group: Multicultural diet, age [20,40), NH White, non-smoker, inactive.

| Covariate | SWOLCA Estimate | SWOLCA 95% CI | SWOLCA $P(\xi>0)$ | SOLCA Estimate | SOLCA 95% CI | SOLCA $P(\xi>0)$ |
|---|---|---|---|---|---|---|
| (Intercept) | -1.66 | (-2.92, -0.41) | <0.01 | -1.01 | (-1.61, -0.41) | <0.01 |
| Multicultural | 1 (reference) | - | - | - | - | - |
| Healthy Amer | 0.45 | (-1.12, 1.96) | 0.70 | 0.05 | (-0.68, 0.81) | 0.56 |
| Western | 0.49 | (-0.86, 1.83) | 0.75 | 0.02 | (-0.71, 0.70) | 0.53 |
| Restricted Veg | 0.67 | (-0.88, 2.17) | 0.79 | -0.11 | (-0.89, 0.64) | 0.39 |
| Restricted Amer | 1.01 | (-0.21, 2.15) | 0.96 | 0.14 | (-0.65, 0.91) | 0.63 |

Focusing on the SWOLCA results, all diets appear to be associated with increased probability of hypertension compared to the Multicultural diet, with the Restricted diets showing the strongest increase (Table 3). Figure 3 displays hypertension probabilities of the diet-hypertension patterns as well as interaction effects between the patterns and socio-demographic variables. The estimated hypertension probabilities for the patterns are 4.8% (Multicultural), 11.1% (Healthy American), 12.2% (Western), 15.9% (Restricted Vegetarian), and 25.2% (Restricted American), among those aged 20 to 39, identifying as NH White, not currently smoking, and physically inactive. Age has a strong, positive association with the probability of hypertension, and the differences among patterns are most pronounced in the 40 to 60 age group, where the Multicultural diet outcome probability remains low (13%) whereas the Restricted American diet sees a large increase (74%). Among different racial and ethnic groups, probability of hypertension is higher among those identifying as NH Black compared to NH White for all patterns, and the Restricted American pattern remains at high risk across all groups. The Healthy American pattern sees the largest increase in outcome probability among smokers, and the Restricted American pattern sees the largest decrease among those who are active.
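
The baseline hypertension probabilities can be approximately recovered from the Table 3 point estimates through the probit link, $P(y=1)=\Phi(\text{intercept}+\text{pattern effect})$. The sketch below plugs in the rounded SWOLCA estimates, treating the Multicultural reference pattern as a zero offset (an assumption about the coding). Because it uses rounded point estimates rather than posterior summaries of the probabilities themselves, agreement with the reported values is only approximate.

```python
import math


def probit_probability(linear_predictor):
    """Standard normal CDF, Phi(x), computed via the error function."""
    return 0.5 * (1.0 + math.erf(linear_predictor / math.sqrt(2.0)))


# Rounded SWOLCA main-effect estimates from Table 3.
intercept = -1.66
pattern_effects = {
    "Multicultural": 0.0,  # reference pattern (assumed zero offset)
    "Healthy American": 0.45,
    "Western": 0.49,
    "Restricted Vegetarian": 0.67,
    "Restricted American": 1.01,
}

probabilities = {
    name: probit_probability(intercept + effect)
    for name, effect in pattern_effects.items()
}
for name, p in probabilities.items():
    print(f"{name}: {100 * p:.1f}%")
```

The resulting values are close to the 4.8% to 25.2% range reported for the baseline covariate profile and preserve the same ordering of patterns.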

Figure 3: Estimated probability of hypertension outcome by diet-hypertension pattern for all covariates, including interactions with pattern, for the SWOLCA model. For each covariate plot, all other covariates are set to the following baseline values: Multicultural diet, age [20,40), NH White, non-smoker, and inactive.

6 Discussion

In this work, we develop the supervised weighted overfitted latent class analysis (SWOLCA), which is a Bayesian joint mixture model that can be used to: 1) elicit dietary patterns informed by both a set of categorical dietary intake exposures and a binary hypertension outcome, with a data-driven approach to determine the number of patterns; 2) capture small diet-hypertension effects and interactions with covariates via mixture reference coding; and 3) obtain unbiased estimation and inference for the population by adjusting for complex survey design features such as stratification, clustering, and informative sampling. Although our method is designed for categorical exposures and a binary outcome, extensions to other outcome data types can be made by replacing the probit likelihood with a regression likelihood that can accommodate different outcome data types (e.g., multinomial, ordinal, continuous). This remains an area of active research.

Simulation studies confirmed that SWOLCA improved accuracy, precision, and coverage of parameter estimation compared to models that did not include sampling weights or relied on the two-step approach. Incorporation of survey design features was important for accurate estimation of the pattern prevalence, pattern identification, exposure-outcome association, and variance. Incorporation of the outcome into clustering improved precision of regression estimates and led to better identification of patterns. Application of SWOLCA to NHANES 2015-2018 data identified five diet-hypertension patterns among low-income US women. Differences between the SWOLCA and SOLCA results illustrated the importance of accounting for survey design. Failure to include survey sampling weights changed the pattern compositions because individual contributions to the consumption level probabilities during pattern formation reflected the sample composition and were not necessarily representative of the population. Our model identified strong age effects and captured substantial heterogeneity among different racial and ethnic subgroups via interaction terms.

Our work suggests several areas for further improvement. Firstly, our model does not adjust for data reliability issues typically encountered in dietary data, such as measurement error, recall bias, and item non-response missingness. Secondly, diet consumption heterogeneity may be better captured by adapting methods that allow demographic or behavior-driven deviations of foods from the overall diet-disease pattern, while additionally incorporating survey design elements. Thirdly, our model is based on cross-sectional data and is not able to evaluate the impact of exposure changes over time or support a time-to-event analysis. Lastly, our model relies on a probit regression component that can be limited by computational stability. Computation may be improved by incorporating hierarchical priors or by exploring other distributions, such as a unified skew normal conjugate model (Anceschi et al., 2023). We leave these opportunities for extensions to future research.

Acknowledgements

The authors are grateful to Walter Willett for helpful comments on earlier versions of this work, and to the Co-Editor, Associate Editor, and two referees for insightful comments that greatly improved the paper. This research was supported by the National Institute of Allergy and Infectious Diseases (NIAID: T32 AI007358) and the National Heart, Lung, and Blood Institute (NHLBI: R25 HL105400 awarded to Victor G. Davila-Roman and DC Rao).

Supplementary Materials

Web Appendices, Tables, and Figures, and data and code referenced in Sections 2, 4, and 5 are available with this paper at the Biometrics website on Oxford Academic. Code for replicating the simulations and data analyses in this paper is also available on GitHub at https://github.com/smwu/SWOLCA and is currently being developed into an R package.

Data Availability Statement

Data used in this paper to illustrate our findings are publicly available at https://github.com/smwu/SWOLCA/ and were derived from the following resources available in the public domain: https://wwwn.cdc.gov/nchs/nhanes/.

References

  • Albert and Chib (1993) Albert, J. H. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association 88, 669–679.
  • Anceschi et al. (2023) Anceschi, N., Fasano, A., Durante, D., and Zanella, G. (2023). Bayesian conjugacy in probit, tobit, multinomial probit and extensions: A review and new results. Journal of the American Statistical Association 118, 1451–1469.
  • Asparouhov (2005) Asparouhov, T. (2005). Sampling weights in latent variable modeling. Structural Equation Modeling 12, 411–434.
  • Bowman et al. (2020) Bowman, S., Clemens, J., Friday, J., and Moshfegh, A. (2020). Food patterns equivalents database 2017–2018: methodology and user guide. Food Surveys Research Group: Beltsville, MD.
  • Bray et al. (2015) Bray, B. C., Lanza, S. T., and Tan, X. (2015). Eliminating bias in classify-analyze approaches for latent class analysis. Structural Equation Modeling 22, 1–11.
  • Buis (2012) Buis, M. L. (2012). Stata tip 106: With or without reference. The Stata Journal 12, 162–164.
  • Carpenter et al. (2017) Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., et al. (2017). Stan: A probabilistic programming language. Journal of Statistical Software 76, 1.
  • Chen et al. (2020) Chen, T. C., Clark, J., Riddles, M. K., Mohadjer, L. K., and Fakhouri, T. H. (2020). National Health and Nutrition Examination Survey, 2015-2018: Sample design and estimation procedures. Vital and Health Statistics. Series 2, Data Evaluation and Methods Research pages 1–35.
  • Dietary Guidelines Advisory Committee (2015) Dietary Guidelines Advisory Committee (2015). Dietary Guidelines for Americans 2015-2020. Government Printing Office.
  • Eddelbuettel and François (2011) Eddelbuettel, D. and François, R. (2011). Rcpp: Seamless R and C++ integration. Journal of Statistical Software 40, 1–18.
  • Elliott et al. (2020) Elliott, M. R., Zhao, Z., Mukherjee, B., Kanaya, A., and Needham, B. L. (2020). Methods to account for uncertainty in latent class assignments when using latent classes as predictors in regression models, with application to acculturation strategy measures. Epidemiology 31, 194–204.
  • Frühwirth-Schnatter (2001) Frühwirth-Schnatter, S. (2001). Markov chain Monte Carlo estimation of classical and dynamic switching and mixture models. Journal of the American Statistical Association 96, 194–209.
  • Fung et al. (2001) Fung, T. T., Willett, W. C., Stampfer, M. J., Manson, J. E., and Hu, F. B. (2001). Dietary patterns and the risk of coronary heart disease in women. Archives of Internal Medicine 161, 1857–1862.
  • Gunawan et al. (2020) Gunawan, D., Panagiotelis, A., Griffiths, W., and Chotikapanich, D. (2020). Bayesian weighted inference from surveys. Australian & New Zealand Journal of Statistics 62, 71–94.
  • Krebs (1999) Krebs, C. J. (1999). Ecological Methodology. Benjamin/Cummings, 2nd edition.
  • Kunihama et al. (2016) Kunihama, T., Herring, A., Halpern, C., and Dunson, D. (2016). Nonparametric Bayes modeling with sample survey weights. Statistics & Probability Letters 113, 41–48.
  • Lazarsfeld and Henry (1968) Lazarsfeld, P. F. and Henry, N. (1968). Latent Structure Analysis. Houghton, Mifflin.
  • León-Novelo and Savitsky (2019) León-Novelo, L. G. and Savitsky, T. D. (2019). Fully Bayesian estimation under informative sampling. Electronic Journal of Statistics 13, 1608–1645.
  • Lumley (2004) Lumley, T. (2004). Analysis of complex survey samples. Journal of Statistical Software 9, 1–19. R package version 2.2.
  • Medvedovic and Sivaganesan (2002) Medvedovic, M. and Sivaganesan, S. (2002). Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics 18, 1194–1206.
  • Molitor et al. (2010) Molitor, J., Papathomas, M., Jerrett, M., and Richardson, S. (2010). Bayesian profile regression with an application to the National Survey of Children’s Health. Biostatistics 11, 484–498.
  • Moran et al. (2021) Moran, K. R., Dunson, D., Wheeler, M. W., and Herring, A. H. (2021). Bayesian joint modeling of chemical structure and dose response curves. The Annals of Applied Statistics 15, 1405–1430.
  • National Center for Health Statistics (2018) National Center for Health Statistics (2018). National Health and Nutrition Examination Survey: Analytic guidelines, 2011–2014 and 2015–2016. Technical report, Centers for Disease Control and Prevention.
  • National Center for Health Statistics (2023) National Center for Health Statistics (2023). National Health and Nutrition Examination Survey home page.
  • Oliveira and Frazão (2015) Oliveira, V. and Frazão, E. (2015). The WIC program: Background, trends, and economic issues, 2015 edition. Economic Information Bulletin Number 134. Technical report, US Department of Agriculture, Economic Research Service.
  • Parker et al. (2022) Parker, P. A., Holan, S. H., and Janicki, R. (2022). Computationally efficient Bayesian unit-level models for non-Gaussian data under informative sampling with application to estimation of health insurance coverage. The Annals of Applied Statistics 16, 887–904.
  • Patterson et al. (2002) Patterson, B. H., Dayton, C. M., and Graubard, B. I. (2002). Latent class analysis of complex sample survey data: application to dietary data. Journal of the American Statistical Association 97, 721–741.
  • Pfeffermann (1996) Pfeffermann, D. (1996). The use of sampling weights for survey data analysis. Statistical Methods in Medical Research 5, 239–261.
  • R Core Team (2023) R Core Team (2023). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
  • Sacks et al. (2001) Sacks, F. M., Svetkey, L. P., Vollmer, W. M., Appel, L. J., Bray, G. A., Harsha, D., et al. (2001). Effects on blood pressure of reduced dietary sodium and the Dietary Approaches to Stop Hypertension (DASH) diet. New England Journal of Medicine 344, 3–10.
  • Savitsky and Toth (2016) Savitsky, T. D. and Toth, D. (2016). Bayesian estimation under informative sampling. Electronic Journal of Statistics 10, 1677–1708.
  • Sotres-Alvarez et al. (2010) Sotres-Alvarez, D., Herring, A. H., and Siega-Riz, A. M. (2010). Latent class analysis is useful to classify pregnant women into dietary patterns. The Journal of Nutrition 140, 2253–2259.
  • Sotres-Alvarez et al. (2013) Sotres-Alvarez, D., Siega-Riz, A. M., Herring, A. H., Carmichael, S. L., Feldkamp, M. L., Hobbs, C. A., et al. (2013). Maternal dietary patterns are associated with risk of neural tube and congenital heart defects. American Journal of Epidemiology page kws349.
  • Spencer (2000) Spencer, B. D. (2000). An approximate design effect for unequal weighting when measurements may correlate with selection probabilities. Survey Methodology 26, 137–138.
  • Stephens (2000) Stephens, M. (2000). Dealing with label switching in mixture models. Journal of the Royal Statistical Society Series B: Statistical Methodology 62, 795–809.
  • Stephenson et al. (2022) Stephenson, B. J., Herring, A. H., and Olshan, A. F. (2022). Derivation of maternal dietary patterns accounting for regional heterogeneity. Journal of the Royal Statistical Society Series C: Applied Statistics 71, 1957–1977.
  • Stephenson et al. (2024) Stephenson, B. J., Wu, S. M., and Dominici, F. (2024). Identifying dietary consumption patterns from survey data: a Bayesian nonparametric latent class model. Journal of the Royal Statistical Society Series A: Statistics in Society 187, 496–512.
  • Stephenson and Willett (2023) Stephenson, B. J. K. and Willett, W. C. (2023). Racial and ethnic heterogeneity in diets of low-income adult females in the United States: results from National Health and Nutrition Examination Surveys from 2011 to 2018. The American Journal of Clinical Nutrition 117, 625–634.
  • Van Havre et al. (2015) Van Havre, Z., White, N., Rousseau, J., and Mengersen, K. (2015). Overfitting Bayesian mixture models with an unknown number of components. PLoS One 10.
  • Whelton et al. (2018) Whelton, P. K., Carey, R. M., Aronow, W. S., Casey, D. E., Collins, K. J., Dennison Himmelfarb, C., et al. (2018). 2017 ACC/AHA/AAPA/ABC/ACPM/AGS/APhA/ASH/ASPC/NMA/PCNA guideline for the prevention, detection, evaluation, and management of high blood pressure in adults: a report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines. Journal of the American College of Cardiology 71, e127–e248.
  • Williams and Savitsky (2020) Williams, M. R. and Savitsky, T. D. (2020). Bayesian estimation under informative sampling with unattenuated dependence. Bayesian Analysis 15, 57–77.
  • Williams and Savitsky (2021) Williams, M. R. and Savitsky, T. D. (2021). Uncertainty estimation for pseudo-Bayesian inference under complex sampling. International Statistical Review 89, 72–107.
  • Zhang et al. (2018) Zhang, F. F., Liu, J., Rehm, C. D., Wilde, P., Mande, J. R., and Mozaffarian, D. (2018). Trends and disparities in diet quality among US adults by Supplemental Nutrition Assistance Program participation status. JAMA Network Open 1, e180237–e180237.