Corrected Correlation Estimates for Meta-Analysis

Alexander Johnson-Vázquez (Institute for Health Metrics and Evaluation, Seattle; Department of Applied Mathematics, University of Washington, Seattle)
Alexander W. Hsu (Institute for Health Metrics and Evaluation, Seattle; Department of Applied Mathematics, University of Washington, Seattle)
Aleksandr Aravkin (Institute for Health Metrics and Evaluation, Seattle; Department of Applied Mathematics, University of Washington, Seattle)
Peng Zheng (Institute for Health Metrics and Evaluation, Seattle)

The authors gratefully acknowledge the Bill & Melinda Gates Foundation.

Abstract

Meta-analysis aggregates estimates and uncertainty across multiple studies, summarizing individual reports into aggregate results that are frequently used to inform health policy and recommendations. When a given study reports multiple estimates, such as log odds ratios (ORs) or log relative risks (RRs) across different exposure groups, accounting for within-study estimate correlations both improves the efficiency of meta-analytic estimates and provides more accurate estimates of uncertainty. The canonical approaches of Greenland and Longnecker (1992) and Hamling et al. (2008) construct pseudo-cases and non-cases for exposure groups to estimate correlations of reported within-study estimates. However, currently available implementations of both methods can fail on simple examples.

We review both GL and Hamling methods through the lens of optimization. For ORs, we provide modifications of each approach that ensure convergence for any feasible inputs. For GL, this is achieved through a new connection to entropic minimization. For Hamling, a modification leads to a provably solvable equivalent set of equations given a specific initialization. For each, we provide implementations guaranteed to work for any feasible input.

For RRs, we show the new GL approach is always guaranteed to succeed. We derive counter-examples where the Hamling approach does not admit any solutions. For the special RR case where the variances are all equal, we derive a necessary and sufficient condition for success.

Keywords: meta-analysis, correlated observations, convex optimization, nonlinear equations

1 Introduction

Meta-analysis combines results reported by multiple studies to obtain aggregate results and estimate between-study heterogeneity (Haidich, 2010). Meta-analytic results inform public health recommendations, underscoring the importance of accuracy in meta-analytic methods (Deeks et al., 2019, Chapter 10). Understanding dose-response relationships across different ranges of exposure poses particular challenges (Orsini et al., 2012; Liu et al., 2009; Crippa et al., 2019; Zheng et al., 2022). Dose-response meta-analysis seeks to quantify the impact of a continuous risk, such as systolic blood pressure (Razo et al., 2022), smoking (Dai et al., 2022), meat (Lescinsky et al., 2022) or vegetables (Stanaway et al., 2022) consumed, on the risk of an outcome, e.g. lung cancer or heart disease, by aggregating available estimates for different exposure groups across many studies.

Two of the most common types of estimates are adjusted odds ratios and relative risks (Schmidt and Kohlmann, 2008). Because these estimates always share a common reference group, the estimates for different exposure levels are correlated. Estimating relationships without correcting for these correlations is inefficient and under-estimates the variance of the resulting coefficients (Greenland and Longnecker, 1992, Appendix (1)). We show the potential impact of the adjustment, as well as a real-world example, in Section 2.

In short, it is crucial for meta-analyses to adjust for within-study correlation. Since we are blind to the adjustment mechanism of reported odds ratios (ORs) and relative risks (RRs), we do not have access to the true underlying covariance matrix between reported estimates. If the adjusted estimates are produced through a regression, then an estimated covariance matrix would be available. However, this estimated covariance matrix is generally not reported. As we have access only to the reported metadata, we must accurately construct this covariance matrix. In their groundbreaking work, Greenland and Longnecker (1992) showed that it is possible to estimate within-study correlations, and use them to approximate the covariance matrix. The GL approach requires the modeler to provide the total number of subjects at each exposure level (both treatment and control), the total number of cases, and adjusted treatment effects at each exposure level, such as log ORs or log RRs. Using this information, the GL approach uses a root-finding algorithm to obtain pseudo-case counts for every exposure that match reported estimates, and then uses the pseudo-counts to estimate asymptotic within-study correlations. These correlations inform downstream analyses, accounting for the impact of a common reference group explicitly before estimating study-specific random effects through mixed-effects modeling.

Following the work of Greenland and Longnecker (1992), Hamling et al. (2008) also use reported estimates to get pseudo-counts of cases versus non-cases. However, Hamling et al. (2008) directly use the standard errors of the reported estimates rather than requiring modelers to obtain subject counts at each exposure level. The Hamling approach requires only two additional pieces of information beyond the estimates and their variances: the ratio of unexposed controls to total exposed controls, and the ratio of all controls to all cases. Hamling et al. (2008) fit pseudo-cell counts to the available data, and given pseudo-cell counts, the correlation estimators are the same as those of GL.

These methods are widely used in the community; for example, the meta-analysis R package dosresmeta (Crippa and Orsini, 2016) implements both correlation estimators in their Covariance function that creates the within-study covariance matrix. Despite the wide use of both methods, past research stopped short of providing guarantees of success given feasible inputs. In fact, both Greenland and Longnecker (1992) and Hamling et al. (2008) discussed numerical instability, citing occasional failures and the need to re-initialize as needed. As originally presented, and as currently implemented in Crippa and Orsini (2016), both methods fail on simple modifications to the input data from working examples.

Here, we fill the current gap, providing robust GL and Hamling methods guaranteed to work for all feasible inputs on the OR problem, including our generated failure modes that can break the current implementation (Crippa and Orsini, 2016). To do this, we study each approach using an optimization perspective. For GL, we show the root-finding problem of Greenland and Longnecker (1992) is equivalent to a convex minimization problem in both the OR and RR settings. Convexity allows us to prove existence and uniqueness of results, and to use disciplined convex programming (DCP) (Boyd and Vandenberghe, 2004) to remove any decisions by the user regarding initialization and to provide state-of-the-art numerical solving techniques. We provide an implementation using cvxpy that is guaranteed to return the unique solution (Diamond and Boyd, 2016; Agrawal et al., 2018). For Hamling, in the case of OR, we develop an equivalent set of nonlinear equations, and prove that these equivalent formulations are always solvable. We provide a Python implementation that, in practice, converges for all inputs. For RR, we show that the Hamling approach may in fact fail, provide a counter-example where there is no solution, and provide a sufficient condition on solvability for reported RRs in the case where reported variances are all equal. Our implementation also covers the RR case but provides an informative warning to the modeler should the model fail to find a root.

Roadmap.

In Section 2, we provide theoretical and empirical motivation for adjusting for within-study correlation, which may be useful for readers new to the topic. We review the work of Greenland and Longnecker (1992) and Hamling et al. (2008) in Section 3. We develop the necessary innovations to robustify each method and provide theoretical guarantees in Sections 4 and 5. Finally, in Section 6, we present numerical illustrations showing our methods provide identical results to those of Greenland and Longnecker (1992) and Hamling et al. (2008) when the original methods converge, and provide correct results for inputs that break currently available implementations. We also present a counter-example in the RR regime that has no solution for the Hamling approach.

2 Motivation for Correlation Correction

Before we review existing methods and introduce our updated techniques for correcting for within-study correlation, we motivate the necessity of such methods. We show that considering differences in means with respect to a reference group always induces a nonzero correlation between reported estimates. Building on this example, we construct a toy simulation that shows the potential impact of failing to account for this correlation. We show a simple example from a real-world study in peripheral artery disease in which we observe high correlation between estimates, leading to a significant difference in the slope of the dose-response relationship between adjusted and unadjusted estimates. Finally, we briefly describe implications of the correlation correction for meta-analysis.

2.1 Theoretical Motivation

Consider measurements ${\left\{x_{1}^{i}\right\}}_{i=1}^{n_{1}}$ , ${\left\{x_{2}^{j}\right\}}_{j=1}^{n_{2}}$ from two different treatment groups and measurements from a reference group, ${\left\{x_{0}^{l}\right\}}_{l=1}^{n_{0}}$ , where $n_{k}$ is the number of samples in group $k\in\{0,1,2\}$ . Here, we assume that each $x_{k}^{i}$ is independently distributed according to a Gaussian distribution distinct for each $k$ , i.e.,

x_{k}^{i}\overset{\mathrm{iid}}{\sim}\mathcal{N}(\mu_{k},\sigma_{k}^{2})

for nonzero $\mu_{k},\sigma_{k}$ . Without loss of generality, assume $\hat{\mu}_{2}>\hat{\mu}_{0}$ and $\hat{\mu}_{1}>\hat{\mu}_{0}$ .

We define the empirical mean estimator as

\hat{\mu}_{k}=\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}x_{k}^{i}

and seek to estimate the difference in means between the treatment groups and reference group, constructing the estimators $\hat{\eta}_{1}=(\hat{\mu}_{1}-\hat{\mu}_{0})$ and $\hat{\eta}_{2}=(\hat{\mu}_{2}-\hat{\mu}_{0})$ . The reference group induces a positive correlation between these estimators, as shown below:

\begin{aligned}
\mathrm{Cov}(\hat{\eta}_{1},\hat{\eta}_{2})&=\mathbb{E}{\left[\hat{\eta}_{1}\hat{\eta}_{2}\right]}-\mathbb{E}{\left[\hat{\eta}_{1}\right]}\mathbb{E}{\left[\hat{\eta}_{2}\right]}\\
&=\mathbb{E}{\left[(\hat{\mu}_{1}-\hat{\mu}_{0})(\hat{\mu}_{2}-\hat{\mu}_{0})\right]}-\mathbb{E}{\left[(\hat{\mu}_{1}-\hat{\mu}_{0})\right]}\mathbb{E}{\left[(\hat{\mu}_{2}-\hat{\mu}_{0})\right]}\\
&=\mathbb{E}{\left[\hat{\mu}_{1}\hat{\mu}_{2}-\hat{\mu}_{1}\hat{\mu}_{0}-\hat{\mu}_{2}\hat{\mu}_{0}+\hat{\mu}_{0}^{2}\right]}-(\mu_{1}-\mu_{0})(\mu_{2}-\mu_{0})\\
&=\mathbb{E}{\left[\hat{\mu}_{1}\hat{\mu}_{2}\right]}-\mathbb{E}{\left[\hat{\mu}_{1}\hat{\mu}_{0}\right]}-\mathbb{E}{\left[\hat{\mu}_{2}\hat{\mu}_{0}\right]}+\mathbb{E}{\left[\hat{\mu}_{0}^{2}\right]}-{\left(\mu_{1}\mu_{2}-\mu_{1}\mu_{0}-\mu_{2}\mu_{0}+\mu_{0}^{2}\right)}\\
&=\mathbb{E}{\left[\hat{\mu}_{0}^{2}\right]}-\mu_{0}^{2}\\
&=\sigma_{0}^{2}/n_{0}.
\end{aligned}

The correlation is driven by the variance of the mean of the reference group. By independence, the variance of the estimators themselves is given by

\mathbb{V}[\hat{\eta}_{1}]=\frac{\sigma_{0}^{2}}{n_{0}}+\frac{\sigma_{1}^{2}}{n_{1}},\quad\mathbb{V}[\hat{\eta}_{2}]=\frac{\sigma_{0}^{2}}{n_{0}}+\frac{\sigma_{2}^{2}}{n_{2}}

where $\mathbb{V}$ is the variance operator. Thus, the smaller the reference group, and the larger its intrinsic variance, the larger the induced correlation between $\hat{\eta}_{1}$ and $\hat{\eta}_{2}$ .
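The covariance identity above can be checked numerically. The following is a minimal Monte Carlo sketch; the group sizes, means, and standard deviations are hypothetical values chosen only for illustration, and `simulate_eta` is a helper name introduced here rather than part of any referenced software.

```python
import numpy as np

def simulate_eta(n0, n1, n2, mu, sigma, reps, seed=0):
    """Draw `reps` independent replicates of (eta1_hat, eta2_hat)."""
    rng = np.random.default_rng(seed)
    # Each row is one replicate of a group's raw data; average to get mu_hat_k.
    m0 = rng.normal(mu[0], sigma[0], size=(reps, n0)).mean(axis=1)
    m1 = rng.normal(mu[1], sigma[1], size=(reps, n1)).mean(axis=1)
    m2 = rng.normal(mu[2], sigma[2], size=(reps, n2)).mean(axis=1)
    # Both differences share the same reference mean m0.
    return m1 - m0, m2 - m0

# Hypothetical settings (assumptions, not taken from the text).
n0, n1, n2 = 5, 20, 20
mu, sigma = (1.0, 2.0, 3.0), (2.0, 1.0, 1.5)
eta1, eta2 = simulate_eta(n0, n1, n2, mu, sigma, reps=100_000)

empirical_cov = np.cov(eta1, eta2)[0, 1]
theoretical_cov = sigma[0] ** 2 / n0                       # sigma_0^2 / n_0
empirical_var1 = eta1.var(ddof=1)
theoretical_var1 = sigma[0] ** 2 / n0 + sigma[1] ** 2 / n1  # V[eta1_hat]

print("covariance:", empirical_cov, "vs", theoretical_cov)
print("variance of eta1:", empirical_var1, "vs", theoretical_var1)
```

Shrinking `n0` or increasing `sigma[0]` in this sketch increases both printed covariances, in line with the remark above about small, noisy reference groups.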

Using this data and assuming there is a true, linear effect $\beta$ across groups, we may seek to estimate $\beta$ through least-squares regression. The two methods we consider are generalized least squares (GLS) and ordinary least squares (OLS). We set $X$ to be the appended vector $X={\left({\left\{x_{0}^{l}\right\}}_{l=1}^{n_{0}},{\left\{x_{1}^{i}\right\}}_{i=1}^{n_{1}},{\left\{x_{2}^{k}\right\}}_{k=1}^{n_{2}}\right)}^{\top}$ and also set $\hat{\eta}={\left(\hat{\eta}_{1},\hat{\eta}_{2}\right)}$ . Thus, we may construct the estimator $\hat{\beta}_{\mathrm{cor}}$ for $\beta$ as

\hat{\beta}_{\mathrm{cor}}={\left(X^{\top}C^{-1}X\right)}^{-1}X^{\top}C^{-1}\hat{\eta},

which is the GLS estimate and accounts for the correlation between $\hat{\eta}_{1},\hat{\eta}_{2}$ , which we know must exist (Kariya and Kurata, 2004). Here, $C$ is the covariance matrix of the estimates $\hat{\eta}_{1},\hat{\eta}_{2}$ with entries defined as above. Similarly, we construct the estimate $\hat{\beta}_{\mathrm{OLS}}$ of $\beta$ to be

\hat{\beta}_{\mathrm{OLS}}={\left(X^{\top}X\right)}^{-1}X^{\top}\hat{\eta}.

Note that this OLS estimator does not account for correlation and amounts to assuming the independence of $\hat{\eta}_{1},\hat{\eta}_{2}$ .

In evaluating these two estimators, it is easy to show that both $\hat{\beta}_{\mathrm{cor}},\hat{\beta}_{\mathrm{OLS}}$ are unbiased. By construction, we have

\begin{aligned}
\mathbb{V}{\left[\hat{\beta}_{\mathrm{cor}}\right]}&={\left(X^{\top}C^{-1}X\right)}^{-1}\\
\mathbb{V}{\left[\hat{\beta}_{\mathrm{OLS}}\right]}&={\left(X^{\top}X\right)}^{-1}X^{\top}CX{\left(X^{\top}X\right)}^{-1}
\end{aligned}

as the variance estimators. From the generalized Gauss-Markov theorem (Kariya and Kurata, 2004), it follows that $\hat{\beta}_{\mathrm{cor}}$ is optimal among all linear, unbiased estimators and asymptotically efficient. In particular, $\mathbb{V}{\left[\hat{\beta}_{\mathrm{cor}}\right]}\leq\mathbb{V}{\left[\hat{\beta}_{\mathrm{OLS}}\right]}$ , with strict inequality whenever $C$ is not diagonal. This explains the advantage of GLS estimation for problems of this class. A theme of the present work is that the covariance matrix $C$ is not always known, which illustrates the necessity of developing good approximations to $C$ so that downstream estimators remain precise.
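The two variance formulas can be compared directly for a small design. The sketch below is illustrative only: it assumes a single-column design matrix holding hypothetical exposure levels (regression through the reference point), a reference group with variance term $\sigma_{0}^{2}/n_{0}=1$, and four exposure groups each contributing $0.1$ to the diagonal of $C$. The function names `gls_variance` and `ols_variance` are introduced here for illustration.

```python
import numpy as np

def gls_variance(X, C):
    """Variance of the GLS estimator: (X^T C^{-1} X)^{-1}."""
    C_inv_X = np.linalg.solve(C, X)          # C^{-1} X without forming C^{-1}
    return np.linalg.inv(X.T @ C_inv_X)

def ols_variance(X, C):
    """Sandwich variance of OLS when the true covariance is C."""
    XtX_inv = np.linalg.inv(X.T @ X)
    return XtX_inv @ X.T @ C @ X @ XtX_inv

# Hypothetical exposure levels and variance components (assumptions).
x = np.array([1.0, 2.0, 3.0, 4.0])
X = x.reshape(-1, 1)
shared = 1.0                                  # sigma_0^2 / n_0, common to all entries
own = np.full(4, 0.1)                         # sigma_k^2 / n_k for each exposure group
C = np.diag(own) + shared * np.ones((4, 4))   # diagonal: own + shared; off-diagonal: shared

v_gls = gls_variance(X, C)[0, 0]
v_ols = ols_variance(X, C)[0, 0]
print("GLS variance:", v_gls)
print("OLS variance:", v_ols)
```

Under these assumed values the GLS variance is the smaller of the two, consistent with the generalized Gauss-Markov statement above.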

In the next section we illustrate the impact of this correlation on the efficiency of the estimator for the overall relationship computed from multiple reported estimates. In the context of meta regression, we would often consider both of the exposure effect estimates $\hat{\eta}_{1}$ , $\hat{\eta}_{2}$ in conjunction with data from other studies to estimate effects as a function of exposure level.

2.2 Numerical Illustration

Figure 1: The empirical distribution of the $\hat{\beta}$ estimate in a numerical simulation when estimated using GLS with a weighting covariance matrix (blue; $\hat{\beta}_{\mathrm{cor}}$ ) versus when estimated using OLS, assuming no correlation (orange; $\hat{\beta}_{\mathrm{OLS}}$ ). The true value of $\beta=1$ prescribed before the simulation is given by the vertical line.

Here, we create a simplified simulation to show the impact of adjusting for the correlation between the mean estimator levels $\hat{\eta}_{i}$ . In our simulation, using the notation from above, we have $n_{0}=1$ with 4 exposure levels and $n_{1}=\dots=n_{4}=10$ . We prescribe a true value of $\beta=1$ and seek to estimate this population value according to the data.

After assigning all the initial count data, we construct our estimates $\hat{\eta}_{1},\dots,\hat{\eta}_{4}$ . Using the exposure levels as the standard exogenous variable in the regression, we then have all the relevant data. We compare the estimates $\hat{\beta}_{\mathrm{cor}}$ and $\hat{\beta}_{\mathrm{OLS}}$ to $\beta$ using the GLS and OLS formulations as constructed above, with the relevant dimensional differences applied. The results from 5,000 realizations are shown in Figure 1. Both estimators are unbiased, but $\hat{\beta}_{\mathrm{cor}}$ has a much smaller variance than $\hat{\beta}_{\mathrm{OLS}}$ .

The simulation is relevant to the situations considered by Greenland and Longnecker (1992) and Hamling et al. (2008), since $\log$ OR and RR estimates are created with respect to the same reference group per study. The main difference is that we have explicit access to the full covariance between our reported estimators, while in meta-analytic settings the correlation is hidden and must be inferred; this is the core problem that correlation correction estimators seek to solve.

We end the section with a real-world example where the adjustment makes a big difference to summarizing the study.

2.3 Real example: blood pressure and peripheral artery disease.

We provide a brief example of a real-world study where the correlation-adjusted estimates are significantly different from the estimates obtained when independence of the estimates is assumed. Itoga et al. (2018) study the impact of blood pressure on peripheral artery disease (PAD), reporting results by subgroups of exposure. We assume a linear relationship between SBP and relative risk of PAD, and visualize the weighted least squares (WLS) and correlation-corrected GLS regressions in Figure 2. In the meta-analytic setting, studies typically report standard errors. We compare to a naive WLS estimator $\hat{\beta}_{\mathrm{WLS}}$ where residuals are weighted by the inverse of the reported variances. Mathematically, setting $V$ to be the diagonal matrix whose diagonal elements are the reported variances for each exposure level, we have the following estimator:

\hat{\beta}_{\mathrm{WLS}}={\left(X^{\top}V^{-1}X\right)}^{-1}X^{\top}V^{-1}\hat{\eta},

where, in this case, $\hat{\eta}$ is the vector of log ORs for each exposure. Using the metadata, we do not have access to the true covariance matrix $C$ ; we estimate the covariance matrix used in GLS by the method of Greenland and Longnecker (1992).

The $x$ -axis shows SBP, while the $y$ -axis gives the log relative risk. The blue dots show reported adjusted odds ratios, plotted at the mid-points of the exposure groups reported by the paper.

The solid lines correspond to $\hat{\beta}_{\mathrm{cor}}$ (blue) and $\hat{\beta}_{\mathrm{WLS}}$ (olive) in Figure 2, which are required to pass through the red ‘origin’ point corresponding to the reference group with midpoint at SBP. The WLS estimate $\hat{\beta}_{\mathrm{WLS}}$ appears to fit the data more closely than $\hat{\beta}_{\mathrm{cor}}$ . However, the line produced by $\hat{\beta}_{\mathrm{cor}}$ provides a better estimate for the slope. We can think of it as ‘adjusting’ for the fact that variance in the reference group results propagates to all non-reference points, shifting them up and down together. We illustrate this by including the dashed line in Figure 2. This is the correlation-corrected estimate shifted to the non-reference data; it should capture the trend of the data more accurately than the WLS estimate.

In the case of SBP vs. PAD, adjusting for within-study correlation would give a higher estimate of overall risk for that study. While it is impossible to know ‘truth’ for any given study, the simulation in Figure 1 serves as a reminder that although both the WLS and GLS estimates are unbiased, the WLS has much higher variance.

We proceed to consider the case of meta-analysis where multiple studies are observed and discuss implications of correcting for correlation in that setting.

2.4 Implications for Meta-Analysis

A general description of likelihood formulations for meta-analysis is developed by Zheng et al. (2021). Taking the simplest example, consider the statistical model for aggregating multiple reported result vectors $\eta_{i}$ with specific effects for study $i$ :

\hat{\eta}_{i}=X_{i}\beta+\mathbf{1}u_{i}+\epsilon_{i},

where $\epsilon_{i}\sim\mathcal{N}(0,V_{i})$ describes the observation errors for study $i$ , while $u_{i}$ is a scalar realization of a random effect distributed as $\mathcal{N}(0,\gamma)$ where $\gamma$ represents between-study heterogeneity. The $\epsilon_{i}$ and $u_{i}$ are independent across $i$ , and also from each other. This model applies the realization of the study-specific effect to all observations from study $i$ , hence the vector $\mathbf{1}$ that copies $u_{i}$ to every element of $\hat{\eta}_{i}$ . The variance of the error term is given by:

\mathbb{V}[\mathbf{1}u_{i}+\epsilon_{i}]=V_{i}+\gamma\mathbf{1}\mathbf{1}^{T}.

The maximum likelihood estimate for $\beta$ and $\gamma$ is then given by solving

\min_{\beta,\gamma}\sum_{i}\left[\frac{1}{2}(X_{i}\beta-\hat{\eta}_{i})^{T}\left(V_{i}+\gamma\mathbf{1}\mathbf{1}^{T}\right)^{-1}(X_{i}\beta-\hat{\eta}_{i})+\frac{1}{2}\log|V_{i}+\gamma\mathbf{1}\mathbf{1}^{T}|\right]

From the likelihood expression, we can observe that meta-analysis effectively quantifies the extent to which the reported variances $V_{i}$ do not represent the inherent variance in the data, and adjusts by augmenting them with the between-study heterogeneity variance $\gamma$ . Using only the reported variances corresponds to assuming that each reported $V_{i}$ is diagonal, so any correlation is left to meta-analysis to discover, with a single parameter $\gamma$ . In fact, as shown in the previous sections, within-study correlations are induced by the shared reference group, and the extent to which this happens can vary by study (for example, a study with a very large reference group will have less correlation than a study with a small reference group). As a result, providing correlated $V_{i}$ will leave $\gamma$ to capture the variances of the unknown study-specific effects, exactly as intended, rather than trying to capture all the residual correlations.
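To make the role of $V_{i}$ concrete, the following sketch evaluates the Gaussian negative log-likelihood of the mixed-effects model, up to an additive constant, using the inverse of the marginal covariance $V_{i}+\gamma\mathbf{1}\mathbf{1}^{T}$. The function names and the tiny one-study data set are hypothetical and serve only as illustration; a full (non-diagonal) $V_{i}$ can be passed in exactly the same way as a diagonal one.

```python
import numpy as np

def study_nll(beta, gamma, X_i, eta_i, V_i):
    """Negative log-likelihood (up to a constant) contributed by one study."""
    k = len(eta_i)
    S = V_i + gamma * np.ones((k, k))        # marginal covariance V_i + gamma 1 1^T
    r = X_i @ beta - eta_i                    # residual vector for this study
    _, logdet = np.linalg.slogdet(S)          # stable log-determinant
    return 0.5 * r @ np.linalg.solve(S, r) + 0.5 * logdet

def total_nll(beta, gamma, studies):
    """Sum of per-study contributions; `studies` is a list of (X_i, eta_i, V_i)."""
    return sum(study_nll(beta, gamma, X, eta, V) for X, eta, V in studies)

# Hypothetical single study with two exposure levels (assumed values).
X = np.array([[1.0], [2.0]])
eta = np.array([1.0, 2.0])
V = np.eye(2)
studies = [(X, eta, V)]

nll_gamma0 = total_nll(np.array([1.0]), 0.0, studies)
nll_gamma1 = total_nll(np.array([1.0]), 1.0, studies)
print(nll_gamma0, nll_gamma1)
```

In practice this objective would be passed to a numerical optimizer over $(\beta,\gamma)$ with $\gamma\geq 0$; that step is omitted here.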

With the motivation established, we proceed in the next section to review the methods that are used in practice to estimate the correlation used for the correction. For the remainder of the paper, we focus on the reliability and accuracy of the correlation correction methods.

3 Methods of GL and Hamling

In this section, we present the approaches of GL and Hamling. In this review section, we focus on log ORs to vastly simplify presentation; however our robust methods in Sections 4 and 5 cover both log ORs and log RRs. Special challenges and counter-examples for the Hamling approach in the RR case are also presented in Section 5.

We start by defining key variables following original notation, see Table 1.

Table 1: Notation and method requirements table.

Variable	Dimension	Definition	Used by
$n$	$1$	number of alternative exposure levels	-
$x$	$n$	alternative exposure levels	-
$N$	$n+1$	total subjects at all exposures	GL
$M_{1}$	$1$	total cases	GL
$L$	$n$	estimates of log-odds	GL, H
$V$	$n$	reported variances for log-odds	H
$R$	$n$	estimates of log-risks	GL, H
$V^{R}$	$n$	reported variances for log-risks	H
$A$	$n$	cases for alternative exposures	-
$a_{0}$	$1$	cases for reference exposure	-
$B$	$n$	non-cases for alternative exposures	-
$b_{0}$	$1$	non-cases for reference exposure	-
$p$	$1$	ratio of unexposed controls to total controls	H
$z$	$1$	ratio of total controls to total cases	H

$M_{1}$ is the sum of all elements of $A$ and $a_{0}$ . For both GL and Hamling, the goal is to estimate $A,a_{0},B$ , and $b_{0}$ . Following Greenland and Longnecker (1992) and Hamling et al. (2008), we refer to the first element in the vector $N$ as $n_{0}$ and the remaining elements as $N_{+}$ . We always have that $A+B=N_{+}$ and $a_{0}+b_{0}=n_{0}$ . We also include the data requirements by each study. More details are given in the following sections for how the data are used. Here, H is shorthand for the Hamling method. With notation established, we summarize the main goal of the GL and Hamling methods.

3.1 Correlation and Covariance

The main goal of both GL and Hamling methods is to obtain a variance-covariance matrix, replacing a diagonal matrix of reported variances with an updated variance-covariance matrix with the same variances and estimated correlations. In particular, both methods estimate the correlation for two log ORs at two different exposures $x_{i}$ and $x_{j}$ by

r_{x_{i},x_{j}}=\frac{1/a_{0}+1/b_{0}}{\sqrt{1/a_{0}+1/b_{0}+1/A_{i}+1/B_{i}}\sqrt{1/a_{0}+1/b_{0}+1/A_{j}+1/B_{j}}}

(1)

where $B_{i}$ represent controls, and the correlation for two log RRs at these exposures by

r_{x_{i},x_{j}}=\frac{1/a_{0}-1/b_{0}}{\sqrt{1/a_{0}-1/b_{0}+1/A_{i}-1/B_{i}}\sqrt{1/a_{0}-1/b_{0}+1/A_{j}-1/B_{j}}}

(2)

where $B_{i}$ represent totals. The final variance-covariance matrix is obtained by appropriately scaling these correlations using the reported variances. There is a degree of freedom in the pseudo-counts that factors out of the correlation formulas: all pseudo-counts can be multiplied by a constant value and the correlations would not change in either the OR or the RR case.
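As an illustration of formulas (1) and (2), the sketch below builds the correlation matrix from a set of pseudo-counts and rescales it by reported variances. The function names and the numeric pseudo-counts are hypothetical; the diagonal is set to one explicitly, since (1) and (2) describe the off-diagonal entries.

```python
import numpy as np

def or_correlation(a0, b0, A, B):
    """Correlation matrix of log ORs from pseudo-counts, formula (1)."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    c0 = 1.0 / a0 + 1.0 / b0
    s = np.sqrt(c0 + 1.0 / A + 1.0 / B)
    R = c0 / np.outer(s, s)
    np.fill_diagonal(R, 1.0)
    return R

def rr_correlation(a0, b0, A, B):
    """Correlation matrix of log RRs, formula (2); here b0 and B are group totals."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    c0 = 1.0 / a0 - 1.0 / b0
    s = np.sqrt(c0 + 1.0 / A - 1.0 / B)
    R = c0 / np.outer(s, s)
    np.fill_diagonal(R, 1.0)
    return R

def covariance_from_correlation(R, V):
    """Scale correlations by reported variances: C_ij = r_ij * sqrt(V_i V_j)."""
    sd = np.sqrt(np.asarray(V, float))
    return R * np.outer(sd, sd)

# Hypothetical pseudo-counts and reported variances (assumed values).
a0, b0 = 10.0, 90.0
A, B = np.array([20.0, 30.0]), np.array([80.0, 70.0])
V = np.array([0.20, 0.18])

R = or_correlation(a0, b0, A, B)
R_scaled = or_correlation(7 * a0, 7 * b0, 7 * A, 7 * B)   # all counts multiplied by 7
C = covariance_from_correlation(R, V)
print(R)
print(C)
```

Comparing `R` with `R_scaled` shows the scale invariance noted above: multiplying every pseudo-count by the same constant leaves the correlations unchanged.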

Finally it may help to alert the reader to the key difference between the Hamling and GL approaches by observing that by construction of the Hamling approach, the pseudo-counts successfully obtained by that method (for either RRs or ORs) satisfy

r_{x_{i},x_{j}}=\frac{1/a_{0}+1/b_{0}}{\sqrt{V_{i}V_{j}}}

where $V_{i}$ , $V_{j}$ are the variances reported for the estimates. This equality need not hold for the pseudo-counts inferred by the GL approach, which uses group counts in place of reported variances. This difference is discussed explicitly in the following sections.

3.2 GL Newton Method

Algorithm 1 Greenland and Longnecker Algorithm

Input: $M_{1},N,L$ . Initialize $A$ .
1: $\mathrm{difference}\leftarrow 1$
2: while $\mathrm{difference}\geq 10^{-4}$ do
3: $A_{+}\leftarrow\mathrm{sum}(A)$
4: $a_{0}\leftarrow M_{1}-A_{+}$
5: $b_{0}\leftarrow n_{0}-a_{0}$
6: $B\leftarrow N_{+}-A$
7: $c_{0}\leftarrow\frac{1}{a_{0}}+\frac{1}{b_{0}}$
8: $c\leftarrow\frac{1}{A}+\frac{1}{B}$ {Element-wise inverse}
9: $e\leftarrow L+\log(a_{0})+\log(B)-\log(A)-\log(b_{0})$ {Element-wise $\log$ }
10: $H\leftarrow$ matrix of size $n\times n$ whose diagonal elements are $c+c_{0}$ and whose off-diagonal elements are $c_{0}$
11: $A\leftarrow A+H^{-1}e$
12: $\mathrm{difference}\leftarrow{\left\|H^{-1}e\right\|}_{2}$
13: end while

The GL approach uses reported estimates, total counts, and the total number of cases to find pseudo-counts in each category to match reported log-OR or log-RR estimates using an iterative root-finding method given in Algorithm 1. Indeed, Algorithm 1 is exactly Newton’s method for root-finding, applied to find pseudo-counts such that plug-in estimates from the pseudo-counts match those of the adjusted estimates reported in the original study. Once $A,B,a_{0},b_{0}$ are found, Greenland and Longnecker (1992) use these values to calculate the correlation coefficient $r_{ij}$ on log OR estimates $L_{i}$ and $L_{j}$ using (1), as well as covariances

C_{ij}=r_{ij}{\left(V_{i}V_{j}\right)}^{1/2}.

For an arbitrary multi-variable function $f:\mathbb{R}^{n}\to\mathbb{R}^{n}$ , the Newton iteration is given by

x_{k+1}=x_{k}-{\left[J_{f}(x_{k})\right]}^{-1}f(x_{k})

(3)

where $J_{f}$ is defined to be the Jacobian matrix of $f$ , comprising partial derivatives (Gautschi, 1997). Newton’s method is locally convergent, meaning that when the initial iterate $x_{0}$ is “close enough” to a root, (3) will eventually find it; however, getting close enough can be tricky (Süli and Mayers, 2003). Global convergence refers to the ability of the algorithm to converge regardless of initialization. Greenland and Longnecker (1992) do not prove global convergence guarantees; in fact, as given in Greenland and Longnecker (1992) and summarized in Algorithm 1, the method can break depending on initialization.

The function $g:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$ whose zero we are searching for appears in line 11 of Algorithm 1 and is given by

g(A)=-L-\log(a_{0}(A))\mathbf{1}-\log(B(A))+\log(A)+\log(b_{0}(A))\mathbf{1}.

(4)

where $\mathbf{1}\in\mathbb{R}^{n}$ is the vector of ones of the right dimension, copying the values of the scalar quantity to all coordinates. By construction, $a_{0},B$ , and $b_{0}$ are all functions of $A$ . The Jacobian matrix $H$ contains all the partial derivatives of $g(A)$ and is computed in Algorithm 1. Greenland and Longnecker (1992) suggest using crude estimates to initialize $A$ if available, and otherwise using the null expected value: $M_{1}\frac{N_{+}}{\mathrm{sum}(N)}$ . A priori, convergence is not guaranteed. In Section 6, we explore failure modes of existing implementations.
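For concreteness, Algorithm 1 for log ORs can be sketched in a few lines of NumPy. This is an illustrative transcription rather than the reference implementation of any package: the function name `gl_newton`, the iteration cap, and the explicit feasibility check are additions made here, and the example counts are hypothetical. The default initialization is the null expected value described above.

```python
import numpy as np

def gl_newton(M1, N, L, A_init=None, tol=1e-4, max_iter=100):
    """Newton iteration of Algorithm 1 (log ORs); returns A, B, a0, b0."""
    N, L = np.asarray(N, float), np.asarray(L, float)
    n0, N_plus = N[0], N[1:]
    # Null expected value M1 * N_+ / sum(N) unless crude estimates are supplied.
    A = M1 * N_plus / N.sum() if A_init is None else np.array(A_init, float)
    for _ in range(max_iter):
        a0 = M1 - A.sum()
        b0 = n0 - a0
        B = N_plus - A
        # Plain Newton can step outside the region where all counts are positive.
        if min(a0, b0, A.min(), B.min()) <= 0:
            raise FloatingPointError("Newton iterate left the feasible region")
        c0 = 1.0 / a0 + 1.0 / b0
        c = 1.0 / A + 1.0 / B
        e = L + np.log(a0) + np.log(B) - np.log(A) - np.log(b0)
        H = np.diag(c) + c0            # diagonal c + c0, off-diagonal c0
        step = np.linalg.solve(H, e)   # H^{-1} e
        A = A + step
        if np.linalg.norm(step) < tol:
            a0 = M1 - A.sum()
            return A, N_plus - A, a0, n0 - a0
    raise RuntimeError("Newton iteration did not converge")

# Hypothetical study: counts chosen so the exact solution is known.
true_a0, true_b0 = 20.0, 80.0
true_A, true_B = np.array([30.0, 45.0]), np.array([70.0, 55.0])
L = np.log(true_A * true_b0 / (true_B * true_a0))   # log ORs implied by the counts
N = np.array([100.0, 100.0, 100.0])
M1 = true_a0 + true_A.sum()

A_hat, B_hat, a0_hat, b0_hat = gl_newton(M1, N, L)
print(A_hat, B_hat, a0_hat, b0_hat)
```

The feasibility check makes the failure mode explicit: when an update drives any pseudo-count to zero or below, the logarithms in line 9 are undefined and the plain iteration cannot continue.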

We show in Section 4 that the function $g$ in (4) is the gradient of a convex function and recast the root-finding problem for $g$ as a convex optimization problem, which allows us to robustly compute the GL estimator. This leads to a variety of algorithms with global convergence guarantees, and more simply, to a DCP approach that does not need user-specified initialization and leverages the state-of-the-art open source optimization software cvxpy (Diamond and Boyd, 2016); we make this available to the community.

3.3 Hamling Method

Hamling et al. (2008) extended the work of Greenland and Longnecker (1992), and also construct pseudo-counts $A,B,b_{0},$ and $a_{0}$ using an iterative root-finding method. Once the pseudo-counts are obtained, the correlations across treatment effect exposures and the overall covariance matrix are calculated exactly as in Greenland and Longnecker (1992). The main difference is that Hamling et al. (2008) require only the estimates and their variances, along with $p$ and $z$ from Table 1, discussed in detail below.

The two pieces of information that Hamling requires in addition to estimates and variances are $p$ and $z$ , which correspond to the ratio of unexposed controls to total number of controls, and the ratio of total number of controls to total number of cases, respectively. These quantities can be obtained by using crude reported estimates from the study, or from another pathway (e.g. literature) if the study did not report the quantities.

Hamling et al. (2008, Appendix A) solve for $A,B,p^{\prime},z^{\prime}$ in terms of $a_{0}$ and $b_{0}$ :

A_{i}=\frac{1+\frac{a_{0}L_{i}}{b_{0}}}{V_{i}-\frac{1}{a_{0}}-\frac{1}{b_{0}}},\quad B_{i}=\frac{1+\frac{b_{0}}{a_{0}L_{i}}}{V_{i}-\frac{1}{a_{0}}-\frac{1}{b_{0}}},\quad p^{\prime}=\frac{b_{0}}{\sum_{i=1}^{n}B_{i}},\quad z^{\prime}=\frac{\sum_{i=1}^{n}B_{i}}{\sum_{i=1}^{n}A_{i}}.

(5)

The quantities $p^{\prime}$ and $z^{\prime}$ are functions of $(a_{0},b_{0})$ , and the main idea of the Hamling method is to match $p^{\prime}$ , $z^{\prime}$ to the $p,z$ values provided by the study, minimizing the squared differences:

{\left(\frac{p-p^{\prime}}{p}\right)}^{2}+{\left(\frac{z-z^{\prime}}{z}\right)}^{2}

(6)

The iteration of Hamling, summarized in Algorithm 2, updates $a_{0}$ and $b_{0}$ through the equations (5). Once $(a_{0},b_{0})$ are found, equations (5) yield all needed pseudo-counts. It is not obvious from Hamling et al. (2008), but a consequence of our work here is that (6) can always be brought to $0$ for all feasible inputs.

Algorithm 2 Hamling Algorithm

Input: $p,z,L,V$ . Initialize $a_{0},b_{0}$ .
1: $\mathrm{error}\leftarrow 1.0$
2: while $\mathrm{error}\geq 10^{-4}$ do
3: $A_{i}(a_{0},b_{0})\leftarrow\left(1+\frac{a_{0}L_{i}}{b_{0}}\right)/\left(V_{i}-\frac{1}{a_{0}}-\frac{1}{b_{0}}\right)$
4: $B_{i}(a_{0},b_{0})\leftarrow\left(1+\frac{b_{0}}{a_{0}L_{i}}\right)/\left(V_{i}-\frac{1}{a_{0}}-\frac{1}{b_{0}}\right)$
5: $p^{\prime}(a_{0},b_{0})\leftarrow b_{0}/\left(\sum_{i=1}^{n}B_{i}(a_{0},b_{0})\right)$
6: $z^{\prime}(a_{0},b_{0})\leftarrow\left(\sum_{i=1}^{n}B_{i}(a_{0},b_{0})\right)/\left(\sum_{i=1}^{n}A_{i}(a_{0},b_{0})\right)$
7: $\mathrm{error}\leftarrow\left(\frac{p-p^{\prime}}{p}\right)^{2}+\left(\frac{z-z^{\prime}}{z}\right)^{2}$
8: $a_{0},b_{0}\leftarrow$ Update {Black Box Optimization routine to shrink error, e.g. Excel or Stata}
9: end while

Hamling et al. (2008) suggest using the Excel Solver as a black-box optimizer. However, the solution method turns out to be less important than the choice of equations and their initialization. In Section 5, we show that a modified but equivalent system of nonlinear equations for ORs always has a solution. In contrast, the original formulation has no such guarantee, and Hamling et al. (2008) discuss the need to use different starting points to ensure convergence in specific instances. In Section 6, we give specific, simple examples where the method as given in Algorithm 2 fails to converge to a solution for $a_{0}$ and $b_{0}$ (returning negative counts $A$ or $B$ ), while the method of Section 5 succeeds. In the RR case, we show that the Hamling approach can fail catastrophically, which we discuss in detail in Section 5.

4 Convex Optimization Formulation of GL

In this section, we develop a robust GL approach by establishing that $g(A)$ used in the root-finding Newton method of Algorithm (1) is the gradient of a convex function. We show that the convex model of interest is a sum of entropic distance functions for both log ORs and log-RRs. We begin with log ORs.

4.1 GL: Odds Ratios

Recall the function $g(A)$ that is the focus of the Newton’s root finding method proposed by Greenland and Longnecker (1992):

g(A)=-L-\log(a_{0}(A))\mathbf{1}-\log(B(A))+\log(A)+\log(b_{0}(A))\mathbf{1}.

We can find the integral $G$ of $g$ and obtain an objective that corresponds to this gradient:

\displaystyle\begin{split}G(A)=-L^{\top}A&+{\left(a_{0}(A)\log(a_{0}(A))-a_{0}(A)\right)}+\sum_{i=1}^{n}{\left(B_{i}(A)\log(B_{i}(A))-B_{i}(A)\right)}\\ &+\sum_{i=1}^{n}{\left(A_{i}\log(A_{i})-A_{i}\right)}+{\left(b_{0}(A)\log(b_{0}(A))-b_{0}(A)\right)}.\end{split}

(7)

From here, note that we may equivalently solve for the optimal $A$ by minimizing the integrated $G$ . This is equivalent to finding roots of $g$ as GL does, since $\nabla G(A)=g(A)$ and a function is at an optimal value precisely when its gradient is zero.
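The identity $\nabla G(A)=g(A)$ can be checked numerically. The sketch below compares $g$ with a central finite-difference gradient of $G$ in Python. The cell relations $a_{0}=M_{1}-1^{\top}A$, $b_{0}=n_{0}-a_{0}$ and $B=N-A$ are assumptions about the layout in Table 1, and the input values are hypothetical, chosen only so that every cell is positive.

```python
import math


def cells(A, N, M1, n0):
    # Assumed table layout: a0 = M1 - sum(A), b0 = n0 - a0, B_i = N_i - A_i.
    a0 = M1 - sum(A)
    return a0, n0 - a0, [n - a for n, a in zip(N, A)]


def ent(x):
    # Entropic term x*log(x) - x, whose derivative is log(x).
    return x * math.log(x) - x


def G(A, L, N, M1, n0):
    a0, b0, B = cells(A, N, M1, n0)
    linear = -sum(l * a for l, a in zip(L, A))
    return linear + ent(a0) + ent(b0) + sum(map(ent, A)) + sum(map(ent, B))


def g(A, L, N, M1, n0):
    a0, b0, B = cells(A, N, M1, n0)
    return [-l - math.log(a0) - math.log(b) + math.log(a) + math.log(b0)
            for l, a, b in zip(L, A, B)]


# Hypothetical inputs.
L = [0.2, -0.1, 0.4]
N = [100.0, 120.0, 90.0]
M1, n0 = 200.0, 150.0
A = [50.0, 60.0, 45.0]

step = 1e-5
finite_diff = []
for i in range(len(A)):
    up, down = list(A), list(A)
    up[i] += step
    down[i] -= step
    finite_diff.append((G(up, L, N, M1, n0) - G(down, L, N, M1, n0)) / (2 * step))

max_gap = max(abs(x - y) for x, y in zip(g(A, L, N, M1, n0), finite_diff))
```

The quantity `max_gap` is the largest discrepancy between the analytic and numerical gradients; it should be at the level of finite-difference error.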

Recall that a convex function $G$ satisfies (Boyd and Vandenberghe, 2004)

G(\lambda A_{1}+(1-\lambda)A_{2})\leq\lambda G(A_{1})+(1-\lambda)G(A_{2})\quad\mbox{for all}\quad 0<\lambda<1,\quad\mbox{and}\quad A_{1},A_{2}\in\mathbb{R}^{n}.

(8)

A closely related property called strict convexity requires strict inequality in (8) for $A_{1}\neq A_{2}$ . For a function with continuous derivative, as in our case, the convex property and first-order Taylor series expansion of $G$ yield the differential characterization of convexity

G(A_{2})\geq G(A_{1})+(A_{2}-A_{1})^{\top}\nabla G(A_{1})\quad\mbox{for all}\quad A_{1},A_{2}\in\mathbb{R}^{n}.

(9)

The characterization (9) means that if $\nabla G(A_{1})=0$ , then necessarily

G(A_{2})\geq G(A_{1})\quad\mbox{for all}\quad A_{2}\in\mathbb{R}^{n},

that is, $g(A_{1})=0$ guarantees $A_{1}$ must be the global minimizer of $G$ . Moreover, a strictly convex function $G$ cannot have more than one global minimum; otherwise, given two such minima, we can use the strict version of (8) to get a point with a lower value for e.g. $\lambda=1/2$ .

Finally, for a function with continuous second derivative, non-negativity of the eigenvalues of the Hessian at every $A$ in the domain is a sufficient condition for convexity. As already discussed in Section 3.2, the Jacobian matrix $H$ of $g$ , which is exactly the Hessian of $G$ , is symmetric positive definite, meaning all eigenvalues are strictly positive, so $G$ must be strictly convex (Boyd and Vandenberghe, 2004).

Putting these facts together, the root-finding problem for $g$ (4) is equivalent to a strictly convex minimization problem with objective $G$ (7). This perspective reveals that the original GL method can be strengthened by using additional structure and safeguards provided by $G$ . For example, the simplest safeguard for Newton’s method when minimizing $G$ is a step size search that moves in the Newton direction just enough to guarantee a proportional decrease in $G$ ; adding this element to Algorithm 1 would already provide global convergence guarantees. The optimization problem is given by

\min_{0\leq A\leq N}\quad G(A)

(10)

where $G(A)$ is given in (7). This formulation implicitly maintains domain constraints, that is, non-negativity of $A$ , $N-A$ , as well as non-negativity of $a_{0}$ and $b_{0}$ , since the logarithm is only defined on $\mathbb{R}_{+}$ . The key element in (7) is the scalar entropic distance function $f:[0,\infty)\to\mathbb{R}$ , applied to each count:

f(x)=x\log(x)-x.

(11)

As $x$ approaches $0$ , $x\log(x)$ goes to $0$ , as can be easily seen by using L’Hôpital’s rule. As $x$ grows large, $x\log(x)$ grows faster than $x$ , so $f(x)\to\infty$ as $x\to\infty$ . Finally, the entropic function has positive second derivative on the interior of its domain

f^{\prime\prime}(x)=1/x

so it is strictly convex. Since a sum of convex and strictly convex functions is strictly convex, the entire objective $G$ (7) is strictly convex. This implies that any minimizer of $G$ (7) must be unique, and it remains only to show that a minimizer exists for $G$ .

Theorem 4.1

Suppose $N_{+}>A$ , $M_{1}>1^{\top}A$ , and $n_{0}>a_{0}$ according to the variables defined in Table 1. Let $L$ represent log ORs for the necessary exposure levels and take the elements of $L$ to be finite. Then the function $G(A)$ (7) always has a unique global minimizer.

For the proof, see Appendix 8. By Theorem 4.1, $G$ always has a unique minimizer for feasible inputs, undergirding the approach of Greenland and Longnecker (1992). A unique global minimum exists under simple assumptions about problem data, and standard optimization solvers (including gradient, Gauss-Newton, quasi-Newton, and Newton methods), when properly safeguarded by trust region or line search, will converge to the unique global minimum of $G$ from any feasible initial point $A_{0}$ . In particular, we use disciplined convex programming (Grant et al., 2006) to solve the problem. In Section 6, we show that the root-finding scheme of Greenland and Longnecker (1992) is fragile with respect to initialization, but the new approach is guaranteed to work.
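To illustrate the kind of safeguarded solver described above, here is a minimal Python sketch of Newton's method on $G$ with a backtracking (Armijo) line search. It is not the disciplined convex programming implementation used in this paper. It assumes the cell relations $a_{0}=M_{1}-1^{\top}A$, $b_{0}=n_{0}-a_{0}$, $B=N-A$, and it uses the fact (obtained by differentiating $g$) that the Hessian is a diagonal matrix plus a rank-one term, so the Newton system can be solved with the Sherman-Morrison formula. The function name and the synthetic table are hypothetical.

```python
import math


def solve_gl_or(L, N, M1, n0, A_init, tol=1e-10, max_iter=100):
    """Damped Newton on G; assumes a0 = M1 - sum(A), b0 = n0 - a0, B = N - A."""
    n = len(L)

    def cells(A):
        a0 = M1 - sum(A)
        return a0, n0 - a0, [N[i] - A[i] for i in range(n)]

    def feasible(A):
        a0, b0, B = cells(A)
        return min(A) > 0 and min(B) > 0 and a0 > 0 and b0 > 0

    def ent(x):
        return x * math.log(x) - x

    def G(A):
        a0, b0, B = cells(A)
        linear = -sum(L[i] * A[i] for i in range(n))
        return linear + ent(a0) + ent(b0) + sum(map(ent, A)) + sum(map(ent, B))

    def grad(A):
        a0, b0, B = cells(A)
        return [-L[i] - math.log(a0) - math.log(B[i]) + math.log(A[i])
                + math.log(b0) for i in range(n)]

    A = list(A_init)
    for _ in range(max_iter):
        gvec = grad(A)
        if max(abs(x) for x in gvec) < tol:
            break
        a0, b0, B = cells(A)
        # Hessian = diag(d) + s * 1 1^T; apply its inverse via Sherman-Morrison.
        d = [1 / A[i] + 1 / B[i] for i in range(n)]
        s = 1 / a0 + 1 / b0
        u = [gvec[i] / d[i] for i in range(n)]
        w = [1 / d[i] for i in range(n)]
        coef = s * sum(u) / (1 + s * sum(w))
        step = [-(u[i] - coef * w[i]) for i in range(n)]
        slope = sum(gvec[i] * step[i] for i in range(n))  # negative: descent
        t, G0 = 1.0, G(A)
        slack = 1e-12 * (1 + abs(G0))  # tolerance for floating-point round-off
        while True:
            trial = [A[i] + t * step[i] for i in range(n)]
            if feasible(trial) and G(trial) <= G0 + 1e-4 * t * slope + slack:
                break
            t *= 0.5  # shrink until the point is feasible and G decreases
        A = trial
    return A


# Hypothetical table used only to generate consistent log ORs.
A_true = [40.0, 60.0, 30.0]
N = [100.0, 120.0, 90.0]
M1, n0 = 200.0, 150.0
a0_true = M1 - sum(A_true)
b0_true = n0 - a0_true
L = [math.log(a * b0_true / (a0_true * (n - a))) for a, n in zip(A_true, N)]

A_hat = solve_gl_or(L, N, M1, n0, A_init=[50.0, 60.0, 45.0])
A_hat_far = solve_gl_or(L, N, M1, n0, A_init=[5.0, 5.0, 50.0])
```

Both starting points are feasible and lead to the same counts, as expected from strict convexity; the line search only rejects steps that leave the domain or fail to decrease $G$.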

4.2 GL: Relative Risk

We now discuss the changes needed to apply the approach to log-RR scores. The overall approach and notation (see Table 1) largely follow the development in the preceding section. $R$ , the log RR score, is a function of problem data as given by Greenland and Longnecker (1992):

\exp(R)=\frac{An_{0}}{N_{+}a_{0}},\quad R=\log(A)-\log(N_{+})-\log(a_{0})+\log(n_{0}).

Here, $N_{+}$ and $n_{0}$ are treated as known quantities, again following Greenland and Longnecker (1992). To recover the pseudo-counts, we look for $A,a_{0}$ that are roots of

h(A)=-R+\log(A)-\log(N_{+})-\log(a_{0}(A))+\log(n_{0}).

(12)

Greenland and Longnecker (1992) suggest an algorithm similar to Algorithm 1 to construct cell counts for $A$ and $a_{0}$ . Just as in Section 4.1, we cast this root-finding method as a way to solve a convex optimization program based on entropic distance, analogous to (7). Integrating Equation (12), and writing $L_{R}$ for the vector of log RRs $R$ , we obtain

H(A)=A^{\top}{\left(-L_{R}-\log(N_{+})+\textbf{1}\log(n_{0})\right)}+\sum_{i=1}^{n}{\left(A_{i}\log(A_{i})-A_{i}\right)}+{\left(a_{0}(A)\log(a_{0}(A))-a_{0}(A)\right)}.

(13)

The function $H$ is strictly convex, since it is the sum of three linear terms, and $n+1$ entropic distance functions (see the discussion in Section 4.1). We prove a theorem analogous to Theorem 4.1, showing the existence of a solution under simple assumptions; uniqueness follows from strict convexity.

Theorem 4.2

Suppose $M_{1}>1^{\top}A$ . Let $L_{R}$ represent log RRs for the necessary exposure levels such that $L_{R}$ is finite. Then the function $H(A)$ (13) always has a unique minimizer.

The proof for Theorem 4.2 is in the appendix. In this way, we may construct the optimization problem

\min_{A\in\mathbb{R}_{+}^{n}}\quad H(A)

(14)

where $H(A)$ is given in (13). By Theorem 4.2, problem (14) must have a minimizer. A solution to the optimization problem (14) may be found by using any number of optimization methods, and in particular, we can also use disciplined convex programming (Grant et al., 2006) to solve (14), just as in Section 4.1.
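The relationship between $h$ (12) and $H$ (13) can be illustrated with a small numerical check: at a table that generates the log RRs, $h$ vanishes, and $H$ is strictly smaller there than at nearby feasible points. The Python sketch below assumes $a_{0}(A)=M_{1}-1^{\top}A$; the function names and numbers are hypothetical.

```python
import math


def a0_of(A, M1):
    # Assumed relation: reference cases a0 = M1 - sum(A).
    return M1 - sum(A)


def H(A, R, N, n0, M1):
    """Objective (13): linear terms plus n + 1 entropic terms."""
    a0 = a0_of(A, M1)
    linear = sum(a * (-r - math.log(n) + math.log(n0)) for a, r, n in zip(A, R, N))
    entropic = sum(a * math.log(a) - a for a in A) + a0 * math.log(a0) - a0
    return linear + entropic


def h(A, R, N, n0, M1):
    """Root-finding function (12), the gradient of H."""
    a0 = a0_of(A, M1)
    return [-r + math.log(a) - math.log(n) - math.log(a0) + math.log(n0)
            for a, r, n in zip(A, R, N)]


# Hypothetical table used only to generate consistent log RRs.
A_true = [40.0, 60.0, 30.0]
N = [100.0, 120.0, 90.0]
n0, M1 = 150.0, 200.0
a0_true = a0_of(A_true, M1)
R = [math.log(a * n0 / (n * a0_true)) for a, n in zip(A_true, N)]

grad_at_truth = h(A_true, R, N, n0, M1)
H_truth = H(A_true, R, N, n0, M1)
perturbed = [[41.0, 60.0, 30.0], [40.0, 55.0, 33.0], [35.0, 65.0, 28.0]]
H_perturbed = [H(A, R, N, n0, M1) for A in perturbed]
```

Because $H$ is strictly convex, a point with zero gradient is the unique minimizer, which is what the comparison of `H_truth` against `H_perturbed` reflects.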

It may seem natural that root finding here corresponds to a convex objective, but in our experience this is the exception rather than the rule. To be clear, while minimizing a smooth convex function is often done by a root-finding procedure on the gradient, the converse rarely holds; that is, a typical root-finding problem rarely turns out to correspond to the gradient of a convex model. Case in point: the Hamling method has no convex interpretation, and as a result we essentially have to use brute force to derive theoretical convergence guarantees. It is also quite fortunate that the convex reformulation works in a very similar way for the GL approach for both RR and OR. Again returning to Hamling, in the case of RR, we can find a counter-example for which no solution exists. The contrast of GL with Hamling here underscores the rarity of the discovered relationship of the GL approach to convex minimization.

5 Solvability of Hamling Method

In Section 3.3, we gave a brief overview of the method of Hamling et al. (2008), which involved formulating and solving nonlinear equations (5) for $A_{i}$ and $B_{i}$ . The approach relies on the reported variances rather than group totals to infer pseudo-counts. Besides the estimates and variances, the Hamling approach needs only $p$ and $z$ ; see Table 1. However, the parametrization using variances makes the nonlinear equations of Hamling far more difficult to analyze than the GL approach. The original work of Hamling et al. (2008) did not provide any guarantees, and in fact the authors’ numerical examples suggest initialization may be quite important. In this section, we prove that for the OR case, the equations always have a unique positive solution, and when properly initialized, the solution can always be found. In the RR case, the situation is more difficult; we present a counter-example where a solution to the Hamling equations cannot exist, and a partial theoretical result, namely a necessary and sufficient condition for the existence of a solution to Hamling RR in the equivariant case.

5.1 Hamling: Odds Ratios

The quantities that Hamling et al. (2008) use, as functions of the underlying pseudo-counts, are given by:

\displaystyle R_{i}=\frac{A_{i}b_{0}}{a_{0}B_{i}},\quad V_{i}=\frac{1}{a_{0}}+\frac{1}{b_{0}}+\frac{1}{A_{i}}+\frac{1}{B_{i}},\quad p=\frac{b_{0}}{\sum_{i=0}^{n}B_{i}},\quad z=\frac{\sum_{i=0}^{n}B_{i}}{\sum_{i=0}^{n}A_{i}}.

(15)

Using the substitution $B_{i}=\frac{A_{i}b_{0}}{a_{0}R_{i}}$ , Hamling et al. (2008) obtain $B_{i}$ and $A_{i}$ in terms of $a_{0}$ , $b_{0}$ , $R_{i}$ and $V_{i}$ :

\displaystyle\begin{split}B_{i}(a_{0},b_{0})&=\left(1+\frac{b_{0}}{a_{0}R_{i}}\right)/\left(V_{i}-\frac{1}{a_{0}}-\frac{1}{b_{0}}\right)\\ A_{i}(a_{0},b_{0})&=\left(1+\frac{a_{0}R_{i}}{b_{0}}\right)/\left(V_{i}-\frac{1}{a_{0}}-\frac{1}{b_{0}}\right)\end{split}

Note that these are the equations for $A_{i},B_{i}$ that Hamling et al. (2008) express in terms of $a_{0},b_{0}$ in order to match the variances of the pseudo-counts to the reported variances. The authors do not solve these equations explicitly, however; instead, they use Algorithm 2 to iteratively update the parameter values $a_{0},b_{0}$ and the pseudo-counts accordingly. Define the totals over the exposed groups

B_{+}=\sum_{i=1}^{n}B_{i},\quad A_{+}=\sum_{i=1}^{n}A_{i}.

Summing across each set of equations for $A_{i}$ and $B_{i}$ we get

\displaystyle\begin{split}B_{+}&=\sum_{i=1}^{n}\left(1+\frac{b_{0}}{a_{0}R_{i}}\right)/\left(V_{i}-\frac{1}{a_{0}}-\frac{1}{b_{0}}\right)\\ A_{+}&=\sum_{i=1}^{n}\left(1+\frac{a_{0}R_{i}}{b_{0}}\right)/\left(V_{i}-\frac{1}{a_{0}}-\frac{1}{b_{0}}\right)\end{split}

From the definitions of $p$ and $z$ in (15), where the sums include the reference group ( $A_{0}=a_{0}$ , $B_{0}=b_{0}$ ), we have

\displaystyle B_{+}=\frac{1-p}{p}b_{0},\quad A_{+}=\frac{1}{z(1-p)}B_{+}-a_{0}=\frac{1}{zp}b_{0}-a_{0}.

Combining these equations together, we get a system of two explicit equations for unknowns $a_{0}$ and $b_{0}$ :

\displaystyle\begin{split}\frac{1-p}{p}b_{0}&=\sum_{i=1}^{n}\left(1+\frac{b_{0}}{a_{0}R_{i}}\right)/\left(V_{i}-\frac{1}{a_{0}}-\frac{1}{b_{0}}\right)\\ \frac{1}{zp}b_{0}-a_{0}&=\sum_{i=1}^{n}\left(1+\frac{a_{0}R_{i}}{b_{0}}\right)/\left(V_{i}-\frac{1}{a_{0}}-\frac{1}{b_{0}}\right)\end{split}

(16)

The approach developed here is similar in nature to that of Hamling et al. (2008), but equations (16) are not derived in Hamling et al. (2008). The explicit form of (16) is used to prove the results below, namely that equations (16) always have a solution.

First, we show that a unique positive solution to (16) exists when all the variances are identical, that is, all $V_{i}=v$ . The theorem for this case serves as a base case for the induction in the general result, and also is of interest since the proof technique is direct; we actually find the closed form of the solution.

Theorem 5.1

Suppose all of the $V_{i}$ are equal to the scalar $v>0$ . Then there is a unique positive solution of the equations (16) for any value $p\in(0,1)$ and any value of $z>0$ , and any set of positive estimates $R_{i}$ .

Let $c=\frac{a_{0}}{b_{0}}$ . Let $r_{1}=\sum_{i=1}^{n}\frac{1}{R_{i}}$ and $r_{2}=\sum_{i=1}^{n}R_{i}$ . Then the positive solution to Hamling is given by

c=\frac{npz-nz+n-pr_{1}z+\sqrt{D}}{2z(np+(1-p)r_{2})}

where

D=n^{2}p^{2}z^{2}-2n^{2}pz^{2}+2n^{2}pz+n^{2}z^{2}-2n^{2}z+n^{2}-2np^{2}r_{1}z^{2}+2npr_{1}z^{2}+2npr_{1}z+p^{2}r_{1}^{2}z^{2}-4pr_{1}r_{2}z+4r_{1}r_{2}z.

Once we have $c$ , the solutions to (16) are given by

b_{0}=\frac{1}{v}\left(\frac{p}{1-p}\left(n+\frac{r_{1}}{c}\right)+1+\frac{1}{c}\right),\quad a_{0}=cb_{0}.

We provide a proof in Section 8.3 in the Appendix. The crux of the proof is to show that $D$ is always positive, for any feasible inputs $(n,p,z,r_{1},r_{2})$ . An interesting consequence of the proof is that in addition to the unique positive solution for $c$ (and hence $a_{0}$ ), there is also a unique negative solution for $c$ and $a_{0}$ , obtained by taking the negative branch in the quadratic formula. In our numerical experience with Hamling, both our implementation and the one in dosresmeta can find the negative $a_{0}$ solution, leading to infeasible pseudo-counts, when incorrectly initialized.
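The closed form of Theorem 5.1 is easy to check numerically. The following Python sketch evaluates $c$, $b_{0}$ and $a_{0}$ for hypothetical inputs and substitutes them back into equations (16). The discriminant is computed in the factored form $q_{b}^{2}+4q_{a}r_{1}$, which expands to the expression for $D$ above; the names `qa`, `qb` and the helper functions are assumptions introduced for illustration.

```python
import math


def hamling_or_equal_variance(R, v, p, z):
    """Closed-form (a0, b0) for equations (16) when every V_i equals v."""
    n = len(R)
    r1 = sum(1.0 / r for r in R)
    r2 = sum(R)
    # Quadratic qa*c^2 + qb*c - r1 = 0 for c = a0/b0, with D = qb^2 + 4*qa*r1.
    qa = z * (n * p + (1 - p) * r2)
    qb = n * z * (1 - p) - n + p * r1 * z
    D = qb ** 2 + 4 * qa * r1
    c = (-qb + math.sqrt(D)) / (2 * qa)  # positive branch
    b0 = (p / (1 - p) * (n + r1 / c) + 1 + 1 / c) / v
    return c * b0, b0


def residuals_16(a0, b0, R, V, p, z):
    """Left side minus right side of the two equations in (16)."""
    denoms = [v - 1 / a0 - 1 / b0 for v in V]
    sum_B = sum((1 + b0 / (a0 * r)) / d for r, d in zip(R, denoms))
    sum_A = sum((1 + a0 * r / b0) / d for r, d in zip(R, denoms))
    return (1 - p) / p * b0 - sum_B, b0 / (z * p) - a0 - sum_A


# Hypothetical inputs with equal variances.
R = [0.8, 1.5, 2.0]
v, p, z = 0.05, 0.3, 1.2
a0, b0 = hamling_or_equal_variance(R, v, p, z)
res = residuals_16(a0, b0, R, [v] * len(R), p, z)
```

Replacing `+ math.sqrt(D)` by `- math.sqrt(D)` selects the negative branch discussed above, which yields a negative $c$.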

We now show by induction that equations (16) always have a unique feasible solution in the general case.

Theorem 5.2

For any set of positive $V_{i}$ , positive $R_{i}$ , $p\in(0,1)$ and $z>0$ , the equations (16) have a positive solution with $a_{0}>0$ and $b_{0}>0$ .

See the Appendix, Section 8.4, for a proof of Theorem 5.2. The proof proceeds by induction, as it is impossible to find a closed form solution in the general case. This result ensures convergence to a tuple $(a_{0},b_{0})$ that can be used to construct the cell counts $A$ and $B$ according to equations (5). Our presentation of the nonlinear system in the form of equations (16) provides robustness to the method of Hamling et al. (2008), guaranteeing solutions for any choice of positive $V_{i}$ . The inductive step of the theorem, shown in Section 8.4 of the Appendix, uses the structure of the equations to show the existence of the solution.

To find the solution in practice, we minimize the squared norm of the equations (16), similar to the approach shown in Algorithm 2. The construction of the proof assumes the positivity of the denominators $V_{i}-\frac{1}{a_{0}}-\frac{1}{b_{0}}$ throughout. This guides our initialization strategy to ensure that $a_{0}$ and $b_{0}$ are large enough that positivity holds for the smallest reported variance,

V_{\min}=\min_{i}V_{i}.

From a theoretical standpoint, the nonlinear constraint

a_{0}+b_{0}\leq\epsilon a_{0}b_{0}V_{\min}

may be needed and can be maintained via line search; in practice, however, the method always converges as long as the constraint holds at initialization. This is a markedly different strategy from the one suggested by Hamling et al. (2008), who focus on $p^{\prime}$ and $z^{\prime}$ computed from total counts in the data. Their strategy, as implemented by Crippa and Orsini (2016), fails for cases where the reported variances are small, as discussed in Section 6.

5.2 Hamling: Relative Risk

We now consider the log RR scores. We use the same notation as in the rest of this section, except that we now use $L_{R}$ to denote the log RRs instead of log ORs. We have

R_{i}=\frac{A_{i}b_{0}}{a_{0}B_{i}}

where, following the notation of Hamling et al. (2008), $b_{0}$ now indicates total subjects in the reference group, and $B_{i}$ indicates total subjects in each risk group, with $a_{0}$ reference cases and $A_{i}$ cases across exposure groups. Thus we have $b_{0}>a_{0}$ and $B_{i}>A_{i}$ . Moreover, from classic results we have

V_{i}=\frac{1}{a_{0}}-\frac{1}{b_{0}}+\frac{1}{A_{i}}-\frac{1}{B_{i}}.

This becomes the key difference that underlies the construction of our new equations. Indeed, we obtain

\displaystyle\begin{split}A_{i}&=\frac{1-\frac{a_{0}{R_{i}}}{b_{0}}}{V_{i}-\frac{1}{a_{0}}+\frac{1}{b_{0}}}\\ B_{i}&=\frac{\frac{b_{0}}{a_{0}{R_{i}}}-1}{V_{i}-\frac{1}{a_{0}}+\frac{1}{b_{0}}}.\end{split}

(17)
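As a quick sanity check of equations (17), the Python sketch below builds log-RR inputs from a hypothetical table (with $a_{0}$ cases among $b_{0}$ reference subjects and $A_{i}$ cases among $B_{i}$ subjects in each exposed group) and confirms that (17) returns the same counts. The function name and the numbers are illustrative assumptions.

```python
def hamling_rr_counts(a0, b0, R, V):
    """Cases A_i and group totals B_i from equations (17)."""
    A, B = [], []
    for r, v in zip(R, V):
        denom = v - 1.0 / a0 + 1.0 / b0  # note the sign change relative to ORs
        A.append((1.0 - a0 * r / b0) / denom)
        B.append((b0 / (a0 * r) - 1.0) / denom)
    return A, B


# Hypothetical table used only to generate consistent inputs.
a0, b0 = 20.0, 100.0
A_true, B_true = [15.0, 30.0], [80.0, 120.0]
R = [a * b0 / (a0 * b) for a, b in zip(A_true, B_true)]
V = [1 / a0 - 1 / b0 + 1 / a - 1 / b for a, b in zip(A_true, B_true)]

A_hat, B_hat = hamling_rr_counts(a0, b0, R, V)
# With a positive denominator, B_i > A_i is the same as x + 1/x > 2
# for x = b0 / (a0 * R_i).
reciprocal_sums = [b0 / (a0 * r) + a0 * r / b0 for r in R]
```

The values in `reciprocal_sums` exceed 2 whenever $b_{0}/(a_{0}R_{i})\neq 1$, in line with the reciprocal-sum inequality for the constraint $B_{i}>A_{i}$.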

The constraint that $B_{i}>A_{i}$ does not give any new information, since it is equivalent to

\frac{b_{0}}{a_{0}{R_{i}}}+\frac{a_{0}{R_{i}}}{b_{0}}>2.

The sum of any positive quantity and its reciprocal is always greater than or equal to $2$ , with the minimum attained when the quantity is exactly $1$ . We do, however, know something about $z$ . Recall the formulas

p=\frac{b_{0}}{\sum_{i=0}^{n}B_{i}},\quad z=\frac{\sum_{i=0}^{n}B_{i}}{\sum_{i=0}^{n}A_{i}}.

For the relative risk case, by definition we have $z\geq 1$ .

Using equations (17) and formulas for $p$ and $z$ , we construct the two nonlinear equations that are analogous to equations (16):

\displaystyle\begin{split}\frac{1-p}{p}b_{0}&=\sum_{i=1}^{n}\left(\frac{b_{0}}{a_{0}R_{i}}-1\right)/\left(V_{i}-\frac{1}{a_{0}}+\frac{1}{b_{0}}\right)\\ \frac{1}{zp}b_{0}-a_{0}&=\sum_{i=1}^{n}\left(1-\frac{a_{0}R_{i}}{b_{0}}\right)/\left(V_{i}-\frac{1}{a_{0}}+\frac{1}{b_{0}}\right).\end{split}

(18)

We now give results analogous to Theorems 5.1 and 5.2, which are proved in the Appendix.

Theorem 5.3

Suppose all of the $V_{i}$ are equal to the scalar $v>0$ . Then there is a unique positive solution of the equations (18) for any value $p\in(0,1)$ and any value of $z>0$ , and any set of positive estimates $R_{i}$ , if and only if

(1-p)z\geq\left(\frac{1-p}{p}\right)^{2}4r_{2},\quad(1-p)z\geq 1.

Using the notation of Theorem 5.1, when the conditions above are satisfied the positive solution $c=\frac{a_{0}}{b_{0}}$ is given by

c=\frac{n(z-pz+1)+pr_{1}z+\sqrt{D}}{2z(np+(1-p)r_{2})}

where

D=(n(pz-z-1)-r_{1}zp)^{2}-4r_{1}(nzp+r_{2}z-r_{2}pz).

Once $c$ is found, we have

b_{0}=\frac{1}{vc}\left(\frac{p}{1-p}(r_{1}-cn)+(1-c)\right),\quad a_{0}=cb_{0}.

A simple counter-example that violates the two inequalities required by Theorem 5.3, and for which there is no solution, is given by

R_{1}=0.9328,R_{2}=0.062,p=0.1,z=1.1.

These values admit no solution in the RR case for any equal variance values $V_{1}=V_{2}=v$ . In Section 6, we show that available implementations return nonsensical results, and in fact cannot solve the defining equations, which makes sense, given that $D<0$ in this case. In contrast to the previous section, there is no way to fix this issue; a solution simply cannot exist. The best we can do in such a case is to suggest that the modeler check the inputs $R_{i},V_{i},p,z$ or consider using the GL approach, which is always guaranteed to work.
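The claim $D<0$ for this counter-example can be reproduced directly from the formula for $D$ in Theorem 5.3. The short Python sketch below evaluates the discriminant and the two inequalities of the theorem as stated; the function name is a hypothetical helper.

```python
def rr_discriminant(R, p, z):
    """Discriminant D of the quadratic for c = a0/b0 in Theorem 5.3."""
    n = len(R)
    r1 = sum(1.0 / r for r in R)
    r2 = sum(R)
    square = (n * (p * z - z - 1) - r1 * z * p) ** 2
    return square - 4 * r1 * (n * z * p + r2 * z - r2 * p * z)


# Counter-example values from the text.
R = [0.9328, 0.062]
p, z = 0.1, 1.1
r2 = sum(R)

D = rr_discriminant(R, p, z)
first_condition = (1 - p) * z >= ((1 - p) / p) ** 2 * 4 * r2
second_condition = (1 - p) * z >= 1
```

Since the variance $v$ does not enter $D$, a negative discriminant rules out a real ratio $c$ for every choice of equal variances.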

6 Numerical Examples

In this section, we review detailed examples of the implementation and results of our proposed methods as described in Sections 4 and 5. First, we show that the corrected methods we propose reproduce the results of Greenland and Longnecker (1992) and Hamling et al. (2008) for the canonical examples in these papers. Second, we show failure modes for Greenland and Longnecker (1992) and Hamling et al. (2008), together with correct estimates from the robust implementations based on the results in this work. For GL, we leverage the connection to convex optimization to provide software that uses the disciplined convex programming library CVXPY for our modification of the approach of Greenland and Longnecker (1992). For Hamling, we use SciPy optimization routines with a theoretically justified initialization to solve the root finding problem. To demonstrate the failure modes, we use the R library dosresmeta (Crippa and Orsini, 2016), which implements both GL and Hamling methods.

6.1 Results Comparison to GL and Hamling: Canonical Examples

We use the data from (Greenland and Longnecker, 1992, Table 1) as a simple example showing that the optimized GL reproduces the same results as regular Greenland and Longnecker (1992) and Hamling et al. (2008) for simple problems. In the example given by Greenland and Longnecker (1992), the authors fit the linear-logistic model

\lambda(x,z)=\alpha+\beta x.

In this case, the model gives the log-odds of a subject being a case, and we want to estimate $\beta$ . The data $x$ represent alcohol intake as exposure levels. We present a summary of the adjusted estimates we obtain using our convex formulation for the objective of Greenland and Longnecker (1992) and solutions to the modified system of equations originally in Hamling et al. (2008), showing the coefficient estimate $\hat{\beta}$ along with its variance estimate for each method.

In Table 2 we present the least-squares estimates generated from the four pseudo-count fitting techniques described in this study, alongside an unadjusted baseline. “Unadjusted” denotes the estimates obtained using reported variances with the independence assumption. “GL” denotes the least-squares and variance estimates obtained by the cell-fitting procedure of Greenland and Longnecker (1992). “Hamling” denotes the estimates produced by the method of Hamling et al. (2008). “Convex GL” denotes the estimates obtained from our fitting procedure that modifies the method of Greenland and Longnecker (1992) as described in Section 4. “Solved Hamling” denotes the estimates obtained from our fitting procedure that modifies the method of Hamling et al. (2008) as described in Section 5.

Table 2: Estimates and variances table–log-odds ratios.

Method	$\hat{\beta}$	Variance
Unadjusted	0.0334	0.000349
GL	0.0454	0.000427
Convex GL	0.0454	0.000427
Hamling	0.04588	0.000421
Solved Hamling	0.04588	0.000421

The Convex GL method produces the same results as the original GL approach of Greenland and Longnecker (1992) when the latter succeeds. Additionally, our Solved Hamling method produces the same results as the standard Hamling method when the latter succeeds. Any numerical differences in variance results for corresponding methods arise from solver precision; for the Convex GL approach that uses DCP, we use a high degree of precision in the solver, so these results correspond to solving the equations to a greater degree of precision. The estimates obtained by Hamling vs. GL differ, but this is to be expected, as discussed in Section 3.1.

We next summarize, in Table 3, the pseudo-counts for cases only generated by each method. We follow the same notation used in Table 1 for cases.

Table 3: Pseudo-count table–log-odds ratios.

Method	$a_{0}$	$A_{1}$	$A_{2}$	$A_{3}$
GL	160.4702	70.2046	95.4696	124.8556
Convex GL	160.5064	70.3304	95.4857	124.6776
Hamling	96.2699	50.9684	57.2220	67.7043
Solved Hamling	96.2653	50.9654	57.2180	67.6989

On this simple example, the pseudo-counts for cases generated by our methods closely match those generated by the original methods. Again, counts generated by GL methods differ from those generated by Hamling methods, since GL relies on group counts, whereas Hamling uses reported variances. The pseudo-counts are an intermediate result whose main purpose is to obtain the covariance matrix, and we compare the covariance matrices obtained by these methods in Figure 3.

There are small differences between the individual entries in Figure 3. These differences are in fact what cause the estimates in Table 2 to vary slightly between the GL and Hamling-based methods. Note that the Convex GL covariance matrix has different entries, whereas the entries of the (Solved) Hamling covariance matrix are identical. This is due to the variance model in the construction of the Hamling estimators.

Next, we run a similar test on RRs, using the alcohol and colorectal cancer data and results in Orsini et al. (2012). We present a summary of the adjusted estimates obtained by our methods and by the methods of GL and Hamling. We use data directly from dosresmeta, specifically the alcohol_crc dataframe, and analyze the subset identified by the author id atm. In Table 4 we present the least-squares estimates, similar to what was shown above.

Table 4: Estimates and variances table–log-relative risks.

Method	$\hat{\beta}$	Variance
Unadjusted	-0.00294	1.5865e-05
GL	0.0071	1.5176e-05
Convex GL	0.0071	1.5166e-05
Hamling	0.0063	1.5490e-05
Solved Hamling	0.0063	1.5436e-05

Once again we see small numerical differences in variance estimates, with our estimates using high precision on the equation solves. We also see a larger difference between the estimates obtained by GL vs. Hamling, a direct consequence of the different parametrizations.

We provide a summary of case pseudo-counts generated by each method in Table 5. The pseudo-count estimates within method families are close, while counts between GL and Hamling methods match in some groups but differ in others, causing the differences observed in the estimates in Table 4.

Table 5: Pseudo-count table–log-relative risk.

Method	$a_{0}$	$A_{1}$	$A_{2}$	$A_{3}$	$A_{4}$	$A_{5}$
GL	26.5957	34.0061	42.8532	33.3584	17.9492	29.2359
Convex GL	26.5973	34.0061	42.8532	33.3583	17.9492	29.2359
Hamling	26.4495	39.5129	44.2940	31.6140	15.3332	22.6277
Solved Hamling	26.4087	39.4526	44.2234	31.5706	15.3105	22.5738

Covariance matrices obtained from pseudo-counts generated by our Convex GL and Solved Hamling methods are shown in Figure 4. We again see identical entries in the covariance matrix produced from Hamling. The differences in covariance values between the matrices explain the differences in estimates values in Table 4.

We now continue to the avoidable failure modes, providing simple OR examples where the original GL and Hamling methods fail but our Convex GL and Solved Hamling methods succeed.

6.2 Original method failure and Corrected success

In this section we produce simple failure modes for original GL and Hamling methods, and show that new methods work on these cases, as expected from the theoretical results. This is reassuring to practitioners running many analyses; the need to re-initialize current methods and potential quiet failures of the Hamling method can both be avoided with straightforward modifications. To demonstrate the failure modes, we perturb the alcohol_cvd data from dosresmeta.

6.2.1 GL method failure

Using the method of GL first, we change the number of subjects at each exposure level in the alcohol_cvd dataset to be a function of the number of cases in the same dataset at the corresponding exposure levels. Namely, we modify the number of subjects to be

N=A+t

for integer values $t\in\{1,\dots,20\}$ . The lower the $t$ , the more extreme the situation, corresponding to very few controls in each group. We use each $N$ as input data to the standard GL routine to construct pseudo-counts. For $t\leq 13$ , the original GL method in dosresmeta fails.

In the cases of failure, even though the initial $A$ is feasible, GL iterations run afoul of the logarithmic terms in the dosresmeta implementation for low $t$ . For $t\geq 14$ , this issue disappears. The entire problem is avoided when we use the convex GL approach, which succeeds in all cases.

The new Convex GL method succeeds even in the extreme case when $t=1$ . In Figure 5, we compare the covariance matrix for $N^{1}=A+1$ to the covariance matrix GL obtains on the original data. We see that the covariance matrices returned for the original and perturbed data are fairly close, suggesting that correlations are well-behaved in such cases and underscoring the need for a robust method. In other words, results for GL will likely be useful even for small studies when we have very few controls.

6.2.2 Hamling method failure

The Hamling method fails when the default initialization does not guarantee positivity of all denominators $V_{i}-\frac{1}{a_{0}}-\frac{1}{b_{0}}$ . For example, the Hamling initialization used by dosresmeta can break when the input $V_{i}$ are very small. In this case, the dosresmeta Hamling approach returns negative pseudo-counts, and correlations computed using these counts.

To show this failure mode, we alter the alcohol_cvd dataset in dosresmeta by changing the reported variances to be

\hat{V}={\left(\mathrm{NA},0.001,0.01,0.2,0.9\right)}

where the NA is a placeholder for the reference exposure level. Passing this data into the hamling method in dosresmeta, we obtain negative values in the estimated counts for cases and non-cases at the first level of exposure, as shown in Table 6.

Table 6: Pseudo-count table–broken Hamling example.

Method	$a_{0}$	$A_{1}$	$A_{2}$	$A_{3}$	$A_{4}$
Hamling	189.7	-207.5	514.2	8.6	2.2
Solved Hamling	2897.8	2976.1	157.2	31.5706	9.2

The method of Hamling fails silently, since it then uses the negative values to compute the covariance matrix. To study the downstream effects, we compare the covariance matrices constructed by dosresmeta from the wrong pseudo-counts generated by the original Hamling method with those generated by the solved Hamling method in Figure 6. The solved Hamling method obtains an order of magnitude smaller correlation across the subgroups. This means that when Hamling fails quietly, it will provide estimates that deviate further from the uncorrected estimates compared to the correctly solved formulation.

We extend this example to study the range of the failure mode as a function of the scale of variance values. For simplicity we vary only the first element of $\hat{V}$ . We then assess whether there are any negative values in the constructed pseudo-counts for cases $A$ by the Hamling method. As can easily be verified, the variance values below $10^{-3}$ in the first numerical coordinate produce negative values in $A$ . For even smaller variances the method still fails, but such small variances correspond to huge sample sizes that are unlikely to occur in practice. The variance values greater than $10^{-3}$ produce only positive values, even beyond 1. This shows clearly how the Hamling method fails for small enough variance values given the default initialization in dosresmeta, and that it can be fixed easily by using the strategy discussed in Section 5.

To fix the problem we use the initialization suggested by the theoretical analysis. Specifically, we construct the initialization parameters $a_{0},b_{0}$ as

(a_{0},b_{0})={\left(\frac{10}{\min(v)},\frac{10}{\min(v)}\right)}.

The underlying idea is that the large initialization ensures the denominators of $A_{i}$ and $B_{i}$ in equations (5) remain positive, ensuring all counts are positive. This works well numerically, and does not break regardless of the $v_{i}$ values. This provides empirical support for the proof technique given in Theorem 5.2.
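The effect of this initialization on the denominators can be seen with a few lines of Python. The sketch below uses the exposed-group variances from the perturbed example and compares the large start with a hypothetical small start of $a_{0}=b_{0}=100$; the function names are illustrative assumptions.

```python
def denominators(a0, b0, V):
    """Denominators V_i - 1/a0 - 1/b0 shared by A_i and B_i in equations (5)."""
    return [v - 1.0 / a0 - 1.0 / b0 for v in V]


def large_init(V, scale=10.0):
    """Initialization (a0, b0) = (scale / min(V), scale / min(V))."""
    v_min = min(V)
    return scale / v_min, scale / v_min


V = [0.001, 0.01, 0.2, 0.9]  # exposed-group variances from the perturbed example

a0, b0 = large_init(V)
safe = denominators(a0, b0, V)
too_small = denominators(100.0, 100.0, V)  # hypothetical small start
```

With `scale=10`, the smallest denominator equals $V_{\min}(1-2/10)=0.8\,V_{\min}$, so every denominator is positive at the start, whereas the small start already gives a negative denominator for the first exposure level.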

In the next section, we study the unavoidable failure mode of Hamling for RRs.

6.3 Hamling Failure for RR

We review the counter-example presented in Section 5:

R_{1}=0.9328,R_{2}=0.062,p=0.1,z=1.1.

This example was obtained by violating the conditions presented in Theorem 5.3 for the equivariant case. The failure corresponds to obtaining a negative discriminant in the quadratic formula for the ratio $c=\frac{a_{0}}{b_{0}}$ , and means that a solution cannot exist, regardless of reported (equal) variances. To see this bear out in practice we make a simple choice

v_{1}=v_{2}=1.0.

Running dosresmeta on this example gives us results in Table 7.

Table 7: Hamling Results for RR Counter-Example.

$A$	$N$
1.4	1.3
$-1.1\times 10^{-5}$	$-1.1\times 10^{-5}$
$0.88$	$13.3$

We see negative values for $A$ and $N$ , a problem in any situation, and $a_{0}>b_{0}$ , which is impossible for RR. These issues can still occur for a candidate solution to the equations (18). However, the claim we made is stronger: a solution that satisfies the six equations corresponding to $p,z,R_{1},R_{2},v_{1},v_{2}$ cannot exist. When we review the dosresmeta result with respect to these six equations, we find that two of the six are in fact not satisfied:

	$\displaystyle R_{1}(A,N)$	$\displaystyle=0.93,\quad R_{2}(A,N)=0.062,\quad v_{2}(A,N)=1.0,\quad p(A,N)=0.1;$
	$\displaystyle v_{1}(A,N)$	$\displaystyle=-0.05;\quad z(A,N)=6.33.$

In contrast to the previous examples, there is no way to fix this; we know from the proof of Theorem 5.3 that no solutions can exist to this example.

7 Conclusion

In this paper we have taken a closer look at the methods of Greenland and Longnecker (1992) and Hamling et al. (2008).

We have shown that the GL approach can be reformulated as the minimization of a convex model, for both ORs and RRs. In both cases we can avoid all numerical difficulty and guarantee convergence to the unique optimal point for any feasible data inputs. This was a rather surprising finding that initially motivated us to write the paper. The convex loss that emerged when we integrated the optimality conditions is the entropic distance function, an object that appears in other areas of mathematics and statistics. An unexplored consequence of the connection to convex models is that it is now easy to include side information (if such information is available to modelers) through linear equality and inequality constraints on the pseudo-counts $A$ . As long as there is a feasible $A$ , the theory in this paper guarantees a unique solution, and modifying the formulation is straightforward in cvxpy. We leave further exploration of this idea to future work.

For the Hamling method, the story is more complicated. In the case of OR, we were able to show that the Hamling equations always have a solution. In fact we obtained a closed form solution for the equivariant case (all reported variances equal) and provided a proof by induction for the general case. This means that literally for any observed ORs, variances, $p$ , and $z$ , we can always find a solution.

In contrast, for RR, there is no guarantee that Hamling will work. We presented a counter-example when there are only two alternative groups. Counter-examples are by nature odd, but nonetheless there is a fundamental difference between RR and OR for Hamling stemming from relying on reported variances. This is curious. Between the methods of GL and Hamling, when faced with many meta-analyses we find the Hamling approach more appealing, since it only needs $p$ and $z$ in addition to reported estimates and variances. Based on the RR failure, we should keep the GL method available should an unavoidable failure mode arise.

We have done our best to make the results as interpretable and clear as possible. Implementations of the GL and Hamling methods are publicly available at https://github.com/ihmeuw-msca/CorrelationCorrection, and we have shown simple examples that break the widely used dosresmeta package. Using the insights in this paper, safeguarding estimates available in other packages is a straightforward task. For GL, it is a matter of providing standard optimization guardrails, such as a line search. For Hamling, it is a change in the initialization strategy based on the minimum reported variance.

References

Agrawal et al. [2018] Akshay Agrawal, Robin Verschueren, Steven Diamond, and Stephen Boyd. A rewriting system for convex optimization problems. Journal of Control and Decision, 5(1):42–60, 2018.
Boyd and Vandenberghe [2004] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
Crippa and Orsini [2016] Alessio Crippa and Nicola Orsini. Multivariate dose-response meta-analysis: The dosresmeta R package. Journal of Statistical Software, Code Snippets, 72(1):1–15, 2016. doi: 10.18637/jss.v072.c01.
Crippa et al. [2019] Alessio Crippa, Andrea Discacciati, Matteo Bottai, Donna Spiegelman, and Nicola Orsini. One-stage dose–response meta-analysis for aggregated data. Statistical methods in medical research, 28(5):1579–1596, 2019.
Dai et al. [2022] Xiaochen Dai, Gabriela F Gil, Marissa B Reitsma, Noah S Ahmad, Jason A Anderson, Catherine Bisignano, Sinclair Carr, Rachel Feldman, Simon I Hay, Jiawei He, et al. Health effects associated with smoking: a burden of proof study. Nature medicine, 28(10):2045–2055, 2022.
Deeks et al. [2019] Jonathan J Deeks, Julian PT Higgins, Douglas G Altman, and Cochrane Statistical Methods Group. Analysing data and undertaking meta-analyses. Cochrane handbook for systematic reviews of interventions, pages 241–284, 2019.
Diamond and Boyd [2016] Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.
Gautschi [1997] Walter Gautschi. Numerical analysis: an introduction. Birkhauser Boston Inc., USA, 1997. ISBN 0817638954.
Grant et al. [2006] Michael Grant, Stephen Boyd, and Yinyu Ye. Disciplined convex programming. Springer, 2006.
Greenland and Longnecker [1992] Sander Greenland and Matthew P. Longnecker. Methods for trend estimation from summarized dose-response data, with applications to meta-analysis. American journal of epidemiology, 135(11):1301–1309, 1992. URL https://api.semanticscholar.org/CorpusID:31135711.
Haidich [2010] Anna-Bettina Haidich. Meta-analysis in medical research. Hippokratia, 14(Suppl 1):29, 2010.
Hamling et al. [2008] Jan Hamling, Peter Lee, Rolf Weitkunat, and Mathias Ambühl. Facilitating meta-analyses by deriving relative effect and precision estimates for alternative comparisons from a set of estimates presented by exposure level or disease category. Statistics in medicine, 27(7):954–970, 2008.
Itoga et al. [2018] Nathan K Itoga, Daniel S Tawfik, Charles K Lee, Satoshi Maruyama, Nicholas J Leeper, and Tara I Chang. Association of blood pressure measurements with peripheral artery disease events: reanalysis of the allhat data. Circulation, 138(17):1805–1814, 2018.
Kariya and Kurata [2004] Takeaki Kariya and Hiroshi Kurata. Generalized least squares. John Wiley & Sons, 2004.
Lescinsky et al. [2022] Haley Lescinsky, Ashkan Afshin, Charlie Ashbaugh, Catherine Bisignano, Michael Brauer, Giannina Ferrara, Simon I Hay, Jiawei He, Vincent Iannucci, Laurie B Marczak, et al. Health effects associated with consumption of unprocessed red meat: a burden of proof study. Nature Medicine, 28(10):2075–2082, 2022.
Liu et al. [2009] Qin Liu, Nancy R Cook, Anna Bergström, and Chung-Cheng Hsieh. A two-stage hierarchical regression model for meta-analysis of epidemiologic nonlinear dose–response data. Computational Statistics & Data Analysis, 53(12):4157–4167, 2009.
Orsini et al. [2012] Nicola Orsini, Ruifeng Li, Alicja Wolk, Polyna Khudyakov, and Donna Spiegelman. Meta-analysis for linear and nonlinear dose-response relations: examples, an evaluation of approximations, and software. American journal of epidemiology, 175(1):66–73, 2012.
Razo et al. [2022] Christian Razo, Catherine A Welgan, Catherine O Johnson, Susan A McLaughlin, Vincent Iannucci, Anthony Rodgers, Nelson Wang, Kate E LeGrand, Reed JD Sorensen, Jiawei He, et al. Effects of elevated systolic blood pressure on ischemic heart disease: a burden of proof study. Nature medicine, 28(10):2056–2065, 2022.
Rudin et al. [1964] Walter Rudin et al. Principles of mathematical analysis, volume 3. McGraw-hill New York, 1964.
Schmidt and Kohlmann [2008] Carsten Oliver Schmidt and Thomas Kohlmann. When to use the odds ratio or the relative risk? International journal of public health, 53(3):165, 2008.
Stanaway et al. [2022] Jeffrey D Stanaway, Ashkan Afshin, Charlie Ashbaugh, Catherine Bisignano, Michael Brauer, Giannina Ferrara, Vanessa Garcia, Demewoz Haile, Simon I Hay, Jiawei He, et al. Health effects associated with vegetable consumption: a burden of proof study. Nature medicine, 28(10):2066–2074, 2022.
Zheng et al. [2021] Peng Zheng, Ryan Barber, Reed JD Sorensen, Christopher JL Murray, and Aleksandr Y Aravkin. Trimmed constrained mixed effects models: formulations and algorithms. Journal of Computational and Graphical Statistics, 30(3):544–556, 2021.
Zheng et al. [2022] Peng Zheng, Aleksandr Aravkin, Christopher Murray, et al. The burden of proof studies: assessing the evidence of risk. Nature Medicine, 28(10):2038–2044, 2022.

8 Appendix

In this section, we provide proofs of theorems presented throughout this work.

8.1 Proof of Theorem 4.1

Take $G$ as defined in equation (7). $G$ is continuous on its domain $[0,\infty)^{n}$ . First, we show that $G$ is proper, i.e., that $G(A)<+\infty$ for at least one $A$ in the domain and that $G(X)>-\infty$ for every $X\in[0,\infty)^{n}$ . For the first fact, we need the hypothesis in the statement of the theorem. Let $A$ be the vector of ones of length $n$ , i.e., $A=[1,\dots,1]^{\top}$ . Since, by hypothesis, $N_{+}>A$ and $n_{0}>a_{0}$ , we have $G(A)<\infty$ by inspection. Also by inspection, $G$ is not equal to $-\infty$ for any $A$ in its domain.

Next, $G$ is optimized over the compact set $0\leq A\leq N$ . Since $G$ is continuous on this compact set, it attains its minimum there. Since $G$ is strictly convex, the minimizer must be unique. This completes the proof.

8.2 Proof of Theorem 4.2

Take $H$ as defined in equation (13). This proof follows the same structure as the proof of Theorem 4.1. We need only show that $H$ is proper and that it has compact sublevel sets, since $H$ is clearly continuous on the domain $[0,\infty)^{n}$ . To show that $H$ is proper, as in the proof of Theorem 4.1, consider the case when $A$ is the vector of ones of length $n$ . By the hypothesis in the statement of Theorem 4.2, $n_{0}>a_{0}$ , so that $H$ is finite at this point by inspection. Also by inspection, $H$ is never equal to $-\infty$ at any point in its domain.

Next, we show that $H$ has compact sublevel sets, that is,

\mathcal{A}_{\alpha}:=\{A:H(A)\leq\alpha\}

are closed and bounded. Closedness follows immediately from continuity. Next, $H(X)\to\infty$ as ${\left\|X\right\|}\to\infty$ for $X\in[0,\infty)^{n}$ , since $H$ is a sum of affine functions and entropic distance functions in all coordinates (see equation (11)), and the $x\log x$ terms in $H$ grow faster than the linear terms. This implies directly that every sublevel set of $H$ is bounded. Thus, $H$ has compact sublevel sets. In particular, $H$ attains its minimum on any nonempty sublevel set, so we can consider $\alpha=H(1)$ , with $1$ denoting the vector of all ones discussed in the previous paragraph. Once we know $H$ attains its minimum, we also know that the minimizer is unique by strict convexity of the entropic distance.
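The growth claim used above can be made explicit with an elementary one-variable bound. Here $t\geq 0$ stands for a single coordinate and $c$ is a placeholder for the slope of whatever linear term accompanies it; neither symbol is notation from equation (11), so this is a sketch of the mechanism rather than a restatement of $H$.

```latex
t\log t-ct \;=\; t\,(\log t-c)\;\geq\; t \quad\text{for all } t\geq e^{c+1},
\qquad
\min_{t>0}\,\bigl(t\log t-ct\bigr) \;=\; -e^{c-1}\ \text{ (attained at } t=e^{c-1}\text{)}.
```

The first bound shows that a coordinate tending to infinity drives its term to infinity, and the second shows that the remaining coordinates cannot compensate because each term is bounded below.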

8.3 Proof of Theorem 5.1

To prove the result, we simplify and rewrite the equations

	$\displaystyle\left(V-\frac{1}{a_{0}}-\frac{1}{b_{0}}\right)\frac{1-p}{p}b_{0}$	$\displaystyle=\sum_{i=1}^{n}\left(1+\frac{b_{0}}{a_{0}R_{i}}\right)=n+\frac{b_{0}}{a_{0}}\sum_{i=1}^{n}\frac{1}{R_{i}}=n+\frac{b_{0}}{a_{0}}r_{1}$
	$\displaystyle\left(V-\frac{1}{a_{0}}-\frac{1}{b_{0}}\right)\left(\frac{1}{zp}b_{0}-a_{0}\right)$	$\displaystyle=\sum_{i=1}^{n}\left(1+\frac{a_{0}R_{i}}{b_{0}}\right)=n+\frac{a_{0}}{b_{0}}\sum_{i=1}^{n}R_{i}=n+\frac{a_{0}}{b_{0}}r_{2}$

Dividing the equations we obtain

\frac{n+\frac{b_{0}}{a_{0}}r_{1}}{n+\frac{a_{0}}{b_{0}}r_{2}}=\frac{\frac{1-p}{p}b_{0}}{\frac{1}{zp}b_{0}-a_{0}}=\frac{\frac{1-p}{p}}{\frac{1}{zp}-\frac{a_{0}}{b_{0}}}

Defining now $c=\frac{a_{0}}{b_{0}}$ we have

\frac{n+\frac{r_{1}}{c}}{n+cr_{2}}=\frac{\frac{1-p}{p}}{\frac{1}{zp}-c}

Multiplying by $c$ we have

\frac{r_{1}+nc}{n+r_{2}c}=\frac{(1-p)zc}{1-pzc}

The solution is given by

c=\frac{npz-nz+n-pr_{1}z\pm\sqrt{D}}{2z(np-pr_{2}+r_{2})}

where

	$\displaystyle D$	$\displaystyle=n^{2}p^{2}z^{2}-2n^{2}pz^{2}+2n^{2}pz+n^{2}z^{2}-2n^{2}z+n^{2}-2np^{2}r_{1}z^{2}+2npr_{1}z^{2}+2npr_{1}z+p^{2}r_{1}^{2}z^{2}-4pr_{1}r_{2}z+4r_{1}r_{2}z$
		$\displaystyle=n^{2}\left(p^{2}z^{2}-2pz^{2}+2pz+z^{2}-2z+1\right)$
		$\displaystyle+n\left(-2p^{2}r_{1}z^{2}+2pr_{1}z^{2}+2pr_{1}z\right)$
		$\displaystyle+p^{2}r_{1}^{2}z^{2}-4pr_{1}r_{2}z+4r_{1}r_{2}z$

We want to show that each piece is $\geq 0$ . For the first piece, the bracket is a perfect square:

p^{2}z^{2}-2pz^{2}+2pz+z^{2}-2z+1=\left(1-(1-p)z\right)^{2}\geq 0,

so as a result we have

n^{2}\left(p^{2}z^{2}-2pz^{2}+2pz+z^{2}-2z+1\right)\geq 0.

Next, we have

n\left(-2p^{2}r_{1}z^{2}+2pr_{1}z^{2}+2pr_{1}z\right)=n(2r_{1}z)\left((1-p)pz+p\right)\geq 2nr_{1}zp.

Finally, we have

p^{2}r_{1}^{2}z^{2}-4pr_{1}r_{2}z+4r_{1}r_{2}z=p^{2}r_{1}^{2}z^{2}+4(1-p)r_{1}r_{2}z\geq p^{2}r_{1}^{2}z^{2}

Putting everything together, we get

D\geq 2nr_{1}zp+p^{2}r_{1}^{2}z^{2}=pr_{1}z(2n+pr_{1}z)\geq 0.

Thus a solution always exists.

To see that only one solution is positive, recall the form of the solution:

c=\frac{npz-nz+n-pr_{1}z\pm\sqrt{D}}{2z(np-pr_{2}+r_{2})}

Collecting the terms of $D$ into a square, we can observe that

D-(npz-nz+n-pr_{1}z)^{2}=4r_{1}z\left(np+(1-p)r_{2}\right)>0

and as a result

\sqrt{D}>\left|npz-nz+n-pr_{1}z\right|.

That means we have

c_{2}=\frac{npz-nz+n-pr_{1}z-\sqrt{D}}{2z(np+(1-p)r_{2})}<0<c_{1}=\frac{npz-nz+n-pr_{1}z+\sqrt{D}}{2z(np+(1-p)r_{2})}.

Plugging $c_{1}$ in to the first equation, we have

b_{0}=\frac{1}{V}\left(\frac{p}{1-p}\left(n+\frac{r_{1}}{c_{1}}\right)+1+\frac{1}{c_{1}}\right),\quad a_{0}=c_{1}b_{0},

and we have found the unique positive solution. This completes the proof.
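The closed form above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it evaluates $c_{1}$ , $b_{0}$ and $a_{0}$ from the formulas in this proof and then substitutes them back into the two equations at the start of the proof. The function names and the example inputs are hypothetical; $p$ and $z$ are the summary inputs denoted $p$ and $z$ in the paper.

```python
import math


def hamling_or_equal_variance(R, V, p, z):
    """Closed-form (a0, b0) for the equal-variance OR system of Theorem 5.1.

    R: reported odds ratios; V: the common reported variance.
    """
    n = len(R)
    r1 = sum(1.0 / r for r in R)  # r_1 = sum_i 1 / R_i
    r2 = sum(R)                   # r_2 = sum_i R_i
    base = n * p * z - n * z + n - p * r1 * z
    lead = z * (n * p + (1.0 - p) * r2)  # half of the denominator 2z(np + (1-p) r_2)
    disc = base**2 + 4.0 * r1 * lead     # D, written as a square plus a positive term
    c1 = (base + math.sqrt(disc)) / (2.0 * lead)  # the positive root of the quadratic
    b0 = (p / (1.0 - p) * (n + r1 / c1) + 1.0 + 1.0 / c1) / V
    return c1 * b0, b0


def residuals(R, V, p, z, a0, b0):
    """Left side minus right side of the two equations that open the proof."""
    slack = V - 1.0 / a0 - 1.0 / b0
    eq1 = slack * (1.0 - p) / p * b0 - sum(1.0 + b0 / (a0 * r) for r in R)
    eq2 = slack * (b0 / (z * p) - a0) - sum(1.0 + a0 * r / b0 for r in R)
    return eq1, eq2


# Illustrative inputs only; these values are not taken from the paper.
R, V, p, z = [1.5, 2.0], 0.1, 0.3, 1.0
a0, b0 = hamling_or_equal_variance(R, V, p, z)
print(a0, b0, residuals(R, V, p, z, a0, b0))
```

Both residuals should vanish up to floating-point error, and $V-\frac{1}{a_{0}}-\frac{1}{b_{0}}$ should be positive, which is what makes the resulting pseudo-counts positive.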

8.4 Proof of Theorem 5.2

We prove this theorem by induction. For the base case, when $n=1$ , the existence of a unique positive solution follows immediately from Theorem 5.1. For the inductive hypothesis, suppose that for a given $n$ for the dimension of our vectors $V$ and $L$ , we have the positive solution pair $a_{0}^{n},b_{0}^{n}$ that simultaneously satisfy the system (16). Thus, we have that

	$\displaystyle\frac{1-p}{p}b_{0}^{n}$	$\displaystyle=\sum_{i=1}^{n}\left(1+\frac{b_{0}^{n}}{a_{0}^{n}R_{i}}\right)/\left(V_{i}-\frac{1}{a_{0}^{n}}-\frac{1}{b_{0}^{n}}\right)$
	$\displaystyle\frac{1}{zp}b_{0}^{n}-a_{0}^{n}$	$\displaystyle=\sum_{i=1}^{n}\left(1+\frac{a_{0}^{n}R_{i}}{b_{0}^{n}}\right)/\left(V_{i}-\frac{1}{a_{0}^{n}}-\frac{1}{b_{0}^{n}}\right).$

If we continue to the step $n+1$ , we add strictly positive terms to the right hand side, and hence we have strict inequalities

\begin{split}\frac{1-p}{p}b_{0}^{n}&<\sum_{i=1}^{n+1}\left(1+\frac{b_{0}^{n}}{a_{0}^{n}R_{i}}\right)/\left(V_{i}-\frac{1}{a_{0}^{n}}-\frac{1}{b_{0}^{n}}\right)\\ \frac{1}{zp}b_{0}^{n}-a_{0}^{n}&<\sum_{i=1}^{n+1}\left(1+\frac{a_{0}^{n}R_{i}}{b_{0}^{n}}\right)/\left(V_{i}-\frac{1}{a_{0}^{n}}-\frac{1}{b_{0}^{n}}\right)\end{split}\tag{19}

and without loss of generality, we may assume that $V_{n+1}\geq\min_{i}V_{i}$ so that $V_{n+1}>\frac{1}{a_{0}^{n}}+\frac{1}{b_{0}^{n}}$ . Otherwise, we can suitably reorder the terms and apply the inductive hypothesis.

Define the functions $f_{1},f_{2}$ by

	$\displaystyle f_{1}(a_{0},b_{0})$	$\displaystyle=\frac{1-p}{p}b_{0}-\sum_{i=1}^{n+1}\left(1+\frac{b_{0}}{a_{0}R_{i}}\right)/\left(V_{i}-\frac{1}{a_{0}}-\frac{1}{b_{0}}\right)$
	$\displaystyle f_{2}(a_{0},b_{0})$	$\displaystyle=\frac{1}{zp}b_{0}-a_{0}-\sum_{i=1}^{n+1}\left(1+\frac{a_{0}R_{i}}{b_{0}}\right)/\left(V_{i}-\frac{1}{a_{0}}-\frac{1}{b_{0}}\right).$

The remaining work is focused on finding the $a_{0}^{n+1},b_{0}^{n+1}$ such that $f_{1}(a_{0}^{n+1},b_{0}^{n+1})=f_{2}(a_{0}^{n+1},b_{0}^{n+1})=0$ , and is separated into two steps:

Step 1

Show that we can find points $(a_{0}^{1},b_{0}^{1})$ , $(a_{0}^{2},b_{0}^{2})$ , $(a_{0}^{3},b_{0}^{3})$ with

f_{1}(a_{0}^{1},b_{0}^{1})>0,\quad f_{2}(a_{0}^{1},b_{0}^{1})<0,

f_{1}(a_{0}^{2},b_{0}^{2})>0,\quad f_{2}(a_{0}^{2},b_{0}^{2})>0,

and

f_{1}(a_{0}^{3},b_{0}^{3})<0,\quad f_{2}(a_{0}^{3},b_{0}^{3})>0.

These points, along with $(a_{0}^{n},b_{0}^{n})$ from the inductive hypothesis, are shown in Figure 7; together the four points set up the continuation arguments used in Step 2.

Step 2

Show that, by continuity of the solution maps, either case above leads to the existence of $(a_{0}^{n+1},b_{0}^{n+1})$ simultaneously satisfying $f_{1}=f_{2}=0$ .

Step 1 proof:

First, we observe that

\lim_{a_{0}\uparrow\infty}f_{1}(a_{0},b_{0})=\frac{1-p}{p}b_{0}-\sum_{i=1}^{n+1}\frac{1}{V_{i}-\frac{1}{b_{0}}}.

As long as we take $b_{0}^{1}>2\max\left(\frac{1}{V_{\min}},\frac{p}{1-p}2\sum_{i=1}^{n+1}\frac{1}{V_{i}}\right)$ where $V_{\min}=\min_{i}V_{i}$ , each denominator satisfies $V_{i}-\frac{1}{b_{0}^{1}}>\frac{V_{i}}{2}$ , so the sum is bounded above by $2\sum_{i=1}^{n+1}\frac{1}{V_{i}}$ and the limit is positive. We can then find a large enough value $a_{0}^{1}$ satisfying $f_{1}(a_{0}^{1},b_{0}^{1})>0$ . Next, we have

\lim_{a_{0}\uparrow\infty}f_{2}(a_{0},b_{0})=-\infty

for any $b_{0}$ , so in particular we can select a large enough $a_{0}^{1}$ with $f_{2}(a_{0}^{1},b_{0}^{1})<0$ and $f_{1}(a_{0}^{1},b_{0}^{1})>0$ . This gives us a point in the lower-right quadrant of Figure 7.

Now we observe that

\lim_{b_{0}\uparrow\infty}f_{2}(a_{0},b_{0})=\infty

along the path $a_{0}=\frac{1}{2}b_{0}$ . Along this same path, we have

\lim_{b_{0}\uparrow\infty}f_{1}(a_{0},b_{0})=\infty

as well. We can thus select a large enough value $b_{0}^{2}$ and $a_{0}^{2}=\frac{1}{2}b_{0}^{2}$ for which $f_{1}(a_{0}^{2},b_{0}^{2})>0$ and $f_{2}(a_{0}^{2},b_{0}^{2})>0$ . This gives us the point in the upper-right quadrant of Figure 7.

Next, consider $0<\epsilon\ll 1$ and take

b_{0}=\frac{1}{\epsilon^{2}},\quad a_{0}=\frac{1}{V_{\min}-\epsilon-\epsilon^{2}}

With these definitions, we have

f_{2}(\epsilon)\geq\frac{1}{zp\epsilon^{2}}-\frac{1}{V_{\min}-\epsilon-\epsilon^{2}}-(n+1)\left(1+\frac{\epsilon^{2}R_{i}}{V_{\min}-\epsilon-\epsilon^{2}}\right)\frac{1}{\epsilon}>0

f_{1}(\epsilon)<\frac{1-p}{p\epsilon^{2}}-\sum_{i=1}^{n+1}\left(1+\frac{V_{\min}-\epsilon-\epsilon^{2}}{\epsilon^{2}}\right)\frac{1}{\epsilon}<0

for small $\epsilon$ . Thus for $0<\epsilon\ll 1$ we get a point $(a_{0}^{3},b_{0}^{3})$ with $f_{2}>0$ and $f_{1}<0$ . This gives us a point in the upper left quadrant of Figure 7.

Finally, by the inductive hypothesis, we have $f_{1}(a_{0}^{n},b_{0}^{n})<0$ and $f_{2}(a_{0}^{n},b_{0}^{n})<0$ , which gives us a point in the lower left quadrant of Figure 7.

Figure 7: Note the generic form $(a_{0},b_{0})$ is shorthand for $(f_{1}(a_{0},b_{0}),f_{2}(a_{0},b_{0}))$ . The point $(a_{0}^{n+1},b_{0}^{n+1})$ serves as the desired solution point to complete the proof.

Step 2 Proof:

To show that there is a point $(a_{0}^{n+1},b_{0}^{n+1})$ such that the inequalities (19) become equalities, we create separate interpolations relying on the intermediate value theorem (IVT) Rudin et al. [1964].

Consider the points $(a_{0}^{n},b_{0}^{n})$ and $(a_{0}^{1},b_{0}^{1})$ and the convex combination

p_{\lambda}=\lambda(a_{0}^{n},b_{0}^{n})+(1-\lambda)(a_{0}^{1},b_{0}^{1})

We have $f_{1}(p_{1})<0,f_{1}(p_{0})>0$ , so by the IVT there is a $\lambda\in(0,1)$ with $f_{1}(p_{\lambda})=0$ .

If $f_{2}(p_{\lambda})>0$ , we proceed to Case 1 below. If $f_{2}(p_{\lambda})<0$ , we have a point of intersection below the $f_{1}$ axis, as shown in Figure 8. We then apply IVT to $(a_{0}^{3},b_{0}^{3})$ and $(a_{0}^{2},b_{0}^{2})$ . If the crossing point obtained from the IVT is above the $f_{1}$ axis, we proceed to Case 2 below, and if it is below the $f_{1}$ axis, we proceed to Case 1 below.

Case 1:

$f_{2}(p_{\lambda})>0$ or IVT applied to $(a_{0}^{3},b_{0}^{3})$ and $(a_{0}^{2},b_{0}^{2})$ yields a point below the $f_{1}$ axis. In either case, applying the IVT twice, we obtain two points along the $f_{1}$ axis, as shown in Figure 9, with opposite signs along $f_{1}$ . The constraint $f_{2}=0$ is easily incorporated into $f_{1}$ , which becomes

	$\displaystyle f_{3}(a_{0},b_{0})$	$\displaystyle=(1-p)z\left(a_{0}+\sum_{i=1}^{n+1}\left(1+\frac{a_{0}R_{i}}{b_{0}}\right)/\left(V_{i}-\frac{1}{a_{0}}-\frac{1}{b_{0}}\right)\right)$
		$\displaystyle-\sum_{i=1}^{n+1}\left(1+\frac{b_{0}}{a_{0}R_{i}}\right)/\left(V_{i}-\frac{1}{a_{0}}-\frac{1}{b_{0}}\right)$

Clearly $f_{3}$ has opposite signs for the two points of intersection in Figure 9, and applying IVT again we find the point $(a_{0}^{n+1},b_{0}^{n+1})$ .

Case 2:

In this case, we have successfully found two points with $f_{1}=0$ , and opposite signs with respect to $f_{2}$ . Just as in the previous case, we can explicitly incorporate the constraint $f_{1}=0$ into $f_{2}$ , to obtain

	$\displaystyle f_{4}(a_{0},b_{0})$	$\displaystyle=\frac{1}{z(1-p)}\sum_{i=1}^{n+1}\left(1+\frac{b_{0}}{a_{0}R_{i}}\right)/\left(V_{i}-\frac{1}{a_{0}}-\frac{1}{b_{0}}\right)$
		$\displaystyle-a_{0}-\sum_{i=1}^{n+1}\left(1+\frac{a_{0}R_{i}}{b_{0}}\right)/\left(V_{i}-\frac{1}{a_{0}}-\frac{1}{b_{0}}\right)$

Clearly $f_{4}$ has opposite signs now for the two crossing points shown on the $f_{2}$ axis of Figure 8, and by IVT, we have existence of $(a_{0}^{n+1},b_{0}^{n+1})$ .

Figure 8: The points and their associated continuous deformations (the colored lines) according to Case 1. Note the generic form $(a_{0},b_{0})$ is shorthand for $(f_{1}(a_{0},b_{0}),f_{2}(a_{0},b_{0}))$ .

Figure 9: The points and their associated continuous deformations (the colored lines) according to Case 2 in the $f_{1}-f_{2}$ plane. Note the generic form $(a_{0},b_{0})$ is shorthand for $(f_{1}(a_{0},b_{0}),f_{2}(a_{0},b_{0}))$ .

8.5 Proof of Theorem 5.3

To prove the result, we simplify and rewrite the equations

	$\displaystyle\left(V-\frac{1}{a_{0}}+\frac{1}{b_{0}}\right)\frac{1-p}{p}b_{0}$	$\displaystyle=\sum_{i=1}^{n}\left(\frac{b_{0}}{a_{0}R_{i}}-1\right)=\frac{b_{0}}{a_{0}}\sum_{i=1}^{n}\frac{1}{R_{i}}-n=\frac{b_{0}}{a_{0}}r_{1}-n$
	$\displaystyle\left(V-\frac{1}{a_{0}}+\frac{1}{b_{0}}\right)\left(\frac{1}{zp}b_{0}-a_{0}\right)$	$\displaystyle=\sum_{i=1}^{n}\left(1-\frac{a_{0}R_{i}}{b_{0}}\right)=n-\frac{a_{0}}{b_{0}}\sum_{i=1}^{n}R_{i}=n-\frac{a_{0}}{b_{0}}r_{2}$

Dividing the equations we obtain

\frac{\frac{b_{0}}{a_{0}}r_{1}-n}{n-\frac{a_{0}}{b_{0}}r_{2}}=\frac{\frac{1-p}{p}b_{0}}{\frac{1}{zp}b_{0}-a_{0}}=\frac{\frac{1-p}{p}}{\frac{1}{zp}-\frac{a_{0}}{b_{0}}}

Defining now $c=\frac{a_{0}}{b_{0}}$ we have

\frac{\frac{r_{1}}{c}-n}{n-cr_{2}}=\frac{\frac{1-p}{p}}{\frac{1}{zp}-c}

with the inherited constraint that $c<1$ , since $a_{0}<b_{0}$ by definition. Multiplying by $c$ we have

\frac{r_{1}-nc}{n-r_{2}c}=\frac{(1-p)zc}{1-pzc}

The solution is given by

c=\frac{n(z-pz+1)+pr_{1}z\pm\sqrt{D}}{2z(np+(1-p)r_{2})}

where

D=(n(pz-z-1)-r_{1}zp)^{2}-4r_{1}(nzp+r_{2}z-r_{2}pz).

By inspection the coefficient for $n^{2}$ is given by

(pz-z-1)^{2},

which is always positive. The coefficient for $n$ is given by

(-2(pz-z-1)-4)zpr_{1}=2zpr_{1}(z-pz-1)

This term is non-negative exactly when $(1-p)z\geq 1$ . Finally, the constant term is given by

r_{1}^{2}z^{2}p^{2}+4r_{1}r_{2}z(p-1).

The condition for when this term is non-negative can also be written in terms of $(1-p)z$ :

(1-p)z\geq\left(\frac{1-p}{p}\right)^{2}4r_{2}.

It is easy to find a counter-example when these conditions are violated, and where $D<0$ .

n=2,\quad p=0.1,\quad z=1.1,\quad r_{1}=31.9,\quad r_{2}=1.

This yields $D=-33.4$ , so there is no solution. In this case, $(1-p)z=0.99$ , while $\frac{(1-p)^{2}}{p^{2}}4r_{2}=324$ , so both inequalities are violated, the latter significantly. The result is easily achievable, since we just want to find $\alpha$ with

\frac{2}{1-\alpha}+\frac{2}{\alpha}=31.9.

This gives us

R_{1}=.9328,R_{2}=.0672.
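The sign of the discriminant in this proof is easy to evaluate directly. The sketch below is a hedged illustration with hypothetical function names: it codes the expression for $D$ given above and reports only whether a real ratio $c$ exists, since the sign is all the existence argument uses.

```python
def rr_discriminant(n, p, z, r1, r2):
    """D = (n(pz - z - 1) - r1*z*p)^2 - 4*r1*(n*z*p + r2*z - r2*p*z)."""
    linear = n * (p * z - z - 1.0) - r1 * z * p
    return linear**2 - 4.0 * r1 * (n * z * p + r2 * z - r2 * p * z)


def has_real_ratio(n, p, z, r1, r2):
    """True when the quadratic for c = a0 / b0 has a real root."""
    return rr_discriminant(n, p, z, r1, r2) >= 0.0


# Inputs of the counter-example stated above.
print(has_real_ratio(n=2, p=0.1, z=1.1, r1=31.9, r2=1.0))

# An illustrative case (not from the paper) where a real root exists.
print(has_real_ratio(n=2, p=0.5, z=3.0, r1=4.0, r2=1.0))
```

A real root is necessary but not sufficient for a usable solution: the root must also satisfy $0<c<1$ and lead to positive pseudo-counts.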