-
Estimation methods for estimands using the treatment policy strategy; a simulation study based on the PIONEER 1 Trial
Authors:
James Bell,
Thomas Drury,
Tobias Mütze,
Christian Bressen Pipper,
Lorenzo Guizzaro,
Marian Mitroiu,
Khadija Rerhou Rantell,
Marcel Wolbers,
David Wright
Abstract:
Estimands using the treatment policy strategy for addressing intercurrent events are common in Phase III clinical trials. One estimation approach for this strategy is retrieved dropout whereby observed data following an intercurrent event are used to multiply impute missing data. However, such methods have had issues with variance inflation and model fitting due to data sparsity. This paper introduces likelihood-based versions of these approaches, investigating and comparing their statistical properties to the existing retrieved dropout approaches, simpler analysis models and reference-based multiple imputation. We use a simulation based upon the data from the PIONEER 1 Phase III clinical trial in Type II diabetics to present complex and relevant estimation challenges. The likelihood-based methods display similar statistical properties to their multiple imputation equivalents, but all retrieved dropout approaches suffer from high variance. Retrieved dropout approaches appear less biased than reference-based approaches, resulting in a bias-variance trade-off, but we conclude that the large degree of variance inflation is often more problematic than the bias. Therefore, only the simpler retrieved dropout models appear appropriate as a primary analysis in a clinical trial, and only where it is believed most data following intercurrent events will be observed. The jump-to-reference approach may represent a more promising estimation approach for symptomatic treatments due to its relatively high power and ability to fit in the presence of much missing data, despite its strong assumptions and tendency towards conservative bias. More research is needed to further develop how to estimate the treatment effect for a treatment policy strategy.
Submitted 20 February, 2024;
originally announced February 2024.
-
Modeling the Machine Learning Multiverse
Authors:
Samuel J. Bell,
Onno P. Kampman,
Jesse Dodge,
Neil D. Lawrence
Abstract:
Amid mounting concern about the reliability and credibility of machine learning research, we present a principled framework for making robust and generalizable claims: the multiverse analysis. Our framework builds upon the multiverse analysis (Steegen et al., 2016) introduced in response to psychology's own reproducibility crisis. To efficiently explore high-dimensional and often continuous ML search spaces, we model the multiverse with a Gaussian Process surrogate and apply Bayesian experimental design. Our framework is designed to facilitate drawing robust scientific conclusions about model performance, and thus our approach focuses on exploration rather than conventional optimization. In the first of two case studies, we investigate disputed claims about the relative merit of adaptive optimizers. Second, we synthesize conflicting research on the effect of learning rate on the large batch training generalization gap. For the machine learning community, the multiverse analysis is a simple and effective technique for identifying robust claims, for increasing transparency, and a step toward improved reproducibility.
Submitted 12 October, 2022; v1 submitted 13 June, 2022;
originally announced June 2022.
-
Iterative Construction of Gaussian Process Surrogate Models for Bayesian Inference
Authors:
Leen Alawieh,
Jonathan Goodman,
John B. Bell
Abstract:
A new algorithm is developed to tackle the issue of sampling non-Gaussian model parameter posterior probability distributions that arise from solutions to Bayesian inverse problems. The algorithm aims to mitigate some of the hurdles faced by traditional Markov Chain Monte Carlo (MCMC) samplers by constructing proposal probability densities that are both easy to sample and a better approximation to the target density than a simple Gaussian proposal distribution would be. To achieve that, a Gaussian proposal distribution is augmented with a Gaussian Process (GP) surface that helps capture non-linearities in the log-likelihood function. In order to train the GP surface, an iterative approach is adopted for the optimal selection of points in parameter space. Optimality is sought by maximizing the information gain of the GP surface using a minimum number of forward model simulation runs. The accuracy of the GP-augmented surface approximation is assessed in two ways. The first consists of comparing predictions obtained from the approximate surface with those obtained through running the actual simulation model at hold-out points in parameter space. The second consists of a measure based on the relative variance of sample weights obtained from sampling the approximate posterior probability distribution of the model parameters. The efficacy of this new algorithm is tested on inferring reaction rate parameters in 3-node and 6-node network toy problems, which imitate idealized reaction networks in combustion applications.
Submitted 17 November, 2019;
originally announced November 2019.
-
Private Protocols for U-Statistics in the Local Model and Beyond
Authors:
James Bell,
Aurélien Bellet,
Adrià Gascón,
Tejas Kulkarni
Abstract:
In this paper, we study the problem of computing $U$-statistics of degree $2$, i.e., quantities that come in the form of averages over pairs of data points, in the local model of differential privacy (LDP). The class of $U$-statistics covers many statistical estimates of interest, including Gini mean difference, Kendall's tau coefficient and Area under the ROC Curve (AUC), as well as empirical risk measures for machine learning problems such as ranking, clustering and metric learning. We first introduce an LDP protocol based on quantizing the data into bins and applying randomized response, which guarantees an $ε$-LDP estimate with a Mean Squared Error (MSE) of $O(1/\sqrt{n}ε)$ under regularity assumptions on the $U$-statistic or the data distribution. We then propose a specialized protocol for AUC based on a novel use of hierarchical histograms that achieves MSE of $O(α^3/nε^2)$ for arbitrary data distributions. We also show that 2-party secure computation allows the design of a protocol with MSE of $O(1/nε^2)$, without any assumption on the kernel function or data distribution and with total communication linear in the number of users $n$. Finally, we evaluate the performance of our protocols through experiments on synthetic and real datasets.
Submitted 2 March, 2020; v1 submitted 9 October, 2019;
originally announced October 2019.
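To make "averages over pairs of data points" concrete, the following is a minimal, non-private sketch of a degree-2 $U$-statistic, using the Gini mean difference as the kernel. It is an illustration only and not any of the protocols described in the abstract; the function names `u_statistic` and `gini_mean_difference` are hypothetical.

```python
from itertools import combinations


def u_statistic(data, kernel):
    """Average a symmetric two-argument kernel over all unordered pairs."""
    pairs = list(combinations(data, 2))
    if not pairs:
        raise ValueError("need at least two data points")
    return sum(kernel(x, y) for x, y in pairs) / len(pairs)


def gini_mean_difference(data):
    """Gini mean difference: the U-statistic with kernel |x - y|."""
    return u_statistic(data, lambda x, y: abs(x - y))


if __name__ == "__main__":
    # Pairs of [1, 2, 4] are (1,2), (1,4), (2,4) with gaps 1, 3, 2.
    print(gini_mean_difference([1, 2, 4]))
```

A private protocol has to estimate this same pairwise average without any party seeing the raw pairs, which is where the error bounds quoted in the abstract come in.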
-
Differentially Private Summation with Multi-Message Shuffling
Authors:
Borja Balle,
James Bell,
Adria Gascon,
Kobbi Nissim
Abstract:
In recent work, Cheu et al. (Eurocrypt 2019) proposed a protocol for $n$-party real summation in the shuffle model of differential privacy with $O_{ε, δ}(1)$ error and $Θ(ε\sqrt{n})$ one-bit messages per party. In contrast, every local model protocol for real summation must incur error $Ω(1/\sqrt{n})$, and there exist protocols matching this lower bound which require just one bit of communication per party. Whether this gap in number of messages is necessary was left open by Cheu et al.
In this note we show a protocol with $O(1/ε)$ error and $O(\log(n/δ))$ messages of size $O(\log(n))$ per party. This protocol is based on the work of Ishai et al. (FOCS 2006) showing how to implement distributed summation from secure shuffling, and the observation that this allows simulating the Laplace mechanism in the shuffle model.
Submitted 21 August, 2019; v1 submitted 20 June, 2019;
originally announced June 2019.
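For orientation, here is a minimal sketch of the Laplace mechanism for summation in the central (trusted curator) setting, which is the mechanism the abstract says can be simulated in the shuffle model. It is not the shuffle protocol itself. It assumes each input is clipped to $[0, 1]$, so that replacing one input changes the sum by at most 1; the names `laplace_noise` and `private_sum` are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two iid
    exponential variables, each with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_sum(values, epsilon, rng=None):
    """Sum of values clipped to [0, 1] plus Laplace noise of scale 1/epsilon.

    Assumption: clipping bounds the sensitivity of the sum by 1, so the
    expected absolute error is 1/epsilon regardless of the number of inputs.
    """
    rng = rng or random.Random()
    clipped = [min(1.0, max(0.0, v)) for v in values]
    return sum(clipped) + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    print(private_sum([0.2, 0.9, 0.4, 0.7], epsilon=1.0, rng=rng))
```

The noise scale depends only on $ε$, not on $n$, which matches the $O(1/ε)$ error quoted above.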
-
The Privacy Blanket of the Shuffle Model
Authors:
Borja Balle,
James Bell,
Adria Gascon,
Kobbi Nissim
Abstract:
This work studies differential privacy in the context of the recently proposed shuffle model. Unlike in the local model, where the server collecting privatized data from users can track back an input to a specific user, in the shuffle model users submit their privatized inputs to a server anonymously. This setup yields a trust model which sits in between the classical curator and local models for differential privacy. The shuffle model is the core idea in the Encode, Shuffle, Analyze (ESA) model introduced by Bittau et al. (SOSP 2017). Recent work by Cheu et al. (EUROCRYPT 2019) analyzes the differential privacy properties of the shuffle model and shows that in some cases shuffled protocols provide strictly better accuracy than local protocols. Additionally, Erlingsson et al. (SODA 2019) provide a privacy amplification bound quantifying the level of curator differential privacy achieved by the shuffle model in terms of the local differential privacy of the randomizer used by each user. In this context, we make three contributions. First, we provide an optimal single message protocol for summation of real numbers in the shuffle model. Our protocol is very simple and has better accuracy and communication than the protocols for this same problem proposed by Cheu et al. Optimality of this protocol follows from our second contribution, a new lower bound for the accuracy of private protocols for summation of real numbers in the shuffle model. The third contribution is a new amplification bound for analyzing the privacy of protocols in the shuffle model in terms of the privacy provided by the corresponding local randomizer. Our amplification bound generalizes the results by Erlingsson et al. to a wider range of parameters, and provides a whole family of methods to analyze privacy amplification in the shuffle model.
Submitted 2 June, 2019; v1 submitted 7 March, 2019;
originally announced March 2019.
-
Iterative importance sampling algorithms for parameter estimation
Authors:
Matthias Morzfeld,
Marcus S. Day,
Ray W. Grout,
George Shu Heng Pau,
Stefan A. Finsterle,
John B. Bell
Abstract:
In parameter estimation problems one computes a posterior distribution over uncertain parameters defined jointly by a prior distribution, a model, and noisy data. Markov Chain Monte Carlo (MCMC) is often used for the numerical solution of such problems. An alternative to MCMC is importance sampling, which can exhibit near perfect scaling with the number of cores on high performance computing systems because samples are drawn independently. However, finding a suitable proposal distribution is a challenging task. Several sampling algorithms have been proposed over the past years that take an iterative approach to constructing a proposal distribution. We investigate the applicability of such algorithms by applying them to two realistic and challenging test problems, one in subsurface flow, and one in combustion modeling. More specifically, we implement importance sampling algorithms that iterate over the mean and covariance matrix of Gaussian or multivariate t-proposal distributions. Our implementation leverages massively parallel computers, and we present strategies to initialize the iterations using "coarse" MCMC runs or Gaussian mixture models.
Submitted 14 November, 2017; v1 submitted 5 August, 2016;
originally announced August 2016.
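The idea of iterating over the mean and covariance of a Gaussian proposal can be sketched in one dimension as follows. This is a toy illustration under stated assumptions (a scalar parameter, self-normalized weights, moment matching at each iteration) and not the authors' parallel implementation; `iterate_proposal` and `log_normal_pdf` are hypothetical names.

```python
import math
import random


def log_normal_pdf(x, mean, sd):
    """Log density of N(mean, sd^2) at x."""
    return (-0.5 * ((x - mean) / sd) ** 2
            - math.log(sd) - 0.5 * math.log(2 * math.pi))


def iterate_proposal(log_target, mean, sd, n_samples=2000, n_iters=5, rng=None):
    """Repeatedly refit a Gaussian proposal to importance-weighted samples.

    `log_target` is an unnormalized log posterior density.
    """
    rng = rng or random.Random()
    for _ in range(n_iters):
        # Samples are drawn independently, so this step parallelizes trivially.
        xs = [rng.gauss(mean, sd) for _ in range(n_samples)]
        log_w = [log_target(x) - log_normal_pdf(x, mean, sd) for x in xs]
        shift = max(log_w)  # subtract the max for numerical stability
        w = [math.exp(lw - shift) for lw in log_w]
        total = sum(w)
        w = [wi / total for wi in w]  # self-normalized weights
        # Moment matching: weighted mean and variance define the next proposal.
        mean = sum(wi * x for wi, x in zip(w, xs))
        var = sum(wi * (x - mean) ** 2 for wi, x in zip(w, xs))
        sd = max(math.sqrt(var), 1e-6)  # floor guards against collapse
    return mean, sd


if __name__ == "__main__":
    # Toy target: unnormalized N(3, 0.5^2); start from a poor proposal N(0, 2^2).
    target = lambda x: -0.5 * ((x - 3.0) / 0.5) ** 2
    print(iterate_proposal(target, 0.0, 2.0, rng=random.Random(1)))
```

In higher dimensions the same update applies with a weighted mean vector and covariance matrix, and the quality of the starting proposal matters much more, which is why the abstract discusses initialization strategies.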
-
Detecting mutations in mixed sample sequencing data using empirical Bayes
Authors:
Omkar Muralidharan,
Georges Natsoulis,
John Bell,
Hanlee Ji,
Nancy R. Zhang
Abstract:
We develop statistically based methods to detect single nucleotide DNA mutations in next generation sequencing data. Sequencing generates counts of the number of times each base was observed at hundreds of thousands to billions of genome positions in each sample. Using these counts to detect mutations is challenging because mutations may have very low prevalence and sequencing error rates vary dramatically by genome position. The discreteness of sequencing data also creates a difficult multiple testing problem: current false discovery rate methods are designed for continuous data, and work poorly, if at all, on discrete data. We show that a simple randomization technique lets us use continuous false discovery rate methods on discrete data. Our approach is a useful way to estimate false discovery rates for any collection of discrete test statistics, and is hence not limited to sequencing data. We then use an empirical Bayes model to capture different sources of variation in sequencing error rates. The resulting method outperforms existing detection approaches on example data sets.
Submitted 28 September, 2012;
originally announced September 2012.
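One standard way to make a discrete test statistic usable with continuous false discovery rate methods is the randomized p-value, sketched below for a binomial count. The abstract does not spell out its randomization, so this should be read as an assumption about the general idea rather than the authors' exact procedure; `binom_pmf` and `randomized_p_value` are hypothetical names.

```python
import math
import random


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def randomized_p_value(t, n, p, u):
    """Upper-tail randomized p-value P(T > t) + u * P(T = t).

    With u drawn from Uniform(0, 1) independently of the data, this value
    is exactly uniform under the null hypothesis T ~ Binomial(n, p).
    """
    tail_above = sum(binom_pmf(k, n, p) for k in range(t + 1, n + 1))
    return tail_above + u * binom_pmf(t, n, p)


if __name__ == "__main__":
    rng = random.Random(0)
    # Example: 9 observed successes out of 10 under a null rate of 0.5.
    print(randomized_p_value(9, 10, 0.5, rng.random()))
```

Setting `u = 1` recovers the usual conservative p-value $P(T \ge t)$, so the randomization only spreads the probability mass at the observed value.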