-
Urban mapping in Dar es Salaam using AJIVE
Authors:
Rachel J. Carrington,
Ian L. Dryden,
Madeleine Ellis,
James O. Goulding,
Simon P. Preston,
David J. Sirl
Abstract:
Mapping deprivation in urban areas is important, for example for identifying areas of greatest need and planning interventions. Traditional ways of obtaining deprivation estimates are based on either census or household survey data, which in many areas is unavailable or difficult to collect. However, there has been a huge rise in the amount of new, non-traditional forms of data, such as satellite imagery and cell-phone call-record data, which may contain information useful for identifying deprivation. We use Angle-Based Joint and Individual Variation Explained (AJIVE) to jointly model satellite imagery data, cell-phone data, and survey data for the city of Dar es Salaam, Tanzania. We first identify interpretable low-dimensional structure from the imagery and cell-phone data, and find that we can use these to identify deprivation. We then consider what is gained from further incorporating the more traditional and costly survey data. We also introduce a scalar measure of deprivation as a response variable to be predicted, and consider various approaches to multiview regression, including using AJIVE scores as predictors.
Submitted 13 March, 2024;
originally announced March 2024.
-
Pareto Optimal Learning for Estimating Large Language Model Errors
Authors:
Theodore Zhao,
Mu Wei,
J. Samuel Preston,
Hoifung Poon
Abstract:
Large Language Models (LLMs) have shown impressive abilities in many applications. When a concrete and precise answer is desired, it is important to have a quantitative estimation of the potential error rate. However, this can be challenging due to the text-in-text-out nature of generative models. We present a method based on Pareto optimization that generates a risk score to estimate the probability of error in an LLM response by integrating multiple sources of information. We prove theoretically that the error estimator optimized in our framework aligns with the LLM and the information sources in a Pareto optimal manner. Experimental results show that the risk scores estimated by our method are well correlated with the true LLM error rate, thus facilitating error correction. By dynamically combining with prompting strategies such as self-verification and information retrieval, we demonstrate the proposed method can be utilized to increase the performance of an LLM, surpassing state-of-the-art task-specific models.
Submitted 22 May, 2024; v1 submitted 28 June, 2023;
originally announced June 2023.
-
Empirical quantification of predictive uncertainty due to model discrepancy by training with an ensemble of experimental designs: an application to ion channel kinetics
Authors:
Joseph G. Shuttleworth,
Chon Lok Lei,
Dominic G. Whittaker,
Monique J. Windley,
Adam P. Hill,
Simon P. Preston,
Gary R. Mirams
Abstract:
When mathematical biology models are used to make quantitative predictions for clinical or industrial use, it is important that these predictions come with a reliable estimate of their accuracy (uncertainty quantification). Because models of complex biological systems are always large simplifications, model discrepancy arises - where a mathematical model fails to recapitulate the true data generating process. This presents a particular challenge for making accurate predictions, and especially for making accurate estimates of uncertainty in these predictions. Experimentalists and modellers must choose which experimental procedures (protocols) are used to produce data to train their models. We propose to characterise uncertainty owing to model discrepancy with an ensemble of parameter sets, each of which results from training to data from a different protocol. The variability in predictions from this ensemble provides an empirical estimate of predictive uncertainty owing to model discrepancy, even for unseen protocols. We use the example of electrophysiology experiments, which are used to investigate the kinetics of the hERG potassium ion channel. Here, 'information-rich' protocols allow mathematical models to be trained using numerous short experiments performed on the same cell. Typically, assuming independent observational errors and training a model to an individual experiment results in parameter estimates with very little dependence on observational noise. Moreover, parameter sets arising from the same model applied to different experiments often conflict - indicative of model discrepancy. Our methods will help select more suitable mathematical models of hERG for future studies, and will be widely applicable to a range of biological modelling problems.
Submitted 19 February, 2024; v1 submitted 6 February, 2023;
originally announced February 2023.
-
The Bayesian Spatial Bradley--Terry Model: Urban Deprivation Modeling in Tanzania
Authors:
R. G. Seymour,
D. Sirl,
S. Preston,
I. L. Dryden,
M. J. A. Ellis,
B. Perrat,
J. Goulding
Abstract:
Identifying the most deprived regions of any country or city is key if policy makers are to design successful interventions. However, locating areas with the greatest need is often surprisingly challenging in developing countries. Due to the logistical challenges of traditional household surveying, official statistics can be slow to be updated; estimates that exist can be coarse, a consequence of prohibitive costs and poor infrastructures; and mass urbanisation can render manually surveyed figures rapidly out-of-date. Comparative judgement models, such as the Bradley--Terry model, offer a promising solution. Leveraging local knowledge, elicited via comparisons of different areas' affluence, such models can both simplify logistics and circumvent biases inherent to household surveys. Yet widespread adoption remains limited, due to the large amount of data existing approaches still require. We address this via development of a novel Bayesian Spatial Bradley--Terry model, which substantially decreases the amount of data comparisons required for effective inference. This model integrates a network representation of the city or country, along with assumptions of spatial smoothness that allow deprivation in one area to be informed by neighbouring areas. We demonstrate the practical effectiveness of this method, through a novel comparative judgement data set collected in Dar es Salaam, Tanzania.
Submitted 28 October, 2021; v1 submitted 27 October, 2020;
originally announced October 2020.
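As a rough illustration of the comparative judgement idea in the abstract above, the following sketch implements the standard (non-spatial, non-Bayesian) Bradley--Terry model only. The function names and the log-scale "quality" parameterisation are assumptions made for illustration; the spatial smoothness prior described in the abstract is not included.

```python
import math


def bt_win_probability(quality_i, quality_j):
    """Probability that area i is judged above area j.

    Standard Bradley-Terry form with log-scale quality parameters:
    P(i beats j) = exp(q_i) / (exp(q_i) + exp(q_j)).
    """
    return 1.0 / (1.0 + math.exp(-(quality_i - quality_j)))


def bt_log_likelihood(qualities, comparisons):
    """Log-likelihood of pairwise outcomes.

    qualities:   dict mapping area name -> log-scale quality (hypothetical values)
    comparisons: iterable of (winner, loser) pairs
    """
    total = 0.0
    for winner, loser in comparisons:
        total += math.log(bt_win_probability(qualities[winner], qualities[loser]))
    return total
```

Parameter values that agree with the observed comparisons give a higher log-likelihood, which is the quantity a fitting procedure would maximise or combine with a prior.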
-
Non-parametric regression for networks
Authors:
Katie E. Severn,
Ian L. Dryden,
Simon P. Preston
Abstract:
Network data are becoming increasingly available, and so there is a need to develop suitable methodology for statistical analysis. Networks can be represented as graph Laplacian matrices, which are a type of manifold-valued data. Our main objective is to estimate a regression curve from a sample of graph Laplacian matrices conditional on a set of Euclidean covariates, for example in dynamic networks where the covariate is time. We develop an adapted Nadaraya-Watson estimator which has uniform weak consistency for estimation using Euclidean and power Euclidean metrics. We apply the methodology to the Enron email corpus to model smooth trends in monthly networks and highlight anomalous networks. Another motivating application is given in corpus linguistics, which explores trends in an author's writing style over time based on word co-occurrence networks.
Submitted 30 September, 2020;
originally announced October 2020.
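A minimal sketch of kernel-weighted averaging of graph Laplacians, in the spirit of the Nadaraya-Watson estimator mentioned in the abstract above. It covers only the plain Euclidean metric with a Gaussian kernel; the function names, the kernel choice, and the bandwidth argument are assumptions for illustration, and the power Euclidean case from the abstract is not handled.

```python
import numpy as np


def graph_laplacian(adjacency):
    """Graph Laplacian L = D - A for a symmetric weighted adjacency matrix."""
    adjacency = np.asarray(adjacency, dtype=float)
    return np.diag(adjacency.sum(axis=1)) - adjacency


def nadaraya_watson_laplacian(t_query, times, laplacians, bandwidth):
    """Kernel-weighted average of Laplacians observed at the given times.

    Uses a Gaussian kernel in the covariate (time). Because the weights are
    non-negative and sum to one, the result is again a graph Laplacian.
    """
    times = np.asarray(times, dtype=float)
    laplacians = np.asarray(laplacians, dtype=float)  # shape (n, m, m)
    weights = np.exp(-0.5 * ((times - t_query) / bandwidth) ** 2)
    weights /= weights.sum()
    # Contract the weight vector against the first axis of the stack.
    return np.tensordot(weights, laplacians, axes=1)
```

The estimate at a query time is dominated by networks observed at nearby times, with the bandwidth controlling how quickly the influence of distant observations decays.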
-
Invariance and identifiability issues for word embeddings
Authors:
Rachel Carrington,
Karthik Bharath,
Simon Preston
Abstract:
Word embeddings are commonly obtained as optimizers of a criterion function $f$ of a text corpus, but assessed on word-task performance using a different evaluation function $g$ of the test data. We contend that a possible source of disparity in performance on tasks is the incompatibility between classes of transformations that leave $f$ and $g$ invariant. In particular, word embeddings defined by $f$ are not unique; they are defined only up to a class of transformations to which $f$ is invariant, and this class is larger than the class to which $g$ is invariant. One implication of this is that the apparent superiority of one word embedding over another, as measured by word task performance, may largely be a consequence of the arbitrary elements selected from the respective solution sets. We provide a formal treatment of the above identifiability issue, present some numerical examples, and discuss possible resolutions.
Submitted 6 November, 2019;
originally announced November 2019.
-
Manifold valued data analysis of samples of networks, with applications in corpus linguistics
Authors:
Katie E. Severn,
Ian L. Dryden,
Simon P. Preston
Abstract:
Networks arise in many applications, such as in the analysis of text documents, social interactions and brain activity. We develop a general framework for extrinsic statistical analysis of samples of networks, motivated by networks representing text documents in corpus linguistics. We identify networks with their graph Laplacian matrices, for which we define metrics, embeddings, tangent spaces, and a projection from Euclidean space to the space of graph Laplacians. This framework provides a way of computing means, performing principal component analysis and regression, and carrying out hypothesis tests, such as for testing for equality of means between two samples of networks. We apply the methodology to the set of novels by Jane Austen and Charles Dickens.
Submitted 16 September, 2020; v1 submitted 21 February, 2019;
originally announced February 2019.
-
Quantifying Age and Model Uncertainties in Paleoclimate Data and Dynamical Climate Models with a Joint Inferential Analysis
Authors:
Jake Carson,
Michel Crucifix,
Simon P. Preston,
Richard D. Wilkinson
Abstract:
A major goal in paleoclimate science is to reconstruct historical climates using proxies for climate variables such as those observed in sediment cores, and in the process learn about climate dynamics. This is hampered by uncertainties in how sediment core depths relate to ages, how proxy quantities relate to climate variables, how climate models are specified, and the values of parameters in climate models. Quantifying these uncertainties is key in drawing well-founded conclusions. Analyses are often performed in separate stages with, for example, a sediment core's depth-age relation being estimated as stage one, then fed as an input to calibrate climate models as stage two. Here, we show that such "multi-stage" approaches can lead to misleading conclusions. We develop a joint inferential approach for climate reconstruction, model calibration, and age model estimation. We focus on the glacial-interglacial cycle over the past 780 kyr, analysing two sediment cores that span this range. Our age estimates are largely in agreement with previous studies, but our approach also provides the full joint specification of all uncertainties, estimation of model parameters, and the model evidence. By sampling plausible chronologies from the posterior distribution, we demonstrate that downstream scientific conclusions can differ greatly both between different sampled chronologies, and in comparison with conclusions obtained in the complete joint inferential analysis. We conclude that multi-stage analyses are insufficient when dealing with uncertainty, and that to draw sound conclusions the full joint inferential analysis must be performed.
Submitted 17 April, 2019; v1 submitted 22 March, 2018;
originally announced March 2018.
-
The extended power distribution: A new distribution on $(0, 1)$
Authors:
Chibueze E. Ogbonnaya,
Simon P. Preston,
Andrew T. A. Wood
Abstract:
We propose a two-parameter bounded probability distribution called the extended power distribution. This distribution on $(0, 1)$ is similar to the beta distribution; however, there are some advantages which we explore. We define the moments and quantiles of this distribution and show that it is possible to give an $r$-parameter extension of this distribution ($r>2$). We also consider its complementary distribution and show that it has some flexibility advantages over the Kumaraswamy and beta distributions. This distribution can be used as an alternative to the Kumaraswamy distribution since it has a closed form for its cumulative function. However, it can be fitted to data where there are some samples that are exactly equal to 1, unlike the Kumaraswamy and beta distributions which cannot be fitted to such data or may require some censoring. Applications considered show the extended power distribution performs favourably against the Kumaraswamy distribution in most cases.
Submitted 7 November, 2017;
originally announced November 2017.
-
Classification and clustering for observations of event time data using non-homogeneous Poisson process models
Authors:
Duncan Barrack,
Simon Preston
Abstract:
Data of the form of event times arise in various applications. A simple model for such data is a non-homogeneous Poisson process (NHPP) which is specified by a rate function that depends on time. We consider the problem of having access to multiple independent observations of event time data, observed on a common interval, from which we wish to classify or cluster the observations according to their rate functions. Each rate function is unknown but assumed to belong to a finite number of rate functions each defining a distinct class. We model the rate functions using a spline basis expansion, the coefficients of which need to be estimated from data. The classification approach consists of using training data for which the class membership is known, to calculate maximum likelihood estimates of the coefficients for each group, then assigning test observations to a group by a maximum likelihood criterion. For clustering, by analogy to the Gaussian mixture model approach for Euclidean data, we consider mixtures of NHPP and use the expectation-maximisation algorithm to estimate the coefficients of the rate functions for the component models and group membership probabilities for each observation. The classification and clustering approaches perform well on both synthetic and real-world data sets. Code associated with this paper is available at https://github.com/duncan-barrack/NHPP .
Submitted 20 June, 2018; v1 submitted 6 March, 2017;
originally announced March 2017.
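The classification rule described in the abstract above can be sketched as follows, assuming the rate functions are already known rather than estimated from a spline basis. The log-likelihood of event times under an NHPP on [0, T] is the sum of log-rates at the events minus the integral of the rate. The function names and the trapezoidal integration grid are assumptions for illustration and are not taken from the linked repository.

```python
import numpy as np


def nhpp_log_likelihood(event_times, rate_fn, t_end, n_grid=2001):
    """Log-likelihood of event times under an NHPP on [0, t_end].

    log L = sum_i log(rate(t_i)) - integral_0^t_end rate(t) dt,
    with the integral approximated by the trapezoidal rule.
    rate_fn must accept and return NumPy arrays.
    """
    event_times = np.asarray(event_times, dtype=float)
    grid = np.linspace(0.0, t_end, n_grid)
    values = rate_fn(grid)
    integral = np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(grid))
    return float(np.sum(np.log(rate_fn(event_times))) - integral)


def classify(event_times, rate_fns, t_end):
    """Assign an observation to the class whose rate maximises the likelihood."""
    scores = [nhpp_log_likelihood(event_times, f, t_end) for f in rate_fns]
    return int(np.argmax(scores))
```

For example, with one increasing and one decreasing candidate rate on [0, 1], an observation whose events cluster near the end of the interval would be assigned to the increasing-rate class.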
-
Nonparametric hypothesis testing for equality of means on the simplex
Authors:
Michail Tsagris,
Simon Preston,
Andrew T. A. Wood
Abstract:
In the context of data that lie on the simplex, we investigate use of empirical and exponential empirical likelihood, and Hotelling and James statistics, to test the null hypothesis of equal population means based on two independent samples. We perform an extensive numerical study using data simulated from various distributions on the simplex. The results, taken together with practical considerations regarding implementation, support the use of the bootstrap-calibrated James statistic.
Submitted 4 August, 2016; v1 submitted 27 July, 2016;
originally announced July 2016.
-
Bayesian model selection for the glacial-interglacial cycle
Authors:
Jake Carson,
Michel Crucifix,
Simon Preston,
Richard D. Wilkinson
Abstract:
A prevailing viewpoint in palaeoclimate science is that a single palaeoclimate record contains insufficient information to discriminate between most competing explanatory models. Results we present here suggest the contrary. Using SMC^2 combined with novel Brownian bridge type proposals for the state trajectories, we show that even with relatively short time series it is possible to estimate Bayes factors to sufficient accuracy to be able to select between competing models. The results show that Monte Carlo methodology and computer power have now advanced to the point where a full Bayesian analysis for a wide class of conceptual climate models is now possible. The results also highlight a problem with estimating the chronology of the climate record prior to further statistical analysis, a practice which is common in palaeoclimate science. Using two datasets based on the same record but with different estimated chronologies results in conflicting conclusions about the importance of the orbital forcing on the glacial cycle, and about the internal dynamics generating the glacial cycle, even though the difference between the two estimated chronologies is consistent with dating uncertainty. This highlights a need for chronology estimation and other inferential questions to be addressed in a joint statistical procedure.
Submitted 11 November, 2015;
originally announced November 2015.
-
Improved classification for compositional data using the $α$-transformation
Authors:
Michail Tsagris,
Simon Preston,
Andrew T. A. Wood
Abstract:
In compositional data analysis an observation is a vector containing non-negative values, only the relative sizes of which are considered to be of interest. Without loss of generality, a compositional vector can be taken to be a vector of proportions that sum to one. Data of this type arise in many areas including geology, archaeology, biology, economics and political science. In this paper we investigate methods for classification of compositional data. Our approach centres on the idea of using the $α$-transformation to transform the data and then to classify the transformed data via regularised discriminant analysis and the k-nearest neighbours algorithm. Using the $α$-transformation generalises two rival approaches in compositional data analysis, one (when $α=1$) that treats the data as though they were Euclidean, ignoring the compositional constraint, and another (when $α=0$) that employs Aitchison's centred log-ratio transformation. A numerical study with several real datasets shows that whether using $α=1$ or $α=0$ gives better classification performance depends on the dataset, and moreover that using an intermediate value of $α$ can sometimes give better performance than using either 1 or 0.
Submitted 17 June, 2015; v1 submitted 16 June, 2015;
originally announced June 2015.
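A minimal sketch of a power transformation with the two limiting cases named in the abstract above: at $α=1$ it is a linear function of the raw proportions, and as $α \to 0$ it tends to the centred log-ratio. The exact form used here, $(D\,u_α(x) - 1)/α$ with $u_α(x)_i = x_i^α / \sum_j x_j^α$, is an assumption chosen to reproduce those two limits; the published definition may additionally apply an orthonormal projection, which is omitted.

```python
import numpy as np


def alpha_transform(x, alpha):
    """Power transformation of a composition x (positive parts summing to one).

    alpha = 1 gives D * x - 1, a linear map of the raw data.
    alpha = 0 is defined as the limit, the centred log-ratio.
    """
    x = np.asarray(x, dtype=float)
    d = x.size
    if alpha == 0:
        log_x = np.log(x)
        return log_x - log_x.mean()
    powered = x ** alpha
    u = powered / powered.sum()  # power-transformed composition, stays in the simplex
    return (d * u - 1.0) / alpha
```

In a classification setting, the transformed vectors for a grid of $α$ values would each be passed to the classifier, and $α$ chosen by cross-validated performance.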
-
Fast Approximate Bayesian Computation for discretely observed Markov models using a factorised posterior distribution
Authors:
Simon R. White,
Theodore Kypraios,
Simon P. Preston
Abstract:
Many modern statistical applications involve inference for complicated stochastic models for which the likelihood function is difficult or even impossible to calculate, and hence conventional likelihood-based inferential techniques cannot be used. In such settings, Bayesian inference can be performed using Approximate Bayesian Computation (ABC). However, in spite of many recent developments to ABC methodology, in many applications the computational cost of ABC necessitates the choice of summary statistics and tolerances that can potentially severely bias the estimate of the posterior.
We propose a new "piecewise" ABC approach suitable for discretely observed Markov models that involves writing the posterior density of the parameters as a product of factors, each a function of only a subset of the data, and then using ABC within each factor. The approach has the advantage of side-stepping the need to choose a summary statistic and it enables a stringent tolerance to be set, making the posterior "less approximate". We investigate two methods for estimating the posterior density based on ABC samples for each of the factors: the first is to use a Gaussian approximation for each factor, and the second is to use a kernel density estimate. Both methods have their merits. The Gaussian approximation is simple, fast, and probably adequate for many applications. On the other hand, using instead a kernel density estimate has the benefit of consistently estimating the true ABC posterior as the number of ABC samples tends to infinity. We illustrate the piecewise ABC approach for three examples; in each case, the approach enables "exact matching" between simulations and data and offers fast and accurate inference.
Submitted 28 May, 2013; v1 submitted 14 January, 2013;
originally announced January 2013.
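For orientation, the following is a sketch of plain rejection ABC, the baseline that the abstract above builds on. It is not the piecewise method itself: there is no factorisation of the posterior and no Gaussian or kernel density step. The function name and all argument names are hypothetical.

```python
import numpy as np


def abc_rejection(observed, simulate, sample_prior, distance, tolerance,
                  n_draws, seed=0):
    """Plain rejection ABC.

    Draw a parameter from the prior, simulate data from the model, and keep
    the parameter when the simulated data fall within `tolerance` of the
    observed data. A tolerance of zero corresponds to exact matching.
    """
    rng = np.random.default_rng(seed)
    accepted = []
    for _ in range(n_draws):
        theta = sample_prior(rng)
        simulated = simulate(theta, rng)
        if distance(simulated, observed) <= tolerance:
            accepted.append(theta)
    return np.array(accepted)
```

With a single low-dimensional observation, such as one Poisson count, exact matching is affordable; with many observations the acceptance rate collapses, which is the cost that motivates applying ABC to one factor at a time.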
-
A data-based power transformation for compositional data
Authors:
Michail T. Tsagris,
Simon Preston,
Andrew T. A. Wood
Abstract:
Compositional data analysis is carried out either by neglecting the compositional constraint and applying standard multivariate data analysis, or by transforming the data using the logs of the ratios of the components. In this work we examine a more general transformation which includes both approaches as special cases. It is a power transformation and involves a single parameter, α. The transformation has two equivalent versions. The first is the stay-in-the-simplex version, which is the power transformation as defined by Aitchison in 1986. The second version, which is a linear transformation of the power transformation, is a Box-Cox type transformation. We discuss a parametric way of estimating the value of α, which is maximization of its profile likelihood (assuming multivariate normality of the transformed data) and the equivalence between the two versions is exhibited. Other ways include maximization of the correct classification probability in discriminant analysis and maximization of the pseudo R-squared (as defined by Aitchison in 1986) in linear regression. We examine the relationship between the α-transformation, the raw data approach and the isometric log-ratio transformation. Furthermore, we also define a suitable family of metrics corresponding to the family of α-transformation and consider the corresponding family of Fréchet means.
Submitted 16 June, 2011; v1 submitted 7 June, 2011;
originally announced June 2011.