-
Post-selection inference for quantifying uncertainty in changes in variance
Authors:
Rachel Carrington,
Paul Fearnhead
Abstract:
Quantifying uncertainty in detected changepoints is an important problem. However, it is challenging as the naive approach would use the data twice, first to detect the changes, and then to test them. This will bias the test, and can lead to anti-conservative p-values. One approach to avoid this is to use ideas from post-selection inference, which conditions on the information in the data used to choose which changes to test. As a result, this produces valid p-values; that is, p-values that have a uniform distribution if there is no change. Currently such methods have been developed for detecting changes in mean only. This paper presents two approaches for constructing post-selection p-values for detecting changes in variance. These vary depending on the method used to detect the changes, but are general in terms of being applicable for a range of change-detection methods and a range of hypotheses that we may wish to test.
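The bias from using the data twice can be illustrated with a small simulation. The sketch below is not the method of the paper: for simplicity it assumes a single change in mean (rather than variance), Gaussian noise with known unit variance, and a detector that picks the split maximising a standardised mean difference. The function names `naive_p_value` and `naive_rejection_rate` are hypothetical. Under the null of no change, a valid test at level 0.05 should reject about 5% of the time; the naive test typically rejects far more often.

```python
import math
import random


def naive_p_value(x):
    """Select the split with the largest standardised mean difference,
    then test that same split with a two-sided z-test (unit variance assumed)."""
    n = len(x)
    total = sum(x)
    left_sum = 0.0
    best_z = 0.0
    for tau in range(1, n):
        left_sum += x[tau - 1]
        mean_left = left_sum / tau
        mean_right = (total - left_sum) / (n - tau)
        # Standard error of the difference in means with unit noise variance.
        z = (mean_left - mean_right) / math.sqrt(1.0 / tau + 1.0 / (n - tau))
        if abs(z) > abs(best_z):
            best_z = z
    # Two-sided normal p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(best_z) / math.sqrt(2.0))


def naive_rejection_rate(n=100, reps=500, alpha=0.05, seed=0):
    """Fraction of null data sets (no change at all) rejected by the naive test."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if naive_p_value(x) < alpha:
            rejections += 1
    return rejections / reps


rate = naive_rejection_rate()
print(f"naive rejection rate at alpha = 0.05: {rate:.2f}")
```

Because the tested split is the one that looks most extreme in the same data, the naive p-values are far from uniform under the null, which is the anti-conservative behaviour that conditioning on the selection is meant to remove.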
Submitted 24 May, 2024;
originally announced May 2024.
-
Urban mapping in Dar es Salaam using AJIVE
Authors:
Rachel J. Carrington,
Ian L. Dryden,
Madeleine Ellis,
James O. Goulding,
Simon P. Preston,
David J. Sirl
Abstract:
Mapping deprivation in urban areas is important, for example for identifying areas of greatest need and planning interventions. Traditional ways of obtaining deprivation estimates are based on either census or household survey data, which in many areas is unavailable or difficult to collect. However, there has been a huge rise in the amount of new, non-traditional forms of data, such as satellite imagery and cell-phone call-record data, which may contain information useful for identifying deprivation. We use Angle-Based Joint and Individual Variation Explained (AJIVE) to jointly model satellite imagery data, cell-phone data, and survey data for the city of Dar es Salaam, Tanzania. We first identify interpretable low-dimensional structure from the imagery and cell-phone data, and find that we can use these to identify deprivation. We then consider what is gained from further incorporating the more traditional and costly survey data. We also introduce a scalar measure of deprivation as a response variable to be predicted, and consider various approaches to multiview regression, including using AJIVE scores as predictors.
Submitted 13 March, 2024;
originally announced March 2024.
-
Improving Power by Conditioning on Less in Post-selection Inference for Changepoints
Authors:
Rachel Carrington,
Paul Fearnhead
Abstract:
Post-selection inference has recently been proposed as a way of quantifying uncertainty about detected changepoints. The idea is to run a changepoint detection algorithm, and then re-use the same data to perform a test for a change near each of the detected changes. By defining the p-value for the test appropriately, so that it is conditional on the information used to choose the test, this approach will produce valid p-values. We show how to improve the power of these procedures by conditioning on less information. This gives rise to an ideal selective p-value that is intractable but can be approximated by Monte Carlo. We show that for any Monte Carlo sample size, this procedure produces valid p-values, and empirically that a noticeable increase in power is possible with only very modest Monte Carlo sample sizes. Our procedure is easy to implement given existing post-selection inference methods, as we just need to generate perturbations of the data set and re-apply the post-selection method to each of these. On genomic data consisting of human GC content, our procedure increases the number of significant changepoints that are detected from e.g. 17 to 27, when compared to existing methods.
Submitted 17 January, 2024; v1 submitted 13 January, 2023;
originally announced January 2023.
-
Invariance and identifiability issues for word embeddings
Authors:
Rachel Carrington,
Karthik Bharath,
Simon Preston
Abstract:
Word embeddings are commonly obtained as optimizers of a criterion function $f$ of a text corpus, but assessed on word-task performance using a different evaluation function $g$ of the test data. We contend that a possible source of disparity in performance on tasks is the incompatibility between classes of transformations that leave $f$ and $g$ invariant. In particular, word embeddings defined by $f$ are not unique; they are defined only up to a class of transformations to which $f$ is invariant, and this class is larger than the class to which $g$ is invariant. One implication of this is that the apparent superiority of one word embedding over another, as measured by word task performance, may largely be a consequence of the arbitrary elements selected from the respective solution sets. We provide a formal treatment of the above identifiability issue, present some numerical examples, and discuss possible resolutions.
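A toy example may help make the invariance mismatch concrete. The sketch below rests on assumptions not stated in the abstract: that $f$ depends on the embeddings only through inner products between word vectors and context vectors, and that $g$ is built from cosine similarities between word vectors. The vectors, the diagonal transformation `a`, and all names are hypothetical. Rescaling word vectors by $A$ and context vectors by $A^{-1}$ leaves every inner product unchanged, so such an $f$ cannot distinguish the two solutions, yet the cosine similarity between two words changes.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


def scale(v, s):
    """Multiply v by the diagonal matrix whose diagonal is s."""
    return tuple(x * c for x, c in zip(v, s))


# Hypothetical 2-dimensional word vectors (U) and context vectors (V).
U = {"cat": (1.0, 1.0), "dog": (1.0, 2.0)}
V = {"pet": (2.0, 1.0), "car": (0.5, -1.0)}

# Invertible but non-orthogonal transformation A = diag(a).
a = (3.0, 0.5)
a_inv = tuple(1.0 / c for c in a)

U2 = {w: scale(u, a) for w, u in U.items()}      # u -> u A
V2 = {c: scale(v, a_inv) for c, v in V.items()}  # v -> v A^{-1}

# Inner products that an f of this form would see, before and after.
products_before = {(w, c): dot(U[w], V[c]) for w in U for c in V}
products_after = {(w, c): dot(U2[w], V2[c]) for w in U2 for c in V2}

# Cosine similarity that a g of this form would see, before and after.
cos_before = cosine(U["cat"], U["dog"])
cos_after = cosine(U2["cat"], U2["dog"])

print("inner products unchanged:",
      all(math.isclose(products_before[k], products_after[k]) for k in products_before))
print("cosine(cat, dog) before and after:", round(cos_before, 4), round(cos_after, 4))
```

Had `a` been replaced by an orthogonal transformation such as a rotation, both the inner products and the cosine similarities would be preserved, which is the sense in which the class of transformations leaving $g$ invariant is smaller.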
Submitted 6 November, 2019;
originally announced November 2019.