
Saddlepoint approximations in binary genome-wide association studies

Authors: Pål Vegard Johnsen, Øyvind Bakke, Thea Bjørnland, Andrew Thomas DeWan, Mette Langaas

Abstract: We investigate saddlepoint approximations applied to the score test statistic in genome-wide association studies with binary phenotypes. The inaccuracy in the normal approximation of the score test statistic increases with increasing sample imbalance and with decreasing minor allele count. Applying saddlepoint approximations to the score test statistic distribution greatly improves the accuracy, even far out in the tail of the distribution. By using exact results for an intercept model and binary covariate model, as well as simulations for models with nuisance parameters, we emphasize the need for continuity corrections in order to achieve valid $p$-values. The performance of the saddlepoint approximations is evaluated by overall and conditional type I error rate on simulated data. We investigate the methods further by using data from UK Biobank with skin and soft tissue infections as phenotype, using both common and rare variants. The analysis confirms that continuity correction is important particularly for rare variants, and that the normal approximation gives a highly inflated type I error rate for case imbalance.

Submitted 8 October, 2021; originally announced October 2021.

Comments: 15 pages in main manuscript and 7 pages in supplementary file

arXiv:2109.00855 [pdf, other]

Inferring feature importance with uncertainties in high-dimensional data

Authors: Pål Vegard Johnsen, Inga Strümke, Signe Riemer-Sørensen, Andrew Thomas DeWan, Mette Langaas

Abstract: Estimating feature importance is a significant aspect of explaining data-based models. Besides explaining the model itself, an equally relevant question is which features are important in the underlying data generating process. We present a Shapley value based framework for inferring the importance of individual features, including uncertainty in the estimator. We build upon the recently published feature importance measure of SAGE (Shapley additive global importance) and introduce sub-SAGE which can be estimated without resampling for tree-based models. We argue that the uncertainties can be estimated from bootstrapping and demonstrate the approach for tree ensemble methods. The framework is exemplified on synthetic data as well as high-dimensional genomics data.

Submitted 20 September, 2021; v1 submitted 2 September, 2021; originally announced September 2021.

arXiv:1701.01286 [pdf, other]

doi 10.1002/sim.7914

Powerful extreme phenotype sampling designs and score tests for genetic association studies

Authors: Thea Bjørnland, Anja Bye, Einar Ryeng, Ulrik Wisløff, Mette Langaas

Abstract: We consider cross-sectional genetic association studies (common and rare variants) where non-genetic information is available, or feasible to obtain for $N$ individuals, but where it is infeasible to genotype all $N$ individuals. We consider continuously measurable Gaussian traits (phenotypes). Genotyping $n<N$ extreme phenotype individuals can yield better power to detect phenotype-genotype associations, as compared to randomly selecting $n$ individuals. We define a person as having an extreme phenotype if the observed phenotype is above a specified threshold or below a specified threshold. We consider a model where these thresholds can be tailored to each individual. The classical extreme sampling design is to set equal thresholds for all individuals. We introduce a design ($z$-extreme sampling) where personalized thresholds are defined based on the residuals of a regression model including only non-genetic (fully available) information. We derive score tests for the situation where only $n$ extremes are analyzed (complete case analysis), and for the situation where the non-genetic information on $N-n$ non-extremes is included in the analysis (all case analysis). For the classical design, all case analysis is generally more powerful than complete case analysis. For the $z$-extreme sample, we show that all case and complete case tests are equally powerful. Simulations and data analysis also show that $z$-extreme sampling is at least as powerful as the classical extreme sampling design and the classical design is shown to be at times less powerful than random sampling. The method of dichotomizing extreme phenotypes is also discussed.

Submitted 5 February, 2020; v1 submitted 5 January, 2017; originally announced January 2017.

Journal ref: Statistics in Medicine. 2018; 37: 4234-4251

arXiv:1612.07010 [pdf, ps, other]

Permutation in genetic association studies with covariates: controlling the familywise error rate with score tests in generalized linear models

Authors: Kari Krizak Halle, Mette Langaas

Abstract: In genome-wide association (GWA) studies the goal is to detect associations between genetic markers and a given phenotype. The number of genetic markers can be large and effective methods for control of the overall error rate are a central topic when analyzing GWA data. The Bonferroni method is known to be conservative when the tests are dependent. Permutation methods give exact control of the overall error rate when the assumption of exchangeability is satisfied, but are computationally intensive for large datasets. For regression models the exchangeability assumption is in general not satisfied and there is no standard solution on how to do permutation testing, except some approximate methods. In this paper we will discuss permutation methods for control of the familywise error rate in genetic association studies and present an approximate solution. These methods will be compared using simulated data.

Submitted 8 May, 2017; v1 submitted 21 December, 2016; originally announced December 2016.

Comments: 19 pages

arXiv:1612.04535 [pdf, other]

Is the familywise error rate in genomics controlled by methods based on the effective number of independent tests?

Authors: Kari Krizak Halle, Srdjan Djurovic, Ole Andreas Andreassen, Mette Langaas

Abstract: In genome-wide association (GWA) studies the goal is to detect association between one or more genetic markers and a given phenotype. The number of genetic markers in a GWA study can be on the order of hundreds of thousands and therefore multiple testing methods are needed. This paper presents a set of popular methods to be used to correct for multiple testing in GWA studies. All are based on the concept of estimating an effective number of independent tests. We compare these methods using simulated data and data from the TOP study, and show that the effective number of independent tests is not additive over blocks of independent genetic markers unless we assume a common value for the local significance level. We also show that the reviewed methods based on estimating the effective number of independent tests in general do not control the familywise error rate.

Submitted 21 December, 2016; v1 submitted 14 December, 2016; originally announced December 2016.

Comments: 20 pages, 3 figures

arXiv:1603.05938 [pdf, other]

doi 10.1111/sjos.12451

Efficient and powerful familywise error control in genome-wide association studies using generalized linear models

Authors: K. K. Halle, Ø. Bakke, S. Djurovic, A. Bye, E. Ryeng, U. Wisløff, O. A. Andreassen, M. Langaas

Abstract: In genetic association studies, detecting phenotype-genotype association is a primary goal. We assume that the relationship between the data (phenotype, genetic markers and environmental covariates) can be modelled by a generalized linear model (GLM). The inclusion of environmental covariates makes it possible to account for important confounding factors, such as sex and population substructure. A multivariate score statistic, which under the complete null hypothesis of no phenotype-genotype association asymptotically has a multivariate normal distribution with a covariance matrix that can be estimated from the data, is used to test a large number of genetic markers for association with the phenotype. We stress the importance of controlling the familywise error rate (FWER), and use the asymptotic distribution of the multivariate score test statistic to find a local significance level for the individual test. Using real data (from one study on schizophrenia and bipolar disorder and one on maximal oxygen uptake) and constructed correlated structures, we show that our method is a powerful alternative to the popular Bonferroni and Sidak methods. For GLMs without environmental covariates, we show that our method is an efficient alternative to permutation methods for multiple testing. Further, we show that if environmental covariates and genetic markers are uncorrelated, the estimated covariance matrix of the score test statistic can be approximated by the estimated correlation matrix for just the genetic markers. As byproducts of our method, an effective number of independent tests can be defined, and FWER-adjusted $p$-values can be calculated as an alternative to using a local significance level.

Submitted 22 December, 2016; v1 submitted 18 March, 2016; originally announced March 2016.

arXiv:1307.7537 [pdf, ps, other]

Exact conditional p-values from arbitrary ranking of a sample space: An application to genome-wide association studies

Authors: Max Moldovan, Mette Langaas

Abstract: We introduce a method for computation of exact conditional efficiency robust enumeration p-values for detection of genotype--phenotype associations at a single bi-allelic genetic locus. Our method can be based on any arbitrary ranking test statistic, such as efficiency robust test statistics or asymptotic p-values. The resulting p-values are exact conditional enumeration p-values and satisfy the basic statistical validity property. Practically, the method allows performing statistically valid significance testing in genomic analyses with unknown modes of inheritance at individual bi-allelic genetic loci -- the situation typical in genome-wide association studies. We provide open-source R code implementing the method.

Submitted 29 July, 2013; originally announced July 2013.

Journal ref: Advances in Systems Science and Applications (2014) Vol.14 No.1 76-83

arXiv:1307.7536 [pdf, ps, other]

doi 10.1515/sagmb-2013-0084

Robust Methods for Disease-Genotype Association in Genetic Association Studies: Calculate P-values Using Exact Conditional Enumeration instead of Asymptotic Approximations

Authors: Mette Langaas, Øyvind Bakke

Abstract: In genetic association studies, detecting disease-genotype associations is a primary goal. For most diseases, the underlying genetic model is unknown, and we study seven robust test statistics for monotone association. For a given test statistic, there are many ways to calculate a p-value, but in genetic association studies, calculations have predominantly been based on asymptotic approximations or on simulated permutations. We show that when the number of permutations tends to infinity, the permutation p-value approaches the exact conditional enumeration p-value, and further that calculating the latter p-value is much more efficient than performing simulated permutations. We then answer two research questions. (i) Which of the test statistics under study are the most powerful for monotone genetic models? (ii) Based on test size, power, and computational considerations, should asymptotic approximations or exact conditional enumeration be used for calculating p-values? We have studied case-control sample sizes with 500-5000 cases and 500-15000 controls, and significance levels from 5e-8 to 0.05, thus our results are applicable to genetic association studies with only one genetic marker under study, intermediate follow-up studies, and genome-wide association studies. We find that if all monotone genetic models are of interest, the best performance is achieved for a test statistic based on the maximum over a range of Cochran-Armitage trend tests with different scores and for a constrained likelihood ratio test. For significance levels below 0.05, asymptotic approximations may give a test size up to 20 times the nominal level, and should therefore be used with caution. Further, calculating p-values based on exact conditional enumeration is a powerful, valid and computationally feasible approach, and we advocate its use in genetic association studies.

Submitted 29 July, 2013; originally announced July 2013.

Showing 1–8 of 8 results for author: Langaas, M