-
Tests for categorical data beyond Pearson: A distance covariance and energy distance approach
Authors:
Fernando Castro-Prado,
Wenceslao González-Manteiga,
Javier Costas,
Fernando Facal,
Dominic Edelmann
Abstract:
Categorical variables are of utmost importance in biomedical research. When two of them are considered, it is often the case that one wants to test whether or not they are statistically dependent. We show weaknesses of classical methods, such as Pearson's test and the G-test, and we propose testing strategies based on distances that lack those drawbacks. We first develop this theory for classical two-dimensional contingency tables, within the context of distance covariance, an association measure that characterises general statistical independence of two variables. We then apply the same fundamental ideas to one-dimensional tables, namely to the testing for goodness of fit to a discrete distribution, for which we resort to an analogous statistic called energy distance. We prove that our methodology has desirable theoretical properties, and we show how we can calibrate the null distribution of our test statistics without resorting to any resampling technique. We illustrate all this in simulations, as well as with some real data examples, demonstrating the adequate performance of our approach for biostatistical practice.
Submitted 19 March, 2024;
originally announced March 2024.
-
Linear parametric model checks for functional time series
Authors:
W. González-Manteiga,
M. D. Ruiz-Medina,
M. Febrero-Bande
Abstract:
The presented methodology for testing the goodness-of-fit of an Autoregressive Hilbertian model (ARH(1) model) provides an infinite-dimensional formulation of the approach proposed in Koul and Stute (1999), based on an empirical process marked by residuals. Applying a central and functional central limit result for Hilbert-valued martingale difference sequences, the asymptotic behavior of the formulated H-valued empirical process, also indexed by H, is obtained under the null hypothesis. The limiting process is an H-valued generalized (i.e., indexed by H) Wiener process, leading to an asymptotically distribution-free test. Consistency of the test is also proved. The case of a misspecified autocorrelation operator of the ARH(1) process is addressed. The asymptotic equivalence in probability, uniformly in the norm of H, of the empirical processes formulated under known and unknown autocorrelation operator is obtained. Beyond the Euclidean setting, this approach allows goodness-of-fit testing to be implemented in the context of manifold and spherical functional autoregressive processes.
Submitted 29 June, 2024; v1 submitted 16 March, 2023;
originally announced March 2023.
-
Testing for genetic interactions in complex disease with distance correlation
Authors:
Fernando Castro-Prado,
Javier Costas,
Dominic Edelmann,
Wenceslao González-Manteiga,
David R. Penas
Abstract:
Understanding epistasis (genetic interaction) may shed some light on the genomic basis of common diseases, including disorders of maximum interest due to their high socioeconomic burden, like schizophrenia. Distance correlation is an association measure that characterises general statistical independence between random variables, not only the linear one. Here, we propose distance correlation as a novel tool for the detection of epistasis from case-control data of single-nucleotide polymorphisms (SNPs). On the methodological side, we highlight the derivation of the explicit asymptotic distribution of the test statistic. We show that this is the only way to obtain enough computational speed for the method to be used in practice, in a scenario where the resampling techniques found in the literature are impractical. Our simulations show satisfactory calibration of significance, as well as comparable or better power than existing methodology. We conclude with the application of our technique to a schizophrenia genetics dataset, obtaining biologically sound insights.
Submitted 27 April, 2023; v1 submitted 9 December, 2020;
originally announced December 2020.
-
Nonparametric independence tests in metric spaces: What is known and what is not
Authors:
Fernando Castro-Prado,
Wenceslao González-Manteiga
Abstract:
Distance correlation is a recent extension of Pearson's correlation that characterises general statistical independence between Euclidean-space-valued random variables, not only linear relations. This review delves into how and when distance correlation can be extended to metric spaces, combining the information that is available in the literature with some original remarks and proofs, in a way that is comprehensible to any mathematical statistician.
Submitted 29 September, 2020;
originally announced September 2020.
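
As an illustration of the distance correlation mentioned in the abstract above, here is a minimal sketch of the usual sample (V-statistic) version for scalar observations, computed from double-centered pairwise distance matrices. It is not taken from the listed papers: the function names are hypothetical, the absolute difference is assumed as the distance, and no significance test is included.

```python
import math


def centered_distances(values):
    """Pairwise |v_i - v_j| matrix with row, column and grand means removed."""
    n = len(values)
    d = [[abs(vi - vj) for vj in values] for vi in values]
    row_means = [sum(row) / n for row in d]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [
        [d[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def mean_product(a, b):
    """Average of the entrywise products of two n-by-n matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x, y):
    """Sample distance correlation of two equal-length lists of numbers."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    a = centered_distances(x)
    b = centered_distances(y)
    dcov2 = mean_product(a, b)  # squared sample distance covariance
    denom = math.sqrt(mean_product(a, a) * mean_product(b, b))
    if denom == 0.0:
        return 0.0  # at least one sample is constant
    return math.sqrt(max(dcov2, 0.0) / denom)


# Illustrative data (assumed): y depends on x, but not linearly.
x = [i / 10 for i in range(-10, 11)]
y = [v * v for v in x]
print(distance_correlation(x, y))
```

On this symmetric grid the Pearson correlation of x and y is zero by symmetry, while the sample distance correlation is strictly positive, which is the sense in which the measure detects dependence beyond linear relations.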
-
Partially linear models on Riemannian manifolds
Authors:
Wenceslao Gonzalez-Manteiga,
Guillermo Henry,
Daniela Rodriguez
Abstract:
In partially linear models, the dependence of the response y on (x^T, t) is modeled through the relationship y = x^T β + g(t) + ε, where ε is independent of (x^T, t). In this paper, estimators of β and g are constructed when the explanatory variables t take values on a Riemannian manifold. Our proposal combines the flexibility of these models with the complex structure of a set of explanatory variables. We prove that the resulting estimator of β is asymptotically normal under suitable conditions. Through a simulation study, we explored the performance of the estimators. Finally, we applied the studied model to an example based on a real dataset.
Submitted 8 March, 2010;
originally announced March 2010.