
doi 10.1080/00031305.2024.2374966

Distance Covariance, Independence, and Pairwise Differences

Authors: Jakob Raymaekers, Peter J. Rousseeuw

Abstract: (To appear in The American Statistician.) Distance covariance (Székely, Rizzo, and Bakirov, 2007) is a fascinating recent notion, which is popular as a test for dependence of any type between random variables $X$ and $Y$. This approach deserves to be touched upon in modern courses on mathematical statistics. It makes use of distances of the type $|X-X'|$ and $|Y-Y'|$, where $(X',Y')$ is an independent copy of $(X,Y)$. This raises natural questions about independence of variables like $X-X'$ and $Y-Y'$, about the connection between Cov$(|X-X'|,|Y-Y'|)$ and the covariance between doubly centered distances, and about necessary and sufficient conditions for independence. We show some basic results and present a new and nontechnical counterexample to a common fallacy, which provides more insight. We also show some motivating examples involving bivariate distributions and contingency tables, which can be used as didactic material for introducing distance correlation.

Submitted 18 June, 2024; originally announced June 2024.

Journal ref: The American Statistician, 2024

arXiv:2403.03722 [pdf, other]

Is Distance Correlation Robust?

Authors: Sarah Leyder, Jakob Raymaekers, Peter J. Rousseeuw

Abstract: Distance correlation is a popular measure of dependence between random variables. It has some robustness properties, but not all. We prove that the influence function of the usual distance correlation is bounded, but that its breakdown value is zero. Moreover, it has an unbounded sensitivity function, converging to the bounded influence function for increasing sample size. To address this sensitivity to outliers we construct a more robust version of distance correlation, which is based on a new data transformation. Simulations indicate that the resulting method is quite robust, and has good power in the presence of outliers. We illustrate the method on genetic data. Comparing the classical distance correlation with its more robust version provides additional insight.

Submitted 6 March, 2024; originally announced March 2024.

arXiv:2308.05422 [pdf, other]

TSLiNGAM: DirectLiNGAM under heavy tails

Authors: Sarah Leyder, Jakob Raymaekers, Tim Verdonck

Abstract: One of the established approaches to causal discovery consists of combining directed acyclic graphs (DAGs) with structural causal models (SCMs) to describe the functional dependencies of effects on their causes. Possible identifiability of SCMs given data depends on assumptions made on the noise variables and the functional classes in the SCM. For instance, in the LiNGAM model, the functional class is restricted to linear functions and the disturbances have to be non-Gaussian. In this work, we propose TSLiNGAM, a new method for identifying the DAG of a causal model based on observational data. TSLiNGAM builds on DirectLiNGAM, a popular algorithm which uses simple OLS regression for identifying causal directions between variables. TSLiNGAM leverages the non-Gaussianity assumption of the error terms in the LiNGAM model to obtain more efficient and robust estimation of the causal structure. TSLiNGAM is justified theoretically and is studied empirically in an extensive simulation study. It performs significantly better on heavy-tailed and skewed data and demonstrates a high small-sample efficiency. In addition, TSLiNGAM also shows better robustness properties as it is more resilient to contamination.

Submitted 10 August, 2023; originally announced August 2023.

Comments: 35 pages, 10 figures

arXiv:2303.05836 [pdf, other]

Generalized Spherical Principal Component Analysis

Authors: Sarah Leyder, Jakob Raymaekers, Tim Verdonck

Abstract: Outliers contaminating data sets are a challenge to statistical estimators. Even a small fraction of outlying observations can heavily influence most classical statistical methods. In this paper we propose generalized spherical principal component analysis, a new robust version of principal component analysis that is based on the generalized spatial sign covariance matrix. Supporting theoretical properties of the proposed method including influence functions, breakdown values and asymptotic efficiencies are studied, and a simulation study is conducted to compare our new method to existing methods. We also propose an adjustment of the generalized spatial sign covariance matrix to achieve better Fisher consistency properties. We illustrate that generalized spherical principal component analysis, depending on a chosen radial function, has both great robustness and efficiency properties in addition to a low computational cost.

Submitted 10 March, 2023; originally announced March 2023.

arXiv:2302.03931 [pdf, other]

Fast Linear Model Trees by PILOT

Authors: Jakob Raymaekers, Peter J. Rousseeuw, Tim Verdonck, Ruicong Yao

Abstract: Linear model trees are regression trees that incorporate linear models in the leaf nodes. This preserves the intuitive interpretation of decision trees and at the same time enables them to better capture linear relationships, which is hard for standard decision trees. But most existing methods for fitting linear model trees are time consuming and therefore not scalable to large data sets. In addition, they are more prone to overfitting and extrapolation issues than standard regression trees. In this paper we introduce PILOT, a new algorithm for linear model trees that is fast, regularized, stable and interpretable. PILOT trains in a greedy fashion like classic regression trees, but incorporates an $L^2$ boosting approach and a model selection rule for fitting linear models in the nodes. The abbreviation PILOT stands for $PI$ecewise $L$inear $O$rganic $T$ree, where `organic' refers to the fact that no pruning is carried out. PILOT has the same low time and space complexity as CART without its pruning. An empirical study indicates that PILOT tends to outperform standard decision trees and other linear model trees on a variety of data sets. Moreover, we prove its consistency in an additive model setting under weak assumptions. When the data is generated by a linear model, the convergence rate is polynomial.

Submitted 8 February, 2023; originally announced February 2023.

Journal ref: Machine Learning, 2024

arXiv:2302.02156 [pdf, other]

doi 10.1016/j.ecosta.2024.02.002

Challenges of cellwise outliers

Authors: Jakob Raymaekers, Peter J. Rousseeuw

Abstract: It is well-known that real data often contain outliers. The term outlier typically refers to a case, that is, a row of the $n \times d$ data matrix. In recent times a different type has come into focus, the cellwise outliers. These are suspicious cells (entries) that can occur anywhere in the data matrix. Even a relatively small proportion of outlying cells can contaminate over half the rows, which is a problem for rowwise robust methods. In this article we discuss the challenges posed by cellwise outliers, and some methods developed so far to deal with them. We obtain new results on cellwise breakdown values for location, covariance and regression. We also propose a cellwise robust method for correspondence analysis, with real data illustrations. The paper concludes by formulating some points for debate.

Submitted 4 February, 2023; originally announced February 2023.

Journal ref: Econometrics and Statistics, 2024

arXiv:2209.07374 [pdf, other]

The Influence Function of Graphical Lasso Estimators

Authors: Gaëtan Louvet, Jakob Raymaekers, Germain Van Bever, Ines Wilms

Abstract: The precision matrix that encodes conditional linear dependency relations among a set of variables forms an important object of interest in multivariate analysis. Sparse estimation procedures for precision matrices such as the graphical lasso (Glasso) gained popularity as they facilitate interpretability, thereby separating pairs of variables that are conditionally dependent from those that are independent (given all other variables). Glasso lacks, however, robustness to outliers. To overcome this problem, one typically applies a robust plug-in procedure where the Glasso is computed from a robust covariance estimate instead of the sample covariance, thereby providing protection against outliers. In this paper, we study such estimators theoretically, by deriving and comparing their influence function, sensitivity curves and asymptotic variances.

Submitted 8 March, 2023; v1 submitted 15 September, 2022; originally announced September 2022.

arXiv:2207.13493 [pdf, other]

doi 10.1080/01621459.2023.2267777

The Cellwise Minimum Covariance Determinant Estimator

Authors: Jakob Raymaekers, Peter J. Rousseeuw

Abstract: The usual Minimum Covariance Determinant (MCD) estimator of a covariance matrix is robust against casewise outliers. These are cases (that is, rows of the data matrix) that behave differently from the majority of cases, raising suspicion that they might belong to a different population. On the other hand, cellwise outliers are individual cells in the data matrix. When a row contains one or more outlying cells, the other cells in the same row still contain useful information that we wish to preserve. We propose a cellwise robust version of the MCD method, called cellMCD. Its main building blocks are observed likelihood and a penalty term on the number of flagged cellwise outliers. It possesses good breakdown properties. We construct a fast algorithm for cellMCD based on concentration steps (C-steps) that always lower the objective. The method performs well in simulations with cellwise outliers, and has high finite-sample efficiency on clean data. It is illustrated on real data with visualizations of the results.

Submitted 15 November, 2023; v1 submitted 27 July, 2022; originally announced July 2022.

Journal ref: Journal of the American Statistical Association, 2025

arXiv:2202.08060 [pdf, other]

Equivariant Passing-Bablok regression in quasilinear time

Authors: Jakob Raymaekers, Florian Dufey

Abstract: Passing-Bablok regression is a standard tool for method and assay comparison studies thanks to its place in industry guidelines such as CLSI. Unfortunately, its computational cost is high as a naive approach requires $O(n^2)$ time. This makes it impossible to compute the Passing-Bablok regression estimator on large datasets. Additionally, even on smaller datasets it can be difficult to perform bootstrap-based inference. We introduce the first quasilinear time algorithm for the equivariant Passing-Bablok estimator. In contrast to the naive algorithm, our algorithm runs in $O(n \log n)$ expected time using $O(n)$ space, allowing for its application to much larger data sets. Additionally, we introduce a fast estimator for the variance of the Passing-Bablok slope and discuss statistical inference based on bootstrap and this variance estimate. Finally, we propose a diagnostic plot to identify influential points in Passing-Bablok regression. The superior performance of the proposed methods is illustrated on real data examples of clinical method comparison studies.

Submitted 16 February, 2022; originally announced February 2022.

arXiv:2106.08814 [pdf, other]

doi 10.1080/10618600.2022.2050249

Silhouettes and quasi residual plots for neural nets and tree-based classifiers

Authors: Jakob Raymaekers, Peter J. Rousseeuw

Abstract: Classification by neural nets and by tree-based methods are powerful tools of machine learning. There exist interesting visualizations of the inner workings of these and other classifiers. Here we pursue a different goal, which is to visualize the cases being classified, either in training data or in test data. An important aspect is whether a case has been classified to its given class (label) or whether the classifier wants to assign it to a different class. This is reflected in the (conditional and posterior) probability of the alternative class (PAC). A high PAC indicates label bias, i.e. the possibility that the case was mislabeled. The PAC is used to construct a silhouette plot which is similar in spirit to the silhouette plot for cluster analysis (Rousseeuw, 1987). The average silhouette width can be used to compare different classifications of the same dataset. We will also draw quasi residual plots of the PAC versus a data feature, which may lead to more insight in the data. One of these data features is how far each case lies from its given class. The graphical displays are illustrated and interpreted on benchmark data sets containing images, mixed features, and tweets.

Submitted 26 February, 2022; v1 submitted 16 June, 2021; originally announced June 2021.

Journal ref: Journal of Computational and Graphical Statistics 2022, Volume 31, 1332-1343

arXiv:2101.01494 [pdf, other]

Weight-of-evidence 2.0 with shrinkage and spline-binning

Authors: Jakob Raymaekers, Wouter Verbeke, Tim Verdonck

Abstract: In many practical applications, such as fraud detection, credit risk modeling or medical decision making, classification models for assigning instances to a predefined set of classes are required to be both precise as well as interpretable. Linear modeling methods such as logistic regression are often adopted, since they offer an acceptable balance between precision and interpretability. Linear methods, however, are not well equipped to handle categorical predictors with high-cardinality or to exploit non-linear relations in the data. As a solution, data preprocessing methods such as weight-of-evidence are typically used for transforming the predictors. The binning procedure that underlies the weight-of-evidence approach, however, has been little researched and typically relies on ad-hoc or expert driven procedures. The objective in this paper, therefore, is to propose a formalized, data-driven and powerful method. To this end, we explore the discretization of continuous variables through the binning of spline functions, which allows for capturing non-linear effects in the predictor variables and yields highly interpretable predictors taking only a small number of discrete values. Moreover, we extend upon the weight-of-evidence approach and propose to estimate the proportions using shrinkage estimators. Together, this offers an improved ability to exploit both non-linear and categorical predictors for achieving increased classification precision, while maintaining interpretability of the resulting model and decreasing the risk of overfitting. We present the results of a series of experiments in a fraud detection setting, which illustrate the effectiveness of the presented approach. We facilitate reproduction of the presented results and adoption of the proposed approaches by providing both the dataset and the code for implementing the experiments and the presented approach.

Submitted 24 September, 2021; v1 submitted 5 January, 2021; originally announced January 2021.

Comments: New version: duplicate paragraph omitted

arXiv:2010.00950 [pdf, other]

Regularized K-means through hard-thresholding

Authors: Jakob Raymaekers, Ruben H. Zamar

Abstract: We study a framework of regularized $K$-means methods based on direct penalization of the size of the cluster centers. Different penalization strategies are considered and compared through simulation and theoretical analysis. Based on the results, we propose HT $K$-means, which uses an $\ell_0$ penalty to induce sparsity in the variables. Different techniques for selecting the tuning parameter are discussed and compared. The proposed method stacks up favorably with the most popular regularized $K$-means methods in an extensive simulation study. Finally, HT $K$-means is applied to several real data examples. Graphical displays are presented and used in these examples to gain more insight into the datasets.

Submitted 2 October, 2020; originally announced October 2020.

arXiv:2008.12974 [pdf, other]

doi 10.1016/j.chemolab.2020.104197

Real-time discriminant analysis in the presence of label and measurement noise

Authors: Iwein Vranckx, Jakob Raymaekers, Bart De Ketelaere, Peter J. Rousseeuw, Mia Hubert

Abstract: Quadratic discriminant analysis (QDA) is a widely used classification technique. Based on a training dataset, each class in the data is characterized by an estimate of its center and shape, which can then be used to assign unseen observations to one of the classes. The traditional QDA rule relies on the empirical mean and covariance matrix. Unfortunately, these estimators are sensitive to label and measurement noise which often impairs the model's predictive ability. Robust estimators of location and scatter are resistant to this type of contamination. However, they have a prohibitive computational cost for large scale industrial experiments. We present a novel QDA method based on a recent real-time robust algorithm. We additionally integrate an anomaly detection step to classify the most atypical observations into a separate class of outliers. Finally, we introduce the label bias plot, a graphical display to identify label and measurement noise in the training data. The performance of the proposed approach is illustrated in a simulation study with huge datasets, and on real datasets about diabetes and fruit.

Submitted 10 November, 2020; v1 submitted 29 August, 2020; originally announced August 2020.

Journal ref: Chemometrics and Intelligent Laboratory Systems, 2021, Volume 208

arXiv:2007.14495 [pdf, other]

doi 10.1080/00401706.2021.1927849

Class maps for visualizing classification results

Authors: Jakob Raymaekers, Peter J. Rousseeuw, Mia Hubert

Abstract: Classification is a major tool of statistics and machine learning. A classification method first processes a training set of objects with given classes (labels), with the goal of afterward assigning new objects to one of these classes. When running the resulting prediction method on the training data or on test data, it can happen that an object is predicted to lie in a class that differs from its given label. This is sometimes called label bias, and raises the question whether the object was mislabeled. The proposed class map reflects the probability that an object belongs to an alternative class, how far it is from the other objects in its given class, and whether some objects lie far from all classes. The goal is to visualize aspects of the classification results to obtain insight in the data. The display is constructed for discriminant analysis, the k-nearest neighbor classifier, support vector machines, logistic regression, and coupling pairwise classifications. It is illustrated on several benchmark datasets, including some about images and texts.

Submitted 19 May, 2021; v1 submitted 28 July, 2020; originally announced July 2020.

Comments: Appeared online, Technometrics

Journal ref: Technometrics 2022, Vol. 64, pages 151-165

arXiv:2005.07946 [pdf, other]

doi 10.1007/s10994-021-05960-5

Transforming variables to central normality

Authors: Jakob Raymaekers, Peter J. Rousseeuw

Abstract: Many real data sets contain numerical features (variables) whose distribution is far from normal (gaussian). Instead, their distribution is often skewed. In order to handle such data it is customary to preprocess the variables to make them more normal. The Box-Cox and Yeo-Johnson transformations are well-known tools for this. However, the standard maximum likelihood estimator of their transformation parameter is highly sensitive to outliers, and will often try to move outliers inward at the expense of the normality of the central part of the data. We propose a modification of these transformations as well as an estimator of the transformation parameter that is robust to outliers, so the transformed data can be approximately normal in the center and a few outliers may deviate from it. It compares favorably to existing techniques in an extensive simulation study and on real data.

Submitted 21 November, 2020; v1 submitted 16 May, 2020; originally announced May 2020.

Journal ref: Machine Learning, 2021

arXiv:1912.12446 [pdf, other]

doi 10.52933/jdssv.v1i3.18

Handling cellwise outliers by sparse regression and robust covariance

Authors: Jakob Raymaekers, Peter J. Rousseeuw

Abstract: We propose a data-analytic method for detecting cellwise outliers. Given a robust covariance matrix, outlying cells (entries) in a row are found by the cellHandler technique which combines lasso regression with a stepwise application of constructed cutoff values. The penalty term of the lasso has a physical interpretation as the total distance that suspicious cells need to move in order to bring their row into the fold. For estimating a cellwise robust covariance matrix we construct a detection-imputation method which alternates between flagging outlying cells and updating the covariance matrix as in the EM algorithm. The proposed methods are illustrated by simulations and on real data about volatile organic compounds in children.

Submitted 7 December, 2020; v1 submitted 28 December, 2019; originally announced December 2019.

Journal ref: Journal of Data Science, Statistics, and Visualisation, 2021, issue 3

arXiv:1912.10492 [pdf, ps, other]

doi 10.1093/bioinformatics/btaa243

Pooled variable scaling for cluster analysis

Authors: Jakob Raymaekers, Ruben H. Zamar

Abstract: We propose a new approach for scaling prior to cluster analysis based on the concept of pooled variance. Unlike available scaling procedures such as the standard deviation and the range, our proposed scale avoids dampening the beneficial effect of informative clustering variables. We confirm through an extensive simulation study and applications to well known real data examples that the proposed scaling method is safe and generally useful. Finally, we use our approach to cluster a high dimensional genomic dataset consisting of gene expression data for several specimens of breast cancer cells tissue.

Submitted 25 July, 2020; v1 submitted 22 December, 2019; originally announced December 2019.

Comments: 29 pages, 32 figures

arXiv:1910.05615 [pdf, other]

doi 10.1016/j.chemolab.2020.103957

Real-time outlier detection for large datasets by RT-DetMCD

Authors: Bart De Ketelaere, Mia Hubert, Jakob Raymaekers, Peter J. Rousseeuw, Iwein Vranckx

Abstract: Modern industrial machines can generate gigabytes of data in seconds, frequently pushing the boundaries of available computing power. Together with the time criticality of industrial processing this presents a challenging problem for any data analytics procedure. We focus on the deterministic minimum covariance determinant method (DetMCD), which detects outliers by fitting a robust covariance matrix. We construct a much faster version of DetMCD by replacing its initial estimators by two new methods and incorporating update-based concentration steps. The computation time is reduced further by parallel computing, with a novel robust aggregation method to combine the results from the threads. The speed and accuracy of the proposed real-time DetMCD method (RT-DetMCD) are illustrated by simulation and a real industrial application to food sorting.

Submitted 24 January, 2020; v1 submitted 12 October, 2019; originally announced October 2019.

Journal ref: Chemometrics and Intelligent Laboratory Systems, 2020, Volume 199

arXiv:1808.04278 [pdf, other]

doi 10.1007/s11634-019-00362-x

Clustering genomic words in human DNA using peaks and trends of distributions

Authors: Ana Helena Tavares, Jakob Raymaekers, Peter J. Rousseeuw, Paula Brito, Vera Afreixo

Abstract: In this work we seek clusters of genomic words in human DNA by studying their inter-word lag distributions. Due to the particularly spiked nature of these histograms, a clustering procedure is proposed that first decomposes each distribution into a baseline and a peak distribution. An outlier-robust fitting method is used to estimate the baseline distribution (the `trend'), and a sparse vector of detrended data captures the peak structure. A simulation study demonstrates the effectiveness of the clustering procedure in grouping distributions with similar peak behavior and/or baseline features. The procedure is applied to investigate similarities between the distribution patterns of genomic words of lengths 3 and 5 in the human genome. These experiments demonstrate the potential of the new method for identifying words with similar distance patterns.

Submitted 13 August, 2018; originally announced August 2018.

Journal ref: Advances in Data Analysis and Classification, 2020

arXiv:1805.01417 [pdf, other]

doi 10.1016/j.jmva.2018.11.010

A generalized spatial sign covariance matrix

Authors: Jakob Raymaekers, Peter J. Rousseeuw

Abstract: The well-known spatial sign covariance matrix (SSCM) carries out a radial transform which moves all data points to a sphere, followed by computing the classical covariance matrix of the transformed data. Its popularity stems from its robustness to outliers, fast computation, and applications to correlation and principal component analysis. In this paper we study more general radial functions. It is shown that the eigenvectors of the generalized SSCM are still consistent and the ranks of the eigenvalues are preserved. The influence function of the resulting scatter matrix is derived, and it is shown that its breakdown value is as high as that of the original SSCM. A simulation study indicates that the best results are obtained when the inner half of the data points are not transformed and points lying far away are moved to the center.

Submitted 10 October, 2018; v1 submitted 3 May, 2018; originally announced May 2018.

Journal ref: Journal of Multivariate Analysis, 2019, Vol. 171, 94-111
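
The classical SSCM described in the abstract above (project the centered points onto the unit sphere, then take a covariance of the projected points) can be sketched as follows. This is only an illustrative sketch, not the authors' code: the function name `spatial_sign_covariance` is hypothetical, the coordinatewise median is assumed as the default center (the spatial median is another common choice), and the covariance step is taken here as the average outer product of the spatial signs. The generalized radial functions studied in the paper are not implemented.

```python
import numpy as np


def spatial_sign_covariance(X, center=None):
    """Sketch of a spatial sign covariance matrix for an (n, p) data array.

    Assumption: if no center is supplied, the coordinatewise median is used.
    """
    X = np.asarray(X, dtype=float)
    if center is None:
        center = np.median(X, axis=0)

    # Center the data and compute each point's Euclidean distance to the center.
    Z = X - center
    norms = np.linalg.norm(Z, axis=1)

    # Radial transform: move every point onto the unit sphere.
    # Points that coincide with the center have no direction and stay at zero.
    signs = np.zeros_like(Z)
    nonzero = norms > 0
    signs[nonzero] = Z[nonzero] / norms[nonzero, None]

    # Average outer product of the spatial signs.
    return signs.T @ signs / X.shape[0]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(200, 3))
    print(spatial_sign_covariance(data))
```

Because every transformed point has norm one, a single far-away observation contributes no more to the result than any other point, which is the intuition behind the robustness mentioned in the abstract.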

arXiv:1803.04820 [pdf, other]

doi 10.1007/s10260-018-0425-3

Discussion of "The power of monitoring"

Authors: Jakob Raymaekers, Peter J. Rousseeuw, Iwein Vranckx

Abstract: This is an invited comment on the discussion paper "The power of monitoring: how to make the most of a contaminated multivariate sample" by A. Cerioli, M. Riani, A. Atkinson and A. Corbellini that will appear in the journal Statistical Methods & Applications.

Submitted 13 March, 2018; originally announced March 2018.

Journal ref: Statistical Methods and Applications, 2018, Vol. 27, 589-594

arXiv:1712.05151 [pdf, other]

doi 10.1080/00401706.2019.1677270

Fast robust correlation for high-dimensional data

Authors: Jakob Raymaekers, Peter J. Rousseeuw

Abstract: The product moment covariance is a cornerstone of multivariate data analysis, from which one can derive correlations, principal components, Mahalanobis distances and many other results. Unfortunately the product moment covariance and the corresponding Pearson correlation are very susceptible to outliers (anomalies) in the data. Several robust measures of covariance have been developed, but few are suitable for the ultrahigh dimensional data that are becoming more prevalent nowadays. For that one needs methods whose computation scales well with the dimension, are guaranteed to yield a positive semidefinite covariance matrix, and are sufficiently robust to outliers as well as sufficiently accurate in the statistical sense of low variability. We construct such methods using data transformations. The resulting approach is simple, fast and widely applicable. We study its robustness by deriving influence functions and breakdown values, and computing the mean squared error on contaminated data. Using these results we select a method that performs well overall. This also allows us to construct a faster version of the DetectDeviatingCells method (Rousseeuw and Van den Bossche, 2018) to detect cellwise outliers, that can deal with much higher dimensions. The approach is illustrated on genomic data with 12,000 variables and color video data with 920,000 dimensions.

Submitted 20 October, 2019; v1 submitted 14 December, 2017; originally announced December 2017.

Journal ref: Technometrics, vol. 63, 184-198 (2021)

arXiv:1710.02520 [pdf, other]

doi 10.1007/s12539-017-0273-0

Comparing reverse complementary genomic words based on their distance distributions and frequencies

Authors: Ana Helena Tavares, Jakob Raymaekers, Peter Rousseeuw, Raquel M. Silva, Carlos A. C. Bastos, Armando Pinho, Paula Brito, Vera Afreixo

Abstract: In this work we study reverse complementary genomic word pairs in the human DNA, by comparing both the distance distribution and the frequency of a word to those of its reverse complement. Several measures of dissimilarity between distance distributions are considered, and it is found that the peak dissimilarity works best in this setting. We report the existence of reverse complementary word pairs with very dissimilar distance distributions, as well as word pairs with very similar distance distributions even when both distributions are irregular and contain strong peaks. The association between distribution dissimilarity and frequency discrepancy is explored also, and it is speculated that symmetric pairs combining low and high values of each measure may uncover features of interest. Taken together, our results suggest that some asymmetries in the human genome go far beyond Chargaff's rules. This study uses both the complete human genome and its repeat-masked version.

Submitted 6 October, 2017; originally announced October 2017.

Comments: Post-print of a paper accepted for publication in "Interdisciplinary Sciences: Computational Life Sciences" (ISSN: 1913-2751, ESSN: 1867-1462)

MSC Class: 62P10

Journal ref: Interdisciplinary Sciences: Computational Life Sciences, 2018, Vol. 10, 1-11

arXiv:1702.04197 [pdf, other]

doi 10.1007/978-3-319-60816-7_30

Dissimilar Symmetric Word Pairs in the Human Genome

Authors: Ana Helena Tavares, Jakob Raymaekers, Peter J. Rousseeuw, Raquel M. Silva, Carlos A. C. Bastos, Armando Pinho, Paula Brito, Vera Afreixo

Abstract: In this work we explore the dissimilarity between symmetric word pairs, by comparing the inter-word distance distribution of a word to that of its reversed complement. We propose a new measure of dissimilarity between such distributions. Since symmetric pairs with different patterns could point to evolutionary features, we search for the pairs with the most dissimilar behaviour. We focus our study on the complete human genome and its repeat-masked version.

Submitted 5 July, 2017; v1 submitted 14 February, 2017; originally announced February 2017.

Comments: Submitted 13-Feb-2017; accepted, after a minor revision, 17-Mar-2017; 11th International Conference on Practical Applications of Computational Biology & Bioinformatics, PACBB 2017, Porto, Portugal, 21-23 June, 2017

Journal ref: Advances in Intelligent Systems and Computing, Vol 616, 248-256. Springer, 2017

arXiv:1608.05012 [pdf, other]

doi 10.1080/10618600.2017.1366912

A Measure of Directional Outlyingness with Applications to Image Data and Video

Authors: Peter J. Rousseeuw, Jakob Raymaekers, Mia Hubert

Abstract: Functional data covers a wide range of data types. They all have in common that the observed objects are functions of a univariate argument (e.g. time or wavelength) or a multivariate argument (say, a spatial position). These functions take on values which can in turn be univariate (such as the absorbance level) or multivariate (such as the red/green/blue color levels of an image). In practice it is important to be able to detect outliers in such data. For this purpose we introduce a new measure of outlyingness that we compute at each gridpoint of the functions' domain. The proposed Directional Outlyingness (DO) measure accounts for skewness in the data and only requires O(n) computation time per direction. We derive the influence function of the DO and compute a cutoff for outlier detection. The resulting heatmap and functional outlier map reflect local and global outlyingness of a function. To illustrate the performance of the method on real data it is applied to spectra, MRI images, and video surveillance data.

Submitted 3 March, 2017; v1 submitted 17 August, 2016; originally announced August 2016.

Journal ref: Journal of Computational and Graphical Statistics, 2018, Vol. 27, 345-359

arXiv:1601.08133 [pdf, other]

Finding Outliers in Surface Data and Video

Authors: Mia Hubert, Jakob Raymaekers, Peter J. Rousseeuw, Pieter Segaert

Abstract: Surface, image and video data can be considered as functional data with a bivariate domain. To detect outlying surfaces or images, a new method is proposed based on the mean and the variability of the degree of outlyingness at each grid point. A rule is constructed to flag the outliers in the resulting functional outlier map. Heatmaps of their outlyingness indicate the regions which are most deviating from the regular surfaces. The method is applied to fluorescence excitation-emission spectra after fitting a PARAFAC model, to MRI image data which are augmented with their gradients, and to video surveillance data.

Submitted 29 January, 2016; originally announced January 2016.

Showing 1–26 of 26 results for author: Raymaekers, J