Showing 1–26 of 26 results for author: Raymaekers, J

  1. Distance Covariance, Independence, and Pairwise Differences

    Authors: Jakob Raymaekers, Peter J. Rousseeuw

    Abstract: (To appear in The American Statistician.) Distance covariance (Székely, Rizzo, and Bakirov, 2007) is a fascinating recent notion, which is popular as a test for dependence of any type between random variables $X$ and $Y$. This approach deserves to be touched upon in modern courses on mathematical statistics. It makes use of distances of the type $|X-X'|$ and $|Y-Y'|$, where $(X',Y')$ is an indepen…

    Submitted 18 June, 2024; originally announced June 2024.

    Journal ref: The American Statistician, 2024

  2. arXiv:2403.03722

    stat.ME

    Is Distance Correlation Robust?

    Authors: Sarah Leyder, Jakob Raymaekers, Peter J. Rousseeuw

    Abstract: Distance correlation is a popular measure of dependence between random variables. It has some robustness properties, but not all. We prove that the influence function of the usual distance correlation is bounded, but that its breakdown value is zero. Moreover, it has an unbounded sensitivity function, converging to the bounded influence function for increasing sample size. To address this sensitiv…

    Submitted 6 March, 2024; originally announced March 2024.

  3. arXiv:2308.05422

    stat.ME stat.ML

    TSLiNGAM: DirectLiNGAM under heavy tails

    Authors: Sarah Leyder, Jakob Raymaekers, Tim Verdonck

    Abstract: One of the established approaches to causal discovery consists of combining directed acyclic graphs (DAGs) with structural causal models (SCMs) to describe the functional dependencies of effects on their causes. Possible identifiability of SCMs given data depends on assumptions made on the noise variables and the functional classes in the SCM. For instance, in the LiNGAM model, the functional clas…

    Submitted 10 August, 2023; originally announced August 2023.

    Comments: 35 pages, 10 figures

  4. arXiv:2303.05836

    stat.ME

    Generalized Spherical Principal Component Analysis

    Authors: Sarah Leyder, Jakob Raymaekers, Tim Verdonck

    Abstract: Outliers contaminating data sets are a challenge to statistical estimators. Even a small fraction of outlying observations can heavily influence most classical statistical methods. In this paper we propose generalized spherical principal component analysis, a new robust version of principal component analysis that is based on the generalized spatial sign covariance matrix. Supporting theoretical p…

    Submitted 10 March, 2023; originally announced March 2023.

  5. arXiv:2302.03931

    stat.ML cs.LG stat.ME

    Fast Linear Model Trees by PILOT

    Authors: Jakob Raymaekers, Peter J. Rousseeuw, Tim Verdonck, Ruicong Yao

    Abstract: Linear model trees are regression trees that incorporate linear models in the leaf nodes. This preserves the intuitive interpretation of decision trees and at the same time enables them to better capture linear relationships, which is hard for standard decision trees. But most existing methods for fitting linear model trees are time consuming and therefore not scalable to large data sets. In addit…

    Submitted 8 February, 2023; originally announced February 2023.

    Journal ref: Machine Learning, 2024

  6. Challenges of cellwise outliers

    Authors: Jakob Raymaekers, Peter J. Rousseeuw

    Abstract: It is well-known that real data often contain outliers. The term outlier typically refers to a case, that is, a row of the $n \times d$ data matrix. In recent times a different type has come into focus, the cellwise outliers. These are suspicious cells (entries) that can occur anywhere in the data matrix. Even a relatively small proportion of outlying cells can contaminate over half the rows, whic…

    Submitted 4 February, 2023; originally announced February 2023.

    Journal ref: Econometrics and Statistics, 2024

  7. arXiv:2209.07374

    stat.ME math.ST

    The Influence Function of Graphical Lasso Estimators

    Authors: Gaëtan Louvet, Jakob Raymaekers, Germain Van Bever, Ines Wilms

    Abstract: The precision matrix that encodes conditional linear dependency relations among a set of variables forms an important object of interest in multivariate analysis. Sparse estimation procedures for precision matrices such as the graphical lasso (Glasso) gained popularity as they facilitate interpretability, thereby separating pairs of variables that are conditionally dependent from those that are in…

    Submitted 8 March, 2023; v1 submitted 15 September, 2022; originally announced September 2022.

  8. The Cellwise Minimum Covariance Determinant Estimator

    Authors: Jakob Raymaekers, Peter J. Rousseeuw

    Abstract: The usual Minimum Covariance Determinant (MCD) estimator of a covariance matrix is robust against casewise outliers. These are cases (that is, rows of the data matrix) that behave differently from the majority of cases, raising suspicion that they might belong to a different population. On the other hand, cellwise outliers are individual cells in the data matrix. When a row contains one or more ou…

    Submitted 15 November, 2023; v1 submitted 27 July, 2022; originally announced July 2022.

    Journal ref: Journal of the American Statistical Association, 2025

  9. arXiv:2202.08060

    stat.ME

    Equivariant Passing-Bablok regression in quasilinear time

    Authors: Jakob Raymaekers, Florian Dufey

    Abstract: Passing-Bablok regression is a standard tool for method and assay comparison studies thanks to its place in industry guidelines such as CLSI. Unfortunately, its computational cost is high as a naive approach requires $O(n^2)$ time. This makes it impossible to compute the Passing-Bablok regression estimator on large datasets. Additionally, even on smaller datasets it can be difficult to perform bootst…

    Submitted 16 February, 2022; originally announced February 2022.

  10. Silhouettes and quasi residual plots for neural nets and tree-based classifiers

    Authors: Jakob Raymaekers, Peter J. Rousseeuw

    Abstract: Classification by neural nets and by tree-based methods are powerful tools of machine learning. There exist interesting visualizations of the inner workings of these and other classifiers. Here we pursue a different goal, which is to visualize the cases being classified, either in training data or in test data. An important aspect is whether a case has been classified to its given class (label) or…

    Submitted 26 February, 2022; v1 submitted 16 June, 2021; originally announced June 2021.

    Journal ref: Journal of Computational and Graphical Statistics 2022, Volume 31, 1332-1343

  11. arXiv:2101.01494

    stat.ML cs.LG

    Weight-of-evidence 2.0 with shrinkage and spline-binning

    Authors: Jakob Raymaekers, Wouter Verbeke, Tim Verdonck

    Abstract: In many practical applications, such as fraud detection, credit risk modeling or medical decision making, classification models for assigning instances to a predefined set of classes are required to be both precise as well as interpretable. Linear modeling methods such as logistic regression are often adopted, since they offer an acceptable balance between precision and interpretability. Linear me…

    Submitted 24 September, 2021; v1 submitted 5 January, 2021; originally announced January 2021.

    Comments: New version: duplicate paragraph omitted

  12. arXiv:2010.00950

    stat.ML cs.LG stat.ME

    Regularized K-means through hard-thresholding

    Authors: Jakob Raymaekers, Ruben H. Zamar

    Abstract: We study a framework of regularized $K$-means methods based on direct penalization of the size of the cluster centers. Different penalization strategies are considered and compared through simulation and theoretical analysis. Based on the results, we propose HT $K$-means, which uses an $\ell_0$ penalty to induce sparsity in the variables. Different techniques for selecting the tuning parameter are…

    Submitted 2 October, 2020; originally announced October 2020.

  13. Real-time discriminant analysis in the presence of label and measurement noise

    Authors: Iwein Vranckx, Jakob Raymaekers, Bart De Ketelaere, Peter J. Rousseeuw, Mia Hubert

    Abstract: Quadratic discriminant analysis (QDA) is a widely used classification technique. Based on a training dataset, each class in the data is characterized by an estimate of its center and shape, which can then be used to assign unseen observations to one of the classes. The traditional QDA rule relies on the empirical mean and covariance matrix. Unfortunately, these estimators are sensitive to label an…

    Submitted 10 November, 2020; v1 submitted 29 August, 2020; originally announced August 2020.

    Journal ref: Chemometrics and Intelligent Laboratory Systems, 2021, Volume 208

  14. arXiv:2007.14495

    stat.ML cs.LG stat.CO stat.ME

    Class maps for visualizing classification results

    Authors: Jakob Raymaekers, Peter J. Rousseeuw, Mia Hubert

    Abstract: Classification is a major tool of statistics and machine learning. A classification method first processes a training set of objects with given classes (labels), with the goal of afterward assigning new objects to one of these classes. When running the resulting prediction method on the training data or on test data, it can happen that an object is predicted to lie in a class that differs from its…

    Submitted 19 May, 2021; v1 submitted 28 July, 2020; originally announced July 2020.

    Comments: Appeared online, Technometrics

    Journal ref: Technometrics 2022, Vol. 64, pages 151-165

  15. Transforming variables to central normality

    Authors: Jakob Raymaekers, Peter J. Rousseeuw

    Abstract: Many real data sets contain numerical features (variables) whose distribution is far from normal (gaussian). Instead, their distribution is often skewed. In order to handle such data it is customary to preprocess the variables to make them more normal. The Box-Cox and Yeo-Johnson transformations are well-known tools for this. However, the standard maximum likelihood estimator of their transformati…

    Submitted 21 November, 2020; v1 submitted 16 May, 2020; originally announced May 2020.

    Journal ref: Machine Learning, 2021

  16. Handling cellwise outliers by sparse regression and robust covariance

    Authors: Jakob Raymaekers, Peter J. Rousseeuw

    Abstract: We propose a data-analytic method for detecting cellwise outliers. Given a robust covariance matrix, outlying cells (entries) in a row are found by the cellHandler technique which combines lasso regression with a stepwise application of constructed cutoff values. The penalty term of the lasso has a physical interpretation as the total distance that suspicious cells need to move in order to bring t…

    Submitted 7 December, 2020; v1 submitted 28 December, 2019; originally announced December 2019.

    Journal ref: Journal of Data Science, Statistics, and Visualisation, 2021, issue 3

  17. Pooled variable scaling for cluster analysis

    Authors: Jakob Raymaekers, Ruben H. Zamar

    Abstract: We propose a new approach for scaling prior to cluster analysis based on the concept of pooled variance. Unlike available scaling procedures such as the standard deviation and the range, our proposed scale avoids dampening the beneficial effect of informative clustering variables. We confirm through an extensive simulation study and applications to well known real data examples that the proposed s…

    Submitted 25 July, 2020; v1 submitted 22 December, 2019; originally announced December 2019.

    Comments: 29 pages, 32 figures

  18. Real-time outlier detection for large datasets by RT-DetMCD

    Authors: Bart De Ketelaere, Mia Hubert, Jakob Raymaekers, Peter J. Rousseeuw, Iwein Vranckx

    Abstract: Modern industrial machines can generate gigabytes of data in seconds, frequently pushing the boundaries of available computing power. Together with the time criticality of industrial processing this presents a challenging problem for any data analytics procedure. We focus on the deterministic minimum covariance determinant method (DetMCD), which detects outliers by fitting a robust covariance matr…

    Submitted 24 January, 2020; v1 submitted 12 October, 2019; originally announced October 2019.

    Journal ref: Chemometrics and Intelligent Laboratory Systems, 2020, Volume 199

  19. Clustering genomic words in human DNA using peaks and trends of distributions

    Authors: Ana Helena Tavares, Jakob Raymaekers, Peter J. Rousseeuw, Paula Brito, Vera Afreixo

    Abstract: In this work we seek clusters of genomic words in human DNA by studying their inter-word lag distributions. Due to the particularly spiked nature of these histograms, a clustering procedure is proposed that first decomposes each distribution into a baseline and a peak distribution. An outlier-robust fitting method is used to estimate the baseline distribution (the `trend'), and a sparse vector of…

    Submitted 13 August, 2018; originally announced August 2018.

    Journal ref: Advances in Data Analysis and Classification, 2020

  20. A generalized spatial sign covariance matrix

    Authors: Jakob Raymaekers, Peter J. Rousseeuw

    Abstract: The well-known spatial sign covariance matrix (SSCM) carries out a radial transform which moves all data points to a sphere, followed by computing the classical covariance matrix of the transformed data. Its popularity stems from its robustness to outliers, fast computation, and applications to correlation and principal component analysis. In this paper we study more general radial functions. It i…

    Submitted 10 October, 2018; v1 submitted 3 May, 2018; originally announced May 2018.

    Journal ref: Journal of Multivariate Analysis, 2019, Vol. 171, 94-111

  21. Discussion of "The power of monitoring"

    Authors: Jakob Raymaekers, Peter J. Rousseeuw, Iwein Vranckx

    Abstract: This is an invited comment on the discussion paper "The power of monitoring: how to make the most of a contaminated multivariate sample" by A. Cerioli, M. Riani, A. Atkinson and A. Corbellini that will appear in the journal Statistical Methods & Applications.

    Submitted 13 March, 2018; originally announced March 2018.

    Journal ref: Statistical Methods and Applications, 2018, Vol. 27, 589-594

  22. Fast robust correlation for high-dimensional data

    Authors: Jakob Raymaekers, Peter J. Rousseeuw

    Abstract: The product moment covariance is a cornerstone of multivariate data analysis, from which one can derive correlations, principal components, Mahalanobis distances and many other results. Unfortunately the product moment covariance and the corresponding Pearson correlation are very susceptible to outliers (anomalies) in the data. Several robust measures of covariance have been developed, but few are…

    Submitted 20 October, 2019; v1 submitted 14 December, 2017; originally announced December 2017.

    Journal ref: Technometrics, vol. 63, 184-198 (2021)

  23. Comparing reverse complementary genomic words based on their distance distributions and frequencies

    Authors: Ana Helena Tavares, Jakob Raymaekers, Peter Rousseeuw, Raquel M. Silva, Carlos A. C. Bastos, Armando Pinho, Paula Brito, Vera Afreixo

    Abstract: In this work we study reverse complementary genomic word pairs in the human DNA, by comparing both the distance distribution and the frequency of a word to those of its reverse complement. Several measures of dissimilarity between distance distributions are considered, and it is found that the peak dissimilarity works best in this setting. We report the existence of reverse complementary word pair…

    Submitted 6 October, 2017; originally announced October 2017.

    Comments: Post-print of a paper accepted to publication in "Interdisciplinary Sciences: Computational Life Sciences" (ISSN: 1913-2751, ESSN: 1867-1462)

    MSC Class: 62P10

    Journal ref: Interdisciplinary Sciences: Computational Life Sciences, 2018, Vol. 10, 1-11

  24. Dissimilar Symmetric Word Pairs in the Human Genome

    Authors: Ana Helena Tavares, Jakob Raymaekers, Peter J. Rousseeuw, Raquel M. Silva, Carlos A. C. Bastos, Armando Pinho, Paula Brito, Vera Afreixo

    Abstract: In this work we explore the dissimilarity between symmetric word pairs, by comparing the inter-word distance distribution of a word to that of its reversed complement. We propose a new measure of dissimilarity between such distributions. Since symmetric pairs with different patterns could point to evolutionary features, we search for the pairs with the most dissimilar behaviour. We focus our study…

    Submitted 5 July, 2017; v1 submitted 14 February, 2017; originally announced February 2017.

    Comments: Submitted 13-Feb-2017; accepted, after a minor revision, 17-Mar-2017; 11th International Conference on Practical Applications of Computational Biology & Bioinformatics, PACBB 2017, Porto, Portugal, 21-23 June, 2017

    Journal ref: Advances in Intelligent Systems and Computing, Vol 616, 248-256. Springer, 2017

  25. A Measure of Directional Outlyingness with Applications to Image Data and Video

    Authors: Peter J. Rousseeuw, Jakob Raymaekers, Mia Hubert

    Abstract: Functional data covers a wide range of data types. They all have in common that the observed objects are functions of a univariate argument (e.g. time or wavelength) or a multivariate argument (say, a spatial position). These functions take on values which can in turn be univariate (such as the absorbance level) or multivariate (such as the red/green/blue color levels of an image). In practice…

    Submitted 3 March, 2017; v1 submitted 17 August, 2016; originally announced August 2016.

    Journal ref: Journal of Computational and Graphical Statistics, 2018, Vol. 27, 345-359

  26. arXiv:1601.08133

    stat.ME

    Finding Outliers in Surface Data and Video

    Authors: Mia Hubert, Jakob Raymaekers, Peter J. Rousseeuw, Pieter Segaert

    Abstract: Surface, image and video data can be considered as functional data with a bivariate domain. To detect outlying surfaces or images, a new method is proposed based on the mean and the variability of the degree of outlyingness at each grid point. A rule is constructed to flag the outliers in the resulting functional outlier map. Heatmaps of their outlyingness indicate the regions which are most devia…

    Submitted 29 January, 2016; originally announced January 2016.