-
Boundary Peeling: Outlier Detection Method Using One-Class Peeling
Authors:
Sheikh Arafat,
Na Sun,
Maria L. Weese,
Waldyn G. Martinez
Abstract:
Unsupervised outlier detection constitutes a crucial phase within data analysis and remains a dynamic realm of research. A good outlier detection algorithm should be computationally efficient, robust to tuning parameter selection, and perform consistently well across diverse underlying data distributions. We introduce One-Class Boundary Peeling, an unsupervised outlier detection algorithm. One-Class Boundary Peeling uses the average signed distance from iteratively peeled, flexible boundaries generated by one-class support vector machines. One-Class Boundary Peeling has robust hyperparameter settings and, for increased flexibility, can be cast as an ensemble method. In synthetic data simulations, One-Class Boundary Peeling outperforms all state-of-the-art methods when no outliers are present while maintaining comparable or superior performance in the presence of outliers, as compared to benchmark methods. One-Class Boundary Peeling performs competitively in terms of correct classification, AUC, and processing time using common benchmark data sets.
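The idea of averaging signed distances over iteratively peeled one-class SVM boundaries can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `peeling_scores`, the stopping rule (`min_points`, `max_peels`), the choice `nu=0.1`, and the toy data are all assumptions made for illustration, and scikit-learn's `OneClassSVM` stands in for whatever solver the paper uses.

```python
import numpy as np
from sklearn.svm import OneClassSVM


def peeling_scores(X, nu=0.1, min_points=10, max_peels=50):
    """Average signed distance of each row of X to successively peeled boundaries.

    Hypothetical helper: the stopping rule and defaults are illustrative guesses.
    """
    X = np.asarray(X, dtype=float)
    keep = np.ones(len(X), dtype=bool)  # points that have not been peeled yet
    total = np.zeros(len(X))
    peels = 0
    while keep.sum() >= min_points and peels < max_peels:
        # Fit a flexible boundary around the points that remain.
        model = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit(X[keep])
        # Signed distance for ALL points: positive inside, negative outside.
        dist = model.decision_function(X)
        total += dist
        peels += 1
        outside = keep & (dist < 0)
        if not outside.any():
            break
        keep &= ~outside  # peel the points lying outside the boundary
    return total / max(peels, 1)


# Toy data (assumption): one Gaussian cluster plus a planted point far away.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)), [[10.0, 10.0]]])
scores = peeling_scores(X)
```

Lower average scores indicate points that sit outside most of the peeled boundaries, so the planted point in the last row would be expected to score well below the bulk of the cluster. How the paper turns such scores into a classification threshold is not stated in the abstract and is not attempted here.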
Submitted 11 September, 2023;
originally announced September 2023.
-
NLP-based detection of systematic anomalies among the narratives of consumer complaints
Authors:
Peiheng Gao,
Ning Sun,
Xuefeng Wang,
Chen Yang,
Ričardas Zitikis
Abstract:
We develop an NLP-based procedure for detecting systematic nonmeritorious consumer complaints, simply called systematic anomalies, among complaint narratives. While classification algorithms are used to detect pronounced anomalies, in the case of smaller and frequent systematic anomalies, the algorithms may falter due to a variety of reasons, including technical ones as well as natural limitations of human analysts. Therefore, as the next step after classification, we convert the complaint narratives into quantitative data, which are then analyzed using an algorithm for detecting systematic anomalies. We illustrate the entire procedure using complaint narratives from the Consumer Complaint Database of the Consumer Financial Protection Bureau.
Submitted 26 March, 2024; v1 submitted 21 August, 2023;
originally announced August 2023.
-
Tail maximal dependence in bivariate models: estimation and applications
Authors:
Ning Sun,
Chen Yang,
Ričardas Zitikis
Abstract:
Assessing dependence within co-movements of financial instruments has been of much interest in risk management. Typically, indices of tail dependence are used to quantify the strength of such dependence, although many of the indices underestimate the strength. Hence, we advocate the use of a statistical procedure designed to estimate the maximal strength of dependence that can possibly occur among the co-movements. We illustrate the procedure using simulated and real data sets.
Submitted 19 September, 2022; v1 submitted 26 July, 2022;
originally announced July 2022.
-
Clustering Structure of Microstructure Measures
Authors:
Liao Zhu,
Ningning Sun,
Martin T. Wells
Abstract:
This paper builds a clustering model of market microstructure measures that are popular in predicting stock returns. At a 10-second frequency, we study the clustering structure of the different measures to identify the best ones for prediction. In this way, we can predict more accurately with a limited number of predictors, which removes noise and makes the model more interpretable.
Submitted 25 December, 2021; v1 submitted 5 July, 2021;
originally announced July 2021.
-
Detecting systematic anomalies affecting systems when inputs are stationary time series
Authors:
Ning Sun,
Chen Yang,
Ričardas Zitikis
Abstract:
We develop an anomaly-detection method for settings in which systematic anomalies, possibly statistically very similar to genuine inputs, affect control systems at the input and/or output stages. The method allows anomaly-free inputs (i.e., those before contamination) to originate from a wide class of random sequences, thus opening up possibilities for diverse applications. To illustrate how the method works on data, and how to interpret its results and make decisions, we analyze several actual time series, which are originally non-stationary but are converted into stationary ones in the process of analysis. As a further illustration, we provide a controlled experiment with anomaly-free inputs following an ARMA time series model under various contamination scenarios.
Submitted 31 January, 2022; v1 submitted 19 November, 2020;
originally announced November 2020.
-
The Alzheimer's Disease Prediction Of Longitudinal Evolution (TADPOLE) Challenge: Results after 1 Year Follow-up
Authors:
Razvan V. Marinescu,
Neil P. Oxtoby,
Alexandra L. Young,
Esther E. Bron,
Arthur W. Toga,
Michael W. Weiner,
Frederik Barkhof,
Nick C. Fox,
Arman Eshaghi,
Tina Toni,
Marcin Salaterski,
Veronika Lunina,
Manon Ansart,
Stanley Durrleman,
Pascal Lu,
Samuel Iddi,
Dan Li,
Wesley K. Thompson,
Michael C. Donohue,
Aviv Nahon,
Yarden Levy,
Dan Halbersberg,
Mariya Cohen,
Huiling Liao,
Tengfei Li,
et al. (71 additional authors not shown)
Abstract:
We present the findings of "The Alzheimer's Disease Prediction Of Longitudinal Evolution" (TADPOLE) Challenge, which compared the performance of 92 algorithms from 33 international teams at predicting the future trajectory of 219 individuals at risk of Alzheimer's disease. Challenge participants were required to make a prediction, for each month of a 5-year future time period, of three key outcomes: clinical diagnosis, Alzheimer's Disease Assessment Scale Cognitive Subdomain (ADAS-Cog13), and total volume of the ventricles. The methods used by challenge participants included multivariate linear regression, machine learning methods such as support vector machines and deep neural networks, as well as disease progression models. No single submission was best at predicting all three outcomes. For clinical diagnosis and ventricle volume prediction, the best algorithms strongly outperform simple baselines in predictive ability. However, for ADAS-Cog13 no single submitted prediction method was significantly better than random guesswork. Two ensemble methods, based on taking the mean and the median over all predictions, obtained top scores on almost all tasks. Better-than-average performance at diagnosis prediction was generally associated with the additional inclusion of features from cerebrospinal fluid (CSF) samples and diffusion tensor imaging (DTI). On the other hand, better performance at ventricle volume prediction was associated with inclusion of summary statistics, such as the slope or maxima/minima of biomarkers. TADPOLE's unique results suggest that current prediction algorithms provide sufficient accuracy to exploit biomarkers related to clinical diagnosis and ventricle volume for cohort refinement in clinical trials for Alzheimer's disease. However, the results call into question the usage of cognitive test scores for patient selection and as a primary endpoint in clinical trials.
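The mean and median ensembles mentioned in the abstract amount to combining the submitted forecasts time point by time point. A minimal sketch follows, assuming each submission is a list of monthly values for a single subject and outcome. The helper name `ensemble` and the numbers are hypothetical and are not taken from the challenge data.

```python
from statistics import mean, median


def ensemble(predictions, combine):
    """Combine per-model forecasts month by month with the given statistic."""
    # zip(*predictions) groups the values that all models predict for one month.
    return [combine(month_values) for month_values in zip(*predictions)]


# Hypothetical forecasts from three models over three months (made-up values).
forecasts = [
    [30.0, 31.0, 32.0],
    [28.0, 30.0, 35.0],
    [32.0, 35.0, 50.0],
]

mean_forecast = ensemble(forecasts, mean)      # [30.0, 32.0, 39.0]
median_forecast = ensemble(forecasts, median)  # [30.0, 31.0, 35.0]
```

In the third month the median (35.0) is less affected by the single large forecast of 50.0 than the mean (39.0), which is the usual reason to consider both combiners.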
Submitted 27 December, 2021; v1 submitted 9 February, 2020;
originally announced February 2020.
-
Spectral algorithms for tensor completion
Authors:
Andrea Montanari,
Nike Sun
Abstract:
In the tensor completion problem, one seeks to estimate a low-rank tensor based on a random sample of revealed entries. In terms of the required sample size, earlier work revealed a large gap between estimation with unbounded computational resources (using, for instance, tensor nuclear norm minimization) and polynomial-time algorithms. Among the latter, the best statistical guarantees have been proved, for third-order tensors, using the sixth level of the sum-of-squares (SOS) semidefinite programming hierarchy (Barak and Moitra, 2014). However, the SOS approach does not scale well to large problem instances. By contrast, spectral methods, based on unfolding or matricizing the tensor, are attractive for their low complexity, but have been believed to require a much larger sample size.
This paper presents two main contributions. First, we propose a new unfolding-based method, which outperforms naive ones for symmetric $k$-th order tensors of rank $r$. For this result we make a study of singular space estimation for partially revealed matrices of large aspect ratio, which may be of independent interest. For third-order tensors, our algorithm matches the SOS method in terms of sample size (requiring about $rd^{3/2}$ revealed entries), subject to a worse rank condition ($r\ll d^{3/4}$ rather than $r\ll d^{3/2}$). We complement this result with a different spectral algorithm for third-order tensors in the overcomplete ($r\ge d$) regime. Under a random model, this second approach succeeds in estimating tensors of rank $d\le r \ll d^{3/2}$ from about $rd^{3/2}$ revealed entries.
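For readers unfamiliar with unfolding (matricizing), the sketch below shows the standard mode-1 unfolding of a third-order tensor into a matrix. It illustrates only the naive reshaping step, not the paper's new unfolding-based method or its spectral estimator. The function name `unfold_mode1` and the nested-list representation are assumptions chosen to keep the example dependency-free.

```python
def unfold_mode1(tensor):
    """Flatten a d1 x d2 x d3 nested-list tensor into a d1 x (d2*d3) matrix.

    Row i of the result lists the entries T[i][j][k] with column index j*d3 + k.
    """
    return [[entry for fibre in slab for entry in fibre] for slab in tensor]


# Toy 2 x 2 x 2 tensor whose entry encodes its own index: T[i][j][k] = 100*i + 10*j + k.
d = 2
T = [[[100 * i + 10 * j + k for k in range(d)] for j in range(d)] for i in range(d)]

M = unfold_mode1(T)
# M == [[0, 1, 10, 11], [100, 101, 110, 111]]
```

For a $d \times d \times d$ tensor the unfolded matrix has $d$ rows and $d^2$ columns, which is one way a matrix of large aspect ratio arises in this setting.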
Submitted 22 December, 2016;
originally announced December 2016.