Search | arXiv e-print repository

arXiv:2406.19213 [pdf, other]

Comparing Lasso and Adaptive Lasso in High-Dimensional Data: A Genetic Survival Analysis in Triple-Negative Breast Cancer

Authors: Pilar González-Barquero, Rosa E. Lillo, Álvaro Méndez-Civieta

Abstract: This study aims to evaluate the performance of Cox regression with lasso penalty and adaptive lasso penalty in high-dimensional settings. Variable selection methods are necessary in this context to reduce dimensionality and make the problem feasible. Several weight calculation procedures for adaptive lasso are proposed to determine if they offer an improvement over lasso, as adaptive lasso addresses its inherent bias. These proposed weights are based on principal component analysis, ridge regression, univariate Cox regressions and random survival forest (RSF). The proposals are evaluated in simulated datasets. A real application of these methodologies in the context of genomic data is also carried out. The study consists of determining the variables, clinical and genetic, that influence the survival of patients with triple-negative breast cancer (TNBC), which is a type of breast cancer with low survival rates due to its aggressive nature.

Submitted 27 June, 2024; originally announced June 2024.

Comments: 39 pages, 2 figures, 8 tables

arXiv:2406.01588 [pdf, other]

nn2poly: An R Package for Converting Neural Networks into Interpretable Polynomials

Authors: Pablo Morala, Jenny Alexandra Cifuentes, Rosa E. Lillo, Iñaki Ucar

Abstract: The nn2poly package provides the implementation in R of the NN2Poly method to explain and interpret feed-forward neural networks by means of polynomial representations that predict in an equivalent manner as the original network. Through the obtained polynomial coefficients, the effect and importance of each variable and their interactions on the output can be represented. This capability of capturing interactions is a key aspect usually missing from most Explainable Artificial Intelligence (XAI) methods, especially if they rely on expensive computations that can be amplified when used on large neural networks. The package provides integration with the main deep learning framework packages in R (tensorflow and torch), allowing a user-friendly application of the NN2Poly algorithm. Furthermore, nn2poly provides implementation of the required weight constraints to be used during the network training in those same frameworks. Other neural network packages can also be used by including their weights in list format. Polynomials obtained with nn2poly can also be used to predict with new data or be visualized through its own plot method. Simulations are provided exemplifying the usage of the package alongside a comparison with other approaches available in R to interpret neural networks.

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2401.15225 [pdf, other]

doi 10.1016/j.ress.2020.107318

A bivariate two-state Markov modulated Poisson process for failure modelling

Authors: Yoel G. Yera, Rosa E. Lillo, Bo F. Nielsen, Pepa Ramírez-Cobo, Fabrizio Ruggeri

Abstract: Motivated by a real failure dataset in a two-dimensional context, this paper presents an extension of the Markov modulated Poisson process (MMPP) to two dimensions. The one-dimensional MMPP has been proposed for the modeling of dependent and non-exponential inter-failure times (in contexts such as queuing, risk or reliability, among others). The novel two-dimensional MMPP allows for dependence between the two sequences of inter-failure times, while at the same time preserving the MMPP properties marginally. The generalization is based on the Marshall-Olkin exponential distribution. Inference is undertaken for the new model through a method combining a matching moments approach with an Approximate Bayesian Computation (ABC) algorithm. The performance of the method is shown on simulated and real datasets representing times and distances covered between consecutive failures in a public transport company. For the real dataset, some quantities of importance associated with the reliability of the system are estimated, such as the probabilities and expected number of failures at different times and distances covered by trains until the occurrence of a failure.

Submitted 26 January, 2024; originally announced January 2024.

Journal ref: Reliability Engineering and System Safety 208(2021) 107318

arXiv:2401.14561 [pdf, other]

doi 10.1016/j.ejor.2019.04.018

Fitting procedure for the two-state Batch Markov modulated Poisson process

Authors: Yoel G. Yera, Rosa E. Lillo, Pepa Ramírez-Cobo

Abstract: The Batch Markov Modulated Poisson Process (BMMPP) is a subclass of the versatile Batch Markovian Arrival process (BMAP) which has been proposed for the modeling of dependent events occurring in batches (such as group arrivals, failures or risk events). This paper focuses on exploring the possibilities of the BMMPP for the modeling of real phenomena involving point processes with group arrivals. The first result in this sense is the characterization of the two-state BMMPP with maximum batch size equal to K, the BMMPP2(K), by a set of moments related to the inter-event time and batch size distributions. This characterization leads to a sequential fitting approach via a moments matching method. The performance of the novel fitting approach is illustrated on both simulated data and a real teletraffic data set, and compared to that of the EM algorithm. In addition, as an extension of the inference approach, the queue length distributions at departures in the queueing system BMMPP/M/1 are also estimated.

Submitted 25 January, 2024; originally announced January 2024.

Journal ref: European Journal of Operational Research (2019)

arXiv:2401.14553 [pdf, ps, other]

Analysis of an aggregate loss model in a Markov renewal regime

Authors: Pepa Ramírez-Cobo, Emilio Carrizosa, Rosa Elvira Lillo

Abstract: In this article we consider an aggregate loss model with dependent losses. The loss occurrence process is governed by a two-state Markovian arrival process (MAP2), a Markov renewal process that allows for (1) correlated inter-loss times, (2) non-exponentially distributed inter-loss times and (3) overdispersed loss counts. Some quantities of interest to measure persistence in the loss occurrence process are obtained. Given a real operational risk database, the aggregate loss model is estimated by fitting separately the inter-loss times and severities. The MAP2 is estimated via direct maximization of the likelihood function, and severities are modeled by the heavy-tailed, double-Pareto Lognormal distribution. In comparison with the fit provided by the Poisson process, the results point out that taking into account the dependence and overdispersion in the inter-loss times distribution leads to higher capital charges.

Submitted 4 February, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

Journal ref: Applied Mathematics and Computation (2021)

arXiv:2307.16720 [pdf, other]

The epigraph and the hypograph indexes as useful tools for clustering multivariate functional data

Authors: Belén Pulido, Alba M. Franco-Pereira, Rosa E. Lillo

Abstract: The proliferation of data generation has spurred advancements in functional data analysis. With the ability to analyze multiple variables simultaneously, the demand for working with multivariate functional data has increased. This study proposes a novel formulation of the epigraph and hypograph indexes, as well as their generalized expressions, specifically tailored for the multivariate functional context. These definitions take into account the interrelations between components. Furthermore, the proposed indexes are employed to cluster multivariate functional data. In the clustering process, the indexes are applied to both the data and their first and second derivatives. This generates a reduced-dimension dataset from the original multivariate functional data, enabling the application of well-established multivariate clustering techniques which have been extensively studied in the literature. This methodology has been tested through simulated and real datasets, performing comparative analyses against state-of-the-art methods to assess its performance.

Submitted 17 October, 2023; v1 submitted 31 July, 2023; originally announced July 2023.

Comments: 32 pages

arXiv:2307.06643 [pdf, other]

Nowcasting Temporal Trends Using Indirect Surveys

Authors: Ajitesh Srivastava, Juan Marcos Ramírez, Sergio Díaz-Aranda, Jose Aguilar, Antonio Ortega, Antonio Fernández Anta, Rosa Elvira Lillo

Abstract: Indirect surveys, in which respondents provide information about other people they know, have been proposed for estimating (nowcasting) the size of a hidden population where privacy is important or the hidden population is hard to reach. Examples include estimating casualties in an earthquake, conditions among female sex workers, and the prevalence of drug use and infectious diseases. The Network Scale-up Method (NSUM) is the classical approach to developing estimates from indirect surveys, but it was designed for one-shot surveys. Further, it requires certain assumptions and asking for or estimating the number of individuals in each respondent's network. In recent years, surveys have been increasingly deployed online and can collect data continuously (e.g., COVID-19 surveys on Facebook during much of the pandemic). Conventional NSUM can be applied to these scenarios by analyzing the data independently at each point in time, but this misses the opportunity of leveraging the temporal dimension. We propose to use the responses from indirect surveys collected over time and develop analytical tools (i) to prove that indirect surveys can provide better estimates for the trends of the hidden population over time, as compared to direct surveys and (ii) to identify appropriate temporal aggregations to improve the estimates. We demonstrate through extensive simulations that our approach outperforms traditional NSUM and direct surveying methods. We also empirically demonstrate the superiority of our approach on a real indirect survey dataset of COVID-19 cases.

Submitted 14 December, 2023; v1 submitted 13 July, 2023; originally announced July 2023.

Comments: Accepted at AAAI 2024

ACM Class: G.3

arXiv:2207.12803 [pdf, other]

Multivariate Functional Outlier Detection using the FastMUOD Indices

Authors: Oluwasegun Taiwo Ojo, Antonio Fernández Anta, Marc G. Genton, Rosa E. Lillo

Abstract: We present definitions and properties of the fast massive unsupervised outlier detection (FastMUOD) indices, used for outlier detection (OD) in functional data. FastMUOD detects outliers by computing, for each curve, an amplitude, magnitude and shape index meant to target the corresponding types of outliers. Some methods adapting FastMUOD to outlier detection in multivariate functional data are then proposed. These include applying FastMUOD on the components of the multivariate data and using random projections. Moreover, these techniques are tested on various simulated and real multivariate functional datasets. Compared with the state of the art in multivariate functional OD, the use of random projections showed the most effective results with similar, and in some cases improved, OD performance.

Submitted 26 July, 2022; originally announced July 2022.

arXiv:2112.11397 [pdf, other]

doi 10.1109/TNNLS.2023.3330328

NN2Poly: A polynomial representation for deep feed-forward artificial neural networks

Authors: Pablo Morala, Jenny Alexandra Cifuentes, Rosa E. Lillo, Iñaki Ucar

Abstract: Interpretability of neural networks and their underlying theoretical behavior remain an open field of study even after the great success of their practical applications, particularly with the emergence of deep learning. In this work, NN2Poly is proposed: a theoretical approach to obtain an explicit polynomial model that provides an accurate representation of an already trained fully-connected feed-forward artificial neural network (a multilayer perceptron or MLP). This approach extends a previous idea proposed in the literature, which was limited to single hidden layer networks, to work with arbitrarily deep MLPs in both regression and classification tasks. NN2Poly uses a Taylor expansion on the activation function, at each layer, and then applies several combinatorial properties to calculate the coefficients of the desired polynomials. Discussion is presented on the main computational challenges of this method, and the way to overcome them by imposing certain constraints during the training phase. Finally, simulation experiments as well as applications to real tabular data sets are presented to demonstrate the effectiveness of the proposed method.

Submitted 25 September, 2023; v1 submitted 21 December, 2021; originally announced December 2021.

Journal ref: IEEE Transactions on Neural Networks and Learning Systems (2023, Early Access)

arXiv:2111.00472 [pdf, other]

Asgl: A Python Package for Penalized Linear and Quantile Regression

Authors: Álvaro Méndez Civieta, M. Carmen Aguilera-Morillo, Rosa E. Lillo

Abstract: Asgl is a Python package that solves penalized linear regression and quantile regression models for simultaneous variable selection and prediction, for both high and low dimensional frameworks. It makes it very easy to set up and solve different types of lasso-based penalizations, among which the asgl (adaptive sparse group lasso, which gives the package its name) stands out. This package is built on top of cvxpy, a Python-embedded modeling language for convex optimization problems, and makes extensive use of multiprocessing, a Python module for parallel computing that significantly reduces computation times of asgl.

Submitted 31 October, 2021; originally announced November 2021.

Comments: 31 pages, 1 figure, 1 table

arXiv:2110.07998 [pdf, other]

Fast Partial Quantile Regression

Authors: Alvaro Mendez Civieta, M. Carmen Aguilera-Morillo, Rosa E. Lillo

Abstract: Partial least squares (PLS) is a dimensionality reduction technique used as an alternative to ordinary least squares (OLS) in situations where the data is collinear or high dimensional. Both PLS and OLS provide mean based estimates, which are extremely sensitive to the presence of outliers or heavy tailed distributions. In contrast, quantile regression is an alternative to OLS that computes robust quantile based estimates. In this work, the multivariate PLS is extended to the quantile regression framework, obtaining a theoretical formulation of the problem and a robust dimensionality reduction technique that we call fast partial quantile regression (fPQR), which provides quantile based estimates. An efficient implementation of fPQR is also derived, and its performance is studied through simulation experiments and the well-known chemometrics biscuit dough dataset, a real high dimensional example.

Submitted 15 October, 2021; originally announced October 2021.

Comments: 22 pages, 5 figures and 5 tables

MSC Class: 62-08; 62Hxx; 62Jxx ACM Class: G.3

arXiv:2108.00217 [pdf, other]

doi 10.1007/s11222-023-10213-7

Functional clustering via multivariate clustering

Authors: Belén Pulido, Alba María Franco-Pereira, Rosa Elvira Lillo

Abstract: Clustering techniques applied to multivariate data are a very useful tool in Statistics and have been fully studied in the literature. Nevertheless, these clustering methodologies are less well known when dealing with functional data. Our proposal consists of introducing a clustering procedure for functional data using the very well known techniques for clustering multivariate data. The idea is to reduce a functional data problem to a multivariate data problem by applying the epigraph and the hypograph indexes to the original data and to its first and second derivatives. All the information given by the functional data is therefore transformed to the multivariate context, being sufficiently informative for the usual multivariate clustering techniques to be efficient. The performance of this new methodology is evaluated through a simulation study and it is also illustrated through real data sets.

Submitted 31 July, 2021; originally announced August 2021.

arXiv:2105.05213 [pdf, other]

Outlier Detection for Functional Data with R Package fdaoutlier

Authors: Oluwasegun Ojo, Rosa E. Lillo, Antonio Fernández Anta

Abstract: Outlier detection is one of the standard exploratory analysis tasks in functional data analysis. We present the R package fdaoutlier which contains implementations of some of the latest techniques for detecting functional outliers. The package makes it easy to detect different types of outliers (magnitude, shape, and amplitude) in functional data, and some of the implemented methods can be applied to both univariate and multivariate functional data. We illustrate the main functionality of the R package with common functional datasets in the literature.

Submitted 14 October, 2021; v1 submitted 11 May, 2021; originally announced May 2021.

arXiv:2102.03865 [pdf, other]

doi 10.1016/j.neunet.2021.04.036

Towards a mathematical framework to inform Neural Network modelling via Polynomial Regression

Authors: Pablo Morala, Jenny Alexandra Cifuentes, Rosa E. Lillo, Iñaki Ucar

Abstract: Even though neural networks are widely used in a large number of applications, they are still considered as black boxes and present some difficulties for dimensioning or evaluating their prediction error. This has led to an increasing interest in the overlapping area between neural networks and more traditional statistical methods, which can help overcome those problems. In this article, a mathematical framework relating neural networks and polynomial regression is explored by building an explicit expression for the coefficients of a polynomial regression from the weights of a given neural network, using a Taylor expansion approach. This is achieved for single hidden layer neural networks in regression problems. The validity of the proposed method depends on different factors like the distribution of the synaptic potentials or the chosen activation function. The performance of this method is empirically tested via simulation of synthetic data generated from polynomials to train neural networks with different structures and hyperparameters, showing that almost identical predictions can be obtained when certain conditions are met. Lastly, when learning from polynomial generated data, the proposed method produces polynomials that correctly approximate the data locally.

Submitted 7 February, 2021; originally announced February 2021.

Comments: 39 pages, 15 figures

Journal ref: Neural Networks 142 (2021), 57-72

arXiv:2009.06357 [pdf]

Automatic elimination of the pectoral muscle in mammograms based on anatomical features

Authors: Jairo A. Ayala-Godoy, Rosa E. Lillo, Juan Romo

Abstract: Digital mammogram inspection is the most popular technique for early detection of abnormalities in human breast tissue. When mammograms are analyzed through a computational method, the presence of the pectoral muscle might affect the results of breast lesions detection. This problem is particularly evident in the mediolateral oblique view (MLO), where pectoral muscle occupies a large part of the mammography. Therefore, identifying and eliminating the pectoral muscle are essential steps for improving the automatic discrimination of breast tissue. In this paper, we propose an approach based on anatomical features to tackle this problem. Our method consists of two steps: (1) a process to remove the noisy elements such as labels, markers, scratches and wedges, and (2) application of an intensity transformation based on the Beta distribution. The novel methodology is tested with 322 digital mammograms from the Mammographic Image Analysis Society (mini-MIAS) database and with a set of 84 mammograms for which the area normalized error was previously calculated. The results show a very good performance of the method.

Submitted 17 August, 2020; originally announced September 2020.

Journal ref: International Journal of Computer Science Issues; 2020

arXiv:1912.07287 [pdf, other]

doi 10.1007/s11634-021-00460-9

Detecting and Classifying Outliers in Big Functional Data

Authors: Oluwasegun Taiwo Ojo, Antonio Fernández Anta, Rosa E. Lillo, Carlo Sguera

Abstract: We propose two new outlier detection methods, for identifying and classifying different types of outliers in (big) functional data sets. The proposed methods are based on an existing method called Massive Unsupervised Outlier Detection (MUOD). MUOD detects and classifies outliers by computing, for each curve, three indices, all based on the concept of linear regression and correlation, which measure outlyingness in terms of shape, magnitude and amplitude, relative to the other curves in the data. 'Semifast-MUOD', the first method, uses a sample of the observations in computing the indices, while 'Fast-MUOD', the second method, uses the point-wise or $L_1$ median in computing the indices. The classical boxplot is used to separate the indices of the outliers from those of the typical observations. Performance evaluation of the proposed methods using simulated data shows significant improvements compared to MUOD, both in outlier detection and computational time. We show that Fast-MUOD is especially well suited to handling big and dense functional datasets with very small computational time compared to other methods. Further comparisons with some recent outlier detection methods for functional data also show superior or comparable outlier detection accuracy of the proposed methods. We apply the proposed methods on weather, population growth, and video data.

Submitted 14 October, 2021; v1 submitted 16 December, 2019; originally announced December 2019.

MSC Class: 62R10 (Functional data analysis)

arXiv:1911.01081 [pdf, other]

Quantile regression: a penalization approach

Authors: Álvaro Méndez Civieta, M. Carmen Aguilera-Morillo, Rosa E. Lillo

Abstract: Sparse group LASSO (SGL) is a penalization technique used in regression problems where the covariates have a natural grouped structure and provides solutions that are both between and within group sparse. In this paper the SGL is introduced to the quantile regression (QR) framework, and a more flexible version, the adaptive sparse group LASSO (ASGL), is proposed. This proposal adds weights to the penalization improving prediction accuracy. Usually, adaptive weights are taken as a function of the original nonpenalized solution model. This approach is only feasible in the n > p framework. In this work, a solution that allows using adaptive weights in high-dimensional scenarios is proposed. The benefits of this proposal are studied both in synthetic and real datasets.

Submitted 4 November, 2019; originally announced November 2019.

Comments: 9 figures, 5 tables

arXiv:1905.02962 [pdf, other]

doi 10.1007/s00477-020-01774-4

Robust regression based on shrinkage estimators

Authors: Elisa Cabana, Rosa E. Lillo, Henry Laniado

Abstract: A robust estimator is proposed for the parameters that characterize the linear regression problem. It is based on the notion of shrinkages, often used in Finance and previously studied for outlier detection in multivariate data. A thorough simulation study is conducted to investigate: the efficiency with normal and heavy-tailed errors, the robustness under contamination, the computational times, the affine equivariance and breakdown value of the regression estimator. Two classical data-sets often used in the literature and a real socio-economic data-set about the Living Environment Deprivation of areas in Liverpool (UK), are studied. The results from the simulations and the real data examples show the advantages of the proposed robust estimator in regression.

Submitted 8 May, 2019; originally announced May 2019.

arXiv:1904.02596 [pdf, other]

doi 10.1007/s00362-019-01148-1

Multivariate outlier detection based on a robust Mahalanobis distance with shrinkage estimators

Authors: Elisa Cabana, Rosa E. Lillo, Henry Laniado

Abstract: A collection of robust Mahalanobis distances for multivariate outlier detection is proposed, based on the notion of shrinkage. Robust intensity and scaling factors are optimally estimated to define the shrinkage. Some properties are investigated, such as affine equivariance and breakdown value. The performance of the proposal is illustrated through the comparison to other techniques from the literature, in a simulation study and with a real dataset. The behavior when the underlying distribution is heavy-tailed or skewed shows the appropriateness of the method when we deviate from the common assumption of normality. The resulting high correct detection rates and low false detection rates in the vast majority of cases, as well as the significantly smaller computation time, show the advantages of our proposal.

Submitted 4 April, 2019; originally announced April 2019.

Journal ref: Stat Papers (2019)

arXiv:1610.08386 [pdf, other]

On the estimation of extreme directional multivariate quantiles

Authors: Raúl Torres, Elena Di Bernardino, Henry Laniado, Rosa E. Lillo

Abstract: In multivariate extreme value theory (MEVT), the focus is on analysis outside of the observable sampling zone, which implies that the region of interest is associated to high risk levels. This work provides tools to include directional notions into the MEVT, giving the opportunity to characterize the recently introduced directional multivariate quantiles (DMQ) at high levels. Then, an out-sample estimation method for these quantiles is given. A bootstrap procedure carries out the estimation of the tuning parameter in this multivariate framework and helps with the estimation of the DMQ. Asymptotic normality for the proposed estimator is provided and the methodology is illustrated with simulated data-sets. Finally, a real-life application to a financial case is also performed.

Submitted 4 December, 2018; v1 submitted 26 October, 2016; originally announced October 2016.

arXiv:1607.05042 [pdf, ps, other]

An empirical comparison of global and local functional depths

Authors: Carlo Sguera, Rosa E. Lillo

Abstract: A functional data depth provides a center-outward ordering criterion which allows the definition of measures such as median, trimmed means, central regions or ranks in a functional framework. A functional data depth can be global or local. With global depths, the degree of centrality of a curve $x$ depends equally on the rest of the sample observations, while with local depths, the contribution of each observation in defining the degree of centrality of $x$ decreases as the distance from $x$ increases. We empirically compare the global and the local approaches to the functional depth problem focusing on three global and two local functional depths. First, we consider two real data sets and show that global and local depths may provide different insights. Second, we use simulated data to show when we should expect differences between a global and a local approach to the functional depth problem.

Submitted 5 July, 2018; v1 submitted 18 July, 2016; originally announced July 2016.

arXiv:1606.01797 [pdf, other]

doi 10.1002/env.2428

Directional Multivariate Extremes in Environmental Phenomena

Authors: Raúl Torres, Carlo De Michele, Henry Laniado, Rosa E. Lillo

Abstract: Several environmental phenomena can be described by different correlated variables that must be considered jointly in order to be more representative of the nature of these phenomena. For such events, identification of extremes is inappropriate if it is based on marginal analysis. Extremes have usually been linked to the notion of quantile, which is an important tool to analyze risk in the univariate setting. We propose to identify multivariate extremes and analyze environmental phenomena in terms of the directional multivariate quantile, which allows us to analyze the data considering all the variables implied in the phenomena, as well as look at the data in interesting directions that can better describe an environmental catastrophe. Since there are many references in the literature that propose extremes detection based on copula models, we also generalize the copula method by introducing the directional approach. Advantages and disadvantages of the non-parametric proposal that we introduce and the copula methods are provided in the paper. We show with simulated and real data sets how by considering the first principal component direction we can improve the visualization of extremes. Finally, two cases of study are analyzed: a synthetic case of flood risk at a dam (a 3-variable case), and a real case study of sea storms (a 5-variable case).

Submitted 10 June, 2016; v1 submitted 6 June, 2016; originally announced June 2016.

Comments: Article with supplementary material in the appendix

Journal ref: Environmetrics, Volume 28, Issue 2 March 2017 e2428

arXiv:1502.00908 [pdf, ps, other]

doi 10.1016/j.insmatheco.2015.09.002

A Directional Multivariate Value at Risk

Authors: Raúl Torres, Rosa E. Lillo, Henry Laniado

Abstract: In economics, insurance and finance, value at risk (VaR) is a widely used measure of the risk of loss on a specific portfolio of financial assets. For a given portfolio, time horizon, and probability $α$, the $100α\%$ VaR is defined as a threshold loss value, such that the probability that the loss on the portfolio over the given time horizon exceeds this value is $α$. That is to say, it is a quantile of the distribution of the losses, which has both good analytic properties and easy interpretation as a risk measure. However, its extension to the multivariate framework is not unique because a unique definition of multivariate quantile does not exist. In the current literature, the multivariate quantiles are related to a specific partial order considered in $\mathbb{R}^{n}$, or to a property of the univariate quantile that is desirable to be extended to $\mathbb{R}^{n}$. In this work, we introduce a multivariate value at risk as a vector-valued directional risk measure, based on a directional multivariate quantile, which has recently been introduced in the literature. The directional approach allows the manager to consider external information or risk preferences in her/his analysis. We have derived some properties of the risk measure and we have compared the univariate VaR over the marginals with the components of the directional multivariate VaR. We have also analyzed the relationship between some families of copulas, for which it is possible to obtain closed forms of the multivariate VaR that we propose. Finally, comparisons with other alternative multivariate VaR given in the literature are provided in terms of robustness.

Submitted 3 February, 2015; originally announced February 2015.

Comments: 30 pages, 9 figures

Journal ref: Insurance: Mathematics and Economics, Volume 65, November 2015, Pages 111-123

arXiv:1409.1816 [pdf, ps, other]

Extremality measures and a rank test for functional data

Authors: A. M. Franco-Pereira, R. E. Lillo, J. Romo

Abstract: The statistical analysis of functional data is a growing need in many research areas. In particular, a robust methodology is important to study curves, which are the output of experiments in applied statistics. In this paper we study some new definitions which reflect the "extremality" of a curve with respect to a collection of functions, and provide natural orderings for sample curves. Their finite dimensional versions are computationally feasible and useful for studying high dimensional observations. Thus, these extreme measures are suitable for complex observations such as microarray data and images. We show the applicability of these measures by designing a rank test for functional data. This functional rank test shows different growth patterns for boys and girls when it is applied to children growth data.

Submitted 4 September, 2014; originally announced September 2014.

Comments: 20 pages, 11 figures

arXiv:1304.4786 [pdf, other]

The Mahalanobis distance for functional data with applications to classification

Authors: Esdras Joseph, Pedro Galeano, Rosa E. Lillo

Abstract: This paper presents a general notion of Mahalanobis distance for functional data that extends the classical multivariate concept to situations where the observed data are points belonging to curves generated by a stochastic process. More precisely, a new semi-distance for functional observations that generalizes the usual Mahalanobis distance for multivariate datasets is introduced. For that, the development uses a regularized square root inverse operator in Hilbert spaces. Some of the main characteristics of the functional Mahalanobis semi-distance are shown. Afterwards, new versions of several well known functional classification procedures are developed using the Mahalanobis distance for functional data as a measure of proximity between functional observations. The performance of several well known functional classification procedures is compared with that of those methods used in conjunction with the Mahalanobis distance for functional data, with positive results, through a Monte Carlo study and the analysis of two real data examples.

Submitted 17 April, 2013; originally announced April 2013.

arXiv:1011.3411 [pdf, ps, other]

doi 10.1214/10-AOAS336

Bayesian inference for double Pareto lognormal queues

Authors: Pepa Ramirez-Cobo, Rosa E. Lillo, Simon Wilson, Michael P. Wiper

Abstract: In this article we describe a method for carrying out Bayesian estimation for the double Pareto lognormal (dPlN) distribution which has been proposed as a model for heavy-tailed phenomena. We apply our approach to estimate the $\mathit{dPlN}/M/1$ and $M/\mathit{dPlN}/1$ queueing systems. These systems cannot be analyzed using standard techniques due to the fact that the dPlN distribution does not possess a Laplace transform in closed form. This difficulty is overcome using some recent approximations for the Laplace transform of the interarrival distribution for the $\mathit{Pareto}/M/1$ system. Our procedure is illustrated with applications in internet traffic analysis and risk theory.

Submitted 15 November, 2010; originally announced November 2010.

Comments: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/10-AOAS336

Report number: IMS-AOAS-AOAS336

Journal ref: Annals of Applied Statistics 2010, Vol. 4, No. 3, 1533-1557

Showing 1–26 of 26 results for author: Lillo, R E