-
Physical dealloying for two-phase heat transfer applications: pool boiling case
Authors:
Artem Nikulin,
Yaroslav Grosu,
Jean-Luc Dauvergne,
Asier Ortuondo,
Elena Palomo del Barrio
Abstract:
In this work, physical dealloying was explored as a simple and green method to microstructure the surface of commercial brass for pool boiling heat transfer coefficient enhancement. Three samples were dealloyed for 0.5, 1 and 3 hours at 650 °C, turning the smooth surface into a porous one with a depth of 175, 200 and 223 µm, respectively. Boiling experiments carried out in ethanol at 78 °C showed that the maximum enhancement of the heat transfer coefficient, between 110 and 150%, was achieved for the sample dealloyed for 0.5 h. Longer dealloying times reduce boiling performance, although it remains much higher than that of smooth brass. This simple method can be customized for various thermal management equipment, such as conventional, plate and micro heat exchangers, all types of heat pipes, HVAC equipment, etc., where heat transfer occurs with phase change.
Submitted 19 June, 2023;
originally announced June 2023.
-
Regularity of center-outward distribution functions in non-convex domains
Authors:
Eustasio del Barrio,
Alberto González Sanz
Abstract:
For a probability $P$ on $\mathbb{R}^d$, its center-outward distribution function $F_{\pm}$, introduced in Chernozhukov et al. (2017) and Hallin et al. (2021), is a new and successful concept of multivariate distribution function based on mass transportation theory. This work proves, for a probability $P$ with density locally bounded away from zero and infinity in its support, the continuity of the center-outward map on the interior of the support of $P$ and the continuity of its inverse, the quantile $Q_{\pm}$. This relaxes the convexity assumption in del Barrio et al. (2020). Some important consequences of this continuity are Glivenko-Cantelli type theorems and a characterisation of weak convergence by the stability of the center-outward map.
Submitted 5 April, 2023; v1 submitted 29 March, 2023;
originally announced March 2023.
-
Using the Sinkhorn divergence in permutation tests for the multivariate two-sample problem
Authors:
E. del Barrio,
J. S. Osorio,
A. J. Quiroz
Abstract:
In order to adapt the Wasserstein distance to the large-sample multivariate non-parametric two-sample problem, making its application computationally feasible, permutation tests based on the Sinkhorn divergence between probability vectors associated with data-dependent partitions are considered. Different ways of implementing these tests are evaluated and the asymptotic distribution of the underlying statistic is established in some cases. The proposed statistics are compared, in simulated examples, with Schilling's test, one of the best non-parametric tests available in the literature.
Submitted 28 September, 2022;
originally announced September 2022.
-
A facile approach for phase change material encapsulation into polymeric flexible fibers using microfluidic principles
Authors:
Mikel Duran,
Artem Nikulin,
Jean-Luc Dauvergne,
Angel Serrano,
Yaroslav Grosu,
Jalel Labidi,
Elena Palomo del Barrio
Abstract:
It is widely agreed that phase change materials (PCMs) are of high interest for a sustainable energy future. Many applications require anti-leakage properties of the PCM, which can be accomplished through PCM encapsulation. In this study, a scalable and considerably simplified approach based on microfluidic principles was successfully designed for the production of polyvinylidene fluoride (PVDF) hollow fibers and leakage-free paraffin-core/PVDF-sheath fibers. The required device can be as simple as syringe+tube+glass capillary. The fibers were created by injecting a PVDF/N,N-dimethylformamide (DMF) solution or a PVDF/DMF/paraffin emulsion into water, followed by a solvent extraction process. The proposed approach results in hollow PVDF or PVDF/paraffin composite fibers with a PCM content between 32 and 47.5% according to DSC and TGA measurements. An SEM study of the fiber morphology has shown that the PCM is in the form of slugs along the fibers. This PCM distribution is maintained until the first melting cycle. Later, the molten PCM spreads within the fiber under capillary forces, which was captured by an infrared camera. Elastic moduli and stress vs. strain curves were measured to characterise the mechanical properties of the designed fibers. Finally, the composite fibers have shown outstanding retention capacity, with only 3.5% PCM mass loss after 1000 melting/crystallisation cycles.
Submitted 13 June, 2022;
originally announced June 2022.
-
Nonparametric Multiple-Output Center-Outward Quantile Regression
Authors:
Eustasio del Barrio,
Alberto Gonzalez Sanz,
Marc Hallin
Abstract:
Based on the novel concept of multivariate center-outward quantiles introduced recently in Chernozhukov et al. (2017) and Hallin et al. (2021), we consider the problem of nonparametric multiple-output quantile regression. Our approach defines nested conditional center-outward quantile regression contours and regions with given conditional probability content irrespective of the underlying distribution; their graphs constitute nested center-outward quantile regression tubes. Empirical counterparts of these concepts are constructed, yielding interpretable empirical regions and contours which are shown to consistently reconstruct their population versions in the Pompeiu-Hausdorff topology. Our method is entirely nonparametric and performs well in simulations including heteroskedasticity and nonlinear trends; its power as a data-analytic tool is illustrated on some real datasets.
Submitted 26 April, 2022; v1 submitted 25 April, 2022;
originally announced April 2022.
-
An improved central limit theorem and fast convergence rates for entropic transportation costs
Authors:
Eustasio del Barrio,
Alberto Gonzalez-Sanz,
Jean-Michel Loubes,
Jonathan Niles-Weed
Abstract:
We prove a central limit theorem for the entropic transportation cost between subgaussian probability measures, centered at the population cost. This is the first result which allows for asymptotically valid inference for entropic optimal transport between measures which are not necessarily discrete. In the compactly supported case, we complement these results with new, faster, convergence rates for the expected entropic transportation cost between empirical measures. Our proof is based on strengthening convergence results for dual solutions to the entropic optimal transport problem.
Submitted 4 May, 2022; v1 submitted 19 April, 2022;
originally announced April 2022.
-
Central Limit Theorems for Semidiscrete Wasserstein Distances
Authors:
Eustasio del Barrio,
Alberto González-Sanz,
Jean-Michel Loubes
Abstract:
We prove a Central Limit Theorem for the empirical optimal transport cost, $\sqrt{\frac{nm}{n+m}}\{\mathcal{T}_c(P_n,Q_m)-\mathcal{T}_c(P,Q)\}$, in the semidiscrete case, i.e., when the distribution $P$ is supported on $N$ points, but without assumptions on $Q$. We show that the asymptotic distribution is the supremum of a centered Gaussian process, which is Gaussian under some additional conditions on the probability $Q$ and on the cost. Such results imply the central limit theorem for the $p$-Wasserstein distance, for $p\geq 1$. This means that, for fixed $N$, the curse of dimensionality is avoided. To better understand the influence of such $N$, we provide bounds on $E|\mathcal{W}_1(P,Q_m)-\mathcal{W}_1(P,Q)|$ depending on $m$ and $N$. Finally, the semidiscrete framework provides a control on the second derivative of the dual formulation, which yields the first central limit theorem for the optimal transport potentials. The results are supported by simulations that help to visualize the given limits and bounds. We also analyse the cases where the classical bootstrap works.
Submitted 13 February, 2022;
originally announced February 2022.
-
Tetralin + fullerene C60 solutions for thermal management of flat-plate photovoltaic/thermal collector
Authors:
Rita Adrião Lamosa,
Igor Motovoy,
Nikita Khliiev,
Artem Nikulin,
Olga Khliyeva,
Ana S. Moita,
Janusz Krupanek,
Yaroslav Grosu,
Vitaly Zhelezny,
Antonio Luis Moreira,
Elena Palomo del Barrio
Abstract:
A new composite heat transfer fluid consisting of tetralin and fullerene has been proposed for photovoltaic thermal hybrid solar harvesting. It features a unique absorption spectrum capable of sharply cutting off solar radiation in the wavelength range from 300 to 650 nm, making it a perfect candidate for simultaneous harvesting of both the photovoltaic and thermal components of solar energy. The proposed composite showed outstanding stability and a facile synthesis route, addressing the two main obstacles to the applicability of nanofluids. It was shown experimentally that adding fullerene to tetralin does not significantly alter its thermophysical properties, apart from viscosity, which increases moderately. Besides, tetralin/fullerene solutions show thermohydraulic performance similar to that of pure tetralin in the laminar flow regime, or insignificantly lower in the transient and turbulent flow regimes. A new figure of merit was proposed to analyze the thermohydraulic performance; it considers not only exergy losses due to kinetic energy dissipation, but also exergy losses associated with a finite temperature difference in the heat exchanger. As a result, the proposed figure of merit indicates a decrease in the heat transfer performance of tetralin/fullerene solutions that is directly proportional to fullerene concentration. The performed simulation suggests that the total energy efficiency of a flat-plate photovoltaic/thermal solar collector reaches up to 60.4%, estimated according to Regulation (EU) No. 811/2013. Finally, life cycle analysis revealed routes for further improvement in terms of environmental impact.
Submitted 20 September, 2021;
originally announced September 2021.
-
Spacing effect on pool boiling performance of three triangular pitched and vertically oriented tubes
Authors:
Artem Nikulin,
Jean-Luc Dauvergne,
Asier Ortuondo,
Elena Palomo del Barrio
Abstract:
There is a scarcity of available data on the boiling process in vertically oriented tube bundles in accessible sources. The lack of systematic studies is limiting further expansion of this highly efficient heat transfer process into the heat recovery field. In this paper, the boiling process of three triangular pitched and vertically oriented tubes has been studied in ethanol at 78$^{\circ}$C. The main focus of this work was to study the effect of tube spacing on the heat transfer coefficient (HTC) and bubble behavior (bubble departure diameter in particular), which was visualised with the help of a high-speed camera. Experiments were performed over a wide range of tube spacings (from 10.75 to 0.25 mm) and heat flux densities (from 3 to 70 kW/m$^2$).
The obtained results show that large spacings, i.e., much larger than the bubble departure diameter, have no influence on the HTC or on bubble behavior. On the contrary, spacings on the order of the bubble departure diameter tend to create slug flow in the bundle, which is very beneficial for heat exchange at low heat fluxes. Finally, narrow spacings that are much smaller than the bubble departure diameter have shown the potential to enhance the HTC in tube bundles with low length-to-diameter ratios.
Submitted 20 June, 2021;
originally announced June 2021.
-
A Central Limit Theorem for Semidiscrete Wasserstein Distances
Authors:
Eustasio del Barrio,
Alberto González-Sanz,
Jean-Michel Loubes
Abstract:
We address the problem of proving a Central Limit Theorem for the empirical optimal transport cost, $\sqrt{n}\{\mathcal{T}_c(P_n,Q)-\mathcal{W}_c(P,Q)\}$, in the semidiscrete case, i.e., when the distribution $P$ is finitely supported. We show that the asymptotic distribution is the supremum of a centered Gaussian process, which is Gaussian under some additional conditions on the probability $Q$ and on the cost. Such results imply the central limit theorem for the $p$-Wasserstein distance, for $p\geq 1$. Finally, the semidiscrete framework provides a control on the second derivative of the dual formulation, which yields the first central limit theorem for the optimal transport potentials.
Submitted 25 May, 2021;
originally announced May 2021.
-
Central Limit Theorems for General Transportation Costs
Authors:
Eustasio del Barrio,
Alberto González-Sanz,
Jean-Michel Loubes
Abstract:
We consider the problem of optimal transportation with general cost between an empirical measure and a general target probability on $\mathbb{R}^d$, with $d \ge 1$. We extend results in [19] and prove asymptotic stability of both optimal transport maps and potentials for a large class of costs in $\mathbb{R}^d$. We derive a central limit theorem (CLT) towards a Gaussian distribution for the empirical transportation cost under minimal assumptions, with a new proof based on the Efron-Stein inequality and on the sequential compactness of the closed unit ball in $L^2(P)$ for the weak topology. We also provide CLTs for empirical Wasserstein distances in the special case of potential costs $|\cdot|^p$, $p > 1$.
Submitted 23 February, 2021; v1 submitted 12 February, 2021;
originally announced February 2021.
-
The complex behaviour of Galton rank order statistic
Authors:
E. del Barrio,
J. A. Cuesta-Albertos,
C. Matran
Abstract:
Galton's rank order statistic is one of the oldest statistical tools for two-sample comparisons. It is also a very natural index to measure departures from stochastic dominance. Yet, its asymptotic behaviour has been investigated only partially, under restrictive assumptions. This work provides a comprehensive study of this behaviour, based on the analysis of the so-called contact set (a modification of the set in which the quantile functions coincide). We show that a.s. convergence to the population counterpart holds if and only if the contact set has zero Lebesgue measure. When this set is finite we show that the asymptotic behaviour is determined by the local behaviour of a suitable reparameterization of the quantile functions in a neighbourhood of the contact points. Regular crossings result in standard rates and Gaussian limiting distributions, but higher order contacts (in the sense introduced in this work) or contacts at the extremes of the supports may result in different rates and non-Gaussian limits.
Submitted 4 February, 2021;
originally announced February 2021.
-
Achieving robustness in classification using optimal transport with hinge regularization
Authors:
Mathieu Serrurier,
Franck Mamalet,
Alberto González-Sanz,
Thibaut Boissin,
Jean-Michel Loubes,
Eustasio del Barrio
Abstract:
Adversarial examples have pointed out the vulnerability of Deep Neural Networks to small local noise. It has been shown that constraining their Lipschitz constant should enhance robustness, but makes them harder to learn with classical loss functions. We propose a new framework for binary classification, based on optimal transport, which integrates this Lipschitz constraint as a theoretical requirement. We propose to learn 1-Lipschitz networks using a new loss that is a hinge-regularized version of the Kantorovich-Rubinstein dual formulation for Wasserstein distance estimation. This loss function has a direct interpretation in terms of adversarial robustness, together with a certifiable robustness bound. We also prove that this hinge-regularized version is still the dual formulation of an optimal transportation problem, and has a solution. We also establish several geometrical properties of this optimal solution, and extend the approach to multi-class problems. Experiments show that the proposed approach provides the expected guarantees in terms of robustness without any significant accuracy drop. The adversarial examples, on the proposed models, visibly and meaningfully change the input, providing an explanation for the classification.
Submitted 26 April, 2021; v1 submitted 11 June, 2020;
originally announced June 2020.
-
The statistical effect of entropic regularization in optimal transportation
Authors:
Eustasio del Barrio,
Jean-Michel Loubes
Abstract:
We propose to tackle the problem of understanding the effect of regularization in Sinkhorn algorithms. In the case of Gaussian distributions we provide a closed form for the regularized optimal transport, which enables a better understanding of the effect of the regularization within a statistical framework.
Submitted 15 June, 2020; v1 submitted 9 June, 2020;
originally announced June 2020.
-
Review of Mathematical frameworks for Fairness in Machine Learning
Authors:
Eustasio del Barrio,
Paula Gordaliza,
Jean-Michel Loubes
Abstract:
A review of the main fairness definitions and fair learning methodologies proposed in the literature in recent years is presented from a mathematical point of view. Following our independence-based approach, we consider how to build fair algorithms and the consequences on the degradation of their performance compared to the possibly unfair case. This corresponds to the price for fairness given by the criteria $\textit{statistical parity}$ or $\textit{equality of odds}$. Novel results giving the expressions of the optimal fair classifier and the optimal fair predictor (under a linear regression Gaussian model) in the sense of $\textit{equality of odds}$ are presented.
Submitted 26 May, 2020;
originally announced May 2020.
-
A survey of bias in Machine Learning through the prism of Statistical Parity for the Adult Data Set
Authors:
Philippe Besse,
Eustasio del Barrio,
Paula Gordaliza,
Jean-Michel Loubes,
Laurent Risser
Abstract:
Applications based on Machine Learning models have now become an indispensable part of everyday life and the professional world. A critical question has recently arisen among the population: Do algorithmic decisions convey any type of discrimination against specific groups of the population or minorities? In this paper, we show the importance of understanding how a bias can be introduced into automatic decisions. We first present a mathematical framework for the fair learning problem, specifically in the binary classification setting. We then propose to quantify the presence of bias by using the standard Disparate Impact index on the real and well-known Adult income data set. Finally, we check the performance of different approaches aiming to reduce the bias in binary classification outcomes. Importantly, we show that some intuitive methods are ineffective. This sheds light on the fact that trying to make fair machine learning models may be a particularly challenging task, especially when the training observations contain a bias.
Submitted 6 April, 2020; v1 submitted 31 March, 2020;
originally announced March 2020.
-
A note on the Regularity of Center-Outward Distribution and Quantile Functions
Authors:
Eustasio del Barrio,
Alberto González-Sanz,
Marc Hallin
Abstract:
We provide sufficient conditions under which the center-outward distribution and quantile functions introduced in Chernozhukov et al. (2017) and Hallin (2017) are homeomorphisms, thereby extending a recent result by Figalli \cite{Fi2}. Our approach relies on Caffarelli's classical regularity theory for the solutions of the Monge-Ampère equation, but has to deal with difficulties related to the unboundedness at the origin of the density of the spherical uniform reference measure. Our conditions are satisfied by probabilities on Euclidean space with a general (bounded or unbounded) convex support which are not covered in \cite{Fi2}. We provide some additional results about center-outward distribution and quantile functions, including the fact that quantile sets exhibit some weak form of convexity.
Submitted 23 December, 2019;
originally announced December 2019.
-
optimalFlow: Optimal-transport approach to flow cytometry gating and population matching
Authors:
Eustasio del Barrio,
Hristo Inouzhe,
Jean-Michel Loubes,
Carlos Matrán,
Agustín Mayo-Íscar
Abstract:
Data obtained from Flow Cytometry present pronounced variability due to biological and technical reasons. Biological variability is a well-known phenomenon produced by measurements on different individuals, with different characteristics such as illness, age, sex, etc. The use of different settings for measurement, the variation of the conditions during experiments and the different types of flow cytometers are some of the technical causes of variability. This mixture of sources of variability makes the use of supervised machine learning for identification of cell populations difficult. The present work is conceived as a combination of strategies to facilitate the task of supervised gating.
We propose optimalFlowTemplates, based on a similarity distance and Wasserstein barycenters, which clusters cytometries and produces prototype cytometries for the different groups. We show that supervised learning, restricted to the new groups, performs better than the same techniques applied to the whole collection. We also present optimalFlowClassification, which uses a database of gated cytometries and optimalFlowTemplates to assign cell types to a new cytometry. We show that this procedure can outperform state-of-the-art techniques in the proposed datasets. Our code is freely available as optimalFlow, a Bioconductor R package, at https://bioconductor.org/packages/optimalFlow.
optimalFlowTemplates+optimalFlowClassification addresses the problem of using supervised learning while accounting for biological and technical variability. Our methodology provides a robust automated gating workflow that handles the intrinsic variability of flow cytometry data well. Our main innovation is the methodology itself and the optimal-transport techniques that we apply to flow cytometry analysis.
Submitted 29 April, 2020; v1 submitted 18 July, 2019;
originally announced July 2019.
-
Attraction-Repulsion clustering with applications to fairness
Authors:
Eustasio del Barrio,
Hristo Inouzhe,
Jean-Michel Loubes
Abstract:
We consider the problem of diversity-enhancing clustering, i.e., developing clustering methods which produce clusters that favour diversity with respect to a set of protected attributes such as race, sex, age, etc. In the context of fair clustering, diversity plays a major role when fairness is understood as demographic parity. To promote diversity, we introduce perturbations to the distance in the unprotected attributes that account for protected attributes in a way that resembles attraction-repulsion of charged particles in Physics. These perturbations are defined through dissimilarities with a tractable interpretation. Cluster analysis based on attraction-repulsion dissimilarities penalizes homogeneity of the clusters with respect to the protected attributes and leads to an improvement in diversity. An advantage of our approach, which falls into a pre-processing set-up, is its compatibility with a wide variety of clustering methods and with non-Euclidean data. We illustrate the use of our procedures with both synthetic and real data and provide a discussion of the relation between diversity, fairness, and cluster structure. Our procedures are implemented in an R package freely available at https://github.com/HristoInouzhe/AttractionRepulsionClustering.
Submitted 26 October, 2021; v1 submitted 10 April, 2019;
originally announced April 2019.
-
On approximate validation of models: A Kolmogorov-Smirnov based approach
Authors:
Eustasio del Barrio,
Hristo Inouzhe,
Carlos Matrán
Abstract:
Classical tests of fit typically reject a model for large enough real data samples. In contrast, often in statistical practice a model offers a good description of the data even though it is not the "true" random generator. We consider a more flexible approach based on contamination neighbourhoods around a model. Using trimming methods and the Kolmogorov metric we introduce a functional statistic measuring departures from a contaminated model and the associated estimator corresponding to its sample version. We show how this estimator allows testing of fit for the (slightly) contaminated model vs sensible deviations from it, with uniformly exponentially small type I and type II error probabilities. We also address the asymptotic behavior of the estimator showing that, under suitable regularity conditions, it asymptotically behaves as the supremum of a Gaussian process. As an application we explore methods of comparison between descriptive models based on the paradigm of model falseness. We also include some connections of our approach with the False-Discovery-Rate setting, showing competitive behavior when estimating the contamination level, although applicable in a wider framework.
Submitted 20 March, 2019;
originally announced March 2019.
-
Box-constrained monotone $L_\infty$-approximations to Lipschitz regularizations, with applications to robust testing
Authors:
Eustasio del Barrio,
Hristo Inouzhe,
Carlos Matrán
Abstract:
Tests of fit to exact models in statistical analysis often lead to rejections even when the model is a useful approximate description of the random generator of the data. Among possible relaxations of a fixed model, the one defined by contamination neighbourhoods, namely, $\mathcal{V}_α(P_0)=\{(1-α)P_0+αQ: Q \in \mathcal{P}\}$, where $\mathcal{P}$ is the set of all probabilities in the sample space, has received much attention, owing to its central role in Robust Statistics. For probabilities on the real line, consistent tests of fit to $\mathcal{V}_α(P_0)$ can be based on $d_K(P_0,R_α(P))$, the minimal Kolmogorov distance between $P_0$ and the set of trimmings of $P$, $R_α(P)=\big\{\tilde P\in\mathcal{P}:\tilde P\ll P,\,{\textstyle \frac{d\tilde P}{dP}\leq\frac{1}{1-α}}\, P\text{-a.s.}\big\}$. We show that this functional admits equivalent formulations either as the best approximation in uniform norm by $L$-Lipschitz functions satisfying a box constraint, or as the best monotone approximation in uniform norm to the $L$-Lipschitz regularization, which is seen to be expressible in terms of the average of the Pasch-Hausdorff envelopes. This representation for the solution of the variational problem allows us to obtain results showing stability of the functional $d_K(P_0,R_α(P))$, as well as directional differentiability, providing the basis for a Central Limit Theorem for that functional.
Submitted 13 November, 2019; v1 submitted 20 March, 2019;
originally announced March 2019.
-
A Central Limit Theorem for $L_p$ transportation cost with applications to Fairness Assessment in Machine Learning
Authors:
Eustasio del Barrio,
Paula Gordaliza,
Jean-Michel Loubes
Abstract:
We provide a Central Limit Theorem for the Monge-Kantorovich distance between two empirical distributions with sizes $n$ and $m$, $W_p(P_n,Q_m)$ for $p>1$, for observations on the real line, using minimal assumptions. We provide an estimate of the asymptotic variance which enables us to build a two-sample test to assess the similarity between two distributions. This test is then used to provide a new criterion to assess the notion of fairness of a classification algorithm.
Submitted 18 July, 2018;
originally announced July 2018.
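As an illustration of the quantity $W_p(P_n,Q_m)$ named in the abstract above, the following is a minimal sketch of how the Monge-Kantorovich distance between two empirical distributions on the real line can be computed from the quantile representation $W_p^p=\int_0^1 |F_n^{-1}(t)-G_m^{-1}(t)|^p\,dt$. The function name `wasserstein_1d` is hypothetical and not taken from the paper; the sketch covers only the distance itself, not the asymptotic variance estimate or the test.

```python
def wasserstein_1d(xs, ys, p=2.0):
    """p-Wasserstein distance between the empirical measures of xs and ys.

    Illustrative helper (name and signature are assumptions). It integrates
    |F_n^{-1}(t) - G_m^{-1}(t)|^p over t in (0, 1), where both empirical
    quantile functions are step functions.
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    i = j = 0
    t = 0          # current position on (0, 1), in units of 1 / (n * m)
    total = 0.0
    while i < n and j < m:
        next_x = (i + 1) * m   # next jump of the xs quantile function
        next_y = (j + 1) * n   # next jump of the ys quantile function
        nxt = min(next_x, next_y)
        total += (nxt - t) * abs(xs[i] - ys[j]) ** p
        t = nxt
        if next_x == nxt:
            i += 1
        if next_y == nxt:
            j += 1
    return (total / (n * m)) ** (1.0 / p)


if __name__ == "__main__":
    # Two samples that differ by a shift of one unit.
    print(wasserstein_1d([0.0, 1.0], [1.0, 2.0], p=2))
```

Integer bookkeeping on a grid of step $1/(nm)$ avoids floating-point comparisons between the breakpoints $i/n$ and $j/m$, so samples of unequal size are handled exactly.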
-
Confidence Intervals for Testing Disparate Impact in Fair Learning
Authors:
Philippe Besse,
Eustasio del Barrio,
Paula Gordaliza,
Jean-Michel Loubes
Abstract:
We provide the asymptotic distribution of the major indexes used in the statistical literature to quantify disparate treatment in machine learning. We aim at promoting the use of confidence intervals when testing the so-called group disparate impact. We illustrate on some examples the importance of using confidence intervals and not a single value.
Submitted 17 July, 2018;
originally announced July 2018.
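To make the point about reporting an interval instead of a single value concrete, here is a hedged sketch for one common index, the disparate impact ratio, taken here to be $P(g(X)=1\mid S=0)/P(g(X)=1\mid S=1)$ for a binary prediction $g(X)$ and a binary protected attribute $S$. That definition, the function names, and the use of a percentile bootstrap are all assumptions made for illustration: the paper derives asymptotic distributions, which this sketch does not reproduce.

```python
import random


def disparate_impact(pred, s):
    """Ratio of positive-prediction rates, group s == 0 over group s == 1."""
    n0 = sum(1 for a in s if a == 0)
    n1 = sum(1 for a in s if a == 1)
    pos0 = sum(1 for g, a in zip(pred, s) if a == 0 and g == 1)
    pos1 = sum(1 for g, a in zip(pred, s) if a == 1 and g == 1)
    if n0 == 0 or n1 == 0 or pos1 == 0:
        raise ValueError("ratio undefined for this sample")
    return (pos0 / n0) / (pos1 / n1)


def bootstrap_interval(pred, s, level=0.95, reps=2000, seed=0):
    """Percentile bootstrap interval (a swapped-in technique, not the paper's)."""
    rng = random.Random(seed)
    n = len(pred)
    stats = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            stats.append(disparate_impact([pred[i] for i in idx],
                                          [s[i] for i in idx]))
        except ValueError:
            continue  # skip resamples where the ratio is undefined
    if not stats:
        raise ValueError("no valid bootstrap resamples")
    stats.sort()
    k = len(stats)
    lo = stats[int((1 - level) / 2 * (k - 1))]
    hi = stats[int((1 + level) / 2 * (k - 1))]
    return lo, hi


if __name__ == "__main__":
    pred = [1, 0, 0, 0, 1, 1, 0, 0]
    s = [0, 0, 0, 0, 1, 1, 1, 1]
    print(disparate_impact(pred, s), bootstrap_interval(pred, s))
```

On small samples the interval is typically wide, which is exactly the information a single point estimate hides.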
-
Obtaining fairness using optimal transport theory
Authors:
Eustasio del Barrio,
Fabrice Gamboa,
Paula Gordaliza,
Jean-Michel Loubes
Abstract:
Statistical algorithms are usually helping in making decisions in many aspects of our lives. But, how do we know if these algorithms are biased and commit unfair discrimination of a particular group of people, typically a minority? Fairness is generally studied in a probabilistic framework where it is assumed that there exists a protected variable, whose use as an input of the algorithm may imply discrimination. There are different definitions of Fairness in the literature. In this paper we focus on two of them which are called Disparate Impact (DI) and Balanced Error Rate (BER). Both are based on the outcome of the algorithm across the different groups determined by the protected variable. The relationship between these two notions is also studied. The goals of this paper are to detect when a binary classification rule lacks fairness and to try to fight against the potential discrimination attributable to it. This can be done by modifying either the classifiers or the data itself. Our work falls into the second category and modifies the input data using optimal transport theory.
Submitted 18 July, 2018; v1 submitted 8 June, 2018;
originally announced June 2018.
-
Center-Outward Distribution Functions, Quantiles, Ranks, and Signs in $\mathbb{R}^d$
Authors:
Eustasio del Barrio,
Juan A. Cuesta-Albertos,
Marc Hallin,
Carlos Matrán
Abstract:
Univariate concepts such as quantile and distribution functions, involving ranks and signs, do not canonically extend to $\mathbb{R}^d, d\geq 2$. Palliating that has generated an abundant literature. Chapter 1 shows that, unlike the many definitions that have been proposed so far, the measure transportation-based ones introduced in Chernozhukov et al. (2017) enjoy all the properties that make univariate quantiles and ranks successful tools for semiparametric statistical inference.
We therefore propose a new center-outward definition of multivariate distribution and quantile functions, along with their empirical counterparts, for which we obtain a Glivenko-Cantelli result. Our approach is geometric and, contrary to the Monge-Kantorovich one in Chernozhukov et al. (2017), does not require any moment assumptions. The resulting ranks and signs are strictly distribution-free, and maximal invariant under the action of a data-driven class of (order-preserving) transformations generating the family of absolutely continuous distributions; that property is the theoretical foundation of the semiparametric efficiency preservation property of ranks. The corresponding quantiles are equivariant under the same transformations.
The proposed empirical distribution functions are defined at observed values only. A continuous extension to the entire $\mathbb{R}^d$, yielding continuous empirical quantile contours while preserving the monotonicity and Glivenko-Cantelli features, is desirable. Such an extension requires solving a nontrivial problem of smooth interpolation under cyclical monotonicity constraints. A complete solution of that problem is given in Chapter 2; we show that the resulting distribution and quantile functions are Lipschitz, and provide a sharp lower bound for the Lipschitz constants. A numerical study of empirical center-outward quantile contours and their consistency is conducted.
Submitted 27 February, 2020; v1 submitted 4 June, 2018;
originally announced June 2018.
-
Invariant measures of disagreement with stochastic dominance
Authors:
E. del Barrio,
J. A. Cuesta-Albertos,
C. Matran
Abstract:
An essential feature of stochastic order is its invariance against increasing maps. In this paper, we analyze a family of invariant indices of disagreement with respect to stochastic dominance. The indices in this family admit the representation $θ(F,G)=P(X>Y)$, where $(X,Y)$ is a random vector with marginal distribution functions $F$ and $G$. This includes the case of independent marginals, but also other interesting indices related to a contamination model or to a joint quantile representation. For some choices of $θ$ the condition $θ(F,G)=0$ is equivalent to stochastic dominance of $G$ over $F$. We show that the index associated to the contamination model achieves the minimal value within this family. The plug-in sample-based versions of these indices lead to the Mann-Whitney, the one-sided Kolmogorov-Smirnov, and the Galton statistics. For some of the most interesting indices this fact provides sufficient theoretical support for asymptotic inference. However, this is not the case for Galton's statistic, for which we provide additional theory for its resampling behaviour. We stress the complementary roles of some of these indices, which, beyond measuring disagreement with respect to stochastic order, allow us to describe the maximum possible difference in status of a value $x\in \mathbb{R}$ under $F$ or $G$. We apply these indices to some real data sets.
Submitted 25 March, 2022; v1 submitted 9 April, 2018;
originally announced April 2018.
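For the independent-marginals case of the representation $θ(F,G)=P(X>Y)$ in the abstract above, the plug-in version is the Mann-Whitney type proportion of sample pairs with $x_i>y_j$. The sketch below computes that proportion; the function name `prob_x_greater_y` is hypothetical, ties are counted as not greater, and the other indices mentioned in the abstract (contamination model, joint quantile representation) are not covered.

```python
def prob_x_greater_y(xs, ys):
    """Plug-in estimate of P(X > Y) for independent X ~ F and Y ~ G.

    Illustrative helper (name is an assumption): the fraction of all
    (x, y) pairs, x from xs and y from ys, with x strictly greater than y.
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    return greater / (len(xs) * len(ys))


if __name__ == "__main__":
    # Values near 0 suggest that the second sample tends to be larger.
    print(prob_x_greater_y([1.0, 2.0, 3.0], [0.0, 2.5]))
```

The double loop costs $O(nm)$ comparisons; for large samples a rank-based computation would be the usual replacement.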
-
An optimal transportation approach for assessing almost stochastic order
Authors:
E. del Barrio,
J. A. Cuesta-Albertos,
C. Matrán
Abstract:
When stochastic dominance $F\leq_{st}G$ does not hold, we can improve agreement to stochastic order by suitably trimming both distributions. In this work we consider the $L_2-$Wasserstein distance, $\mathcal W_2$, to stochastic order of these trimmed versions. Our characterization for that distance naturally leads us to consider a $\mathcal W_2$-based index of disagreement with stochastic order, $\varepsilon_{\mathcal W_2}(F,G)$. We provide asymptotic results that allow testing $H_0: \varepsilon_{\mathcal W_2}(F,G)\geq \varepsilon_0$ vs $H_a: \varepsilon_{\mathcal W_2}(F,G)<\varepsilon_0$, which, under rejection, would give a statistical guarantee of almost stochastic dominance. We include a simulation study showing a good performance of the index under the normal model.
Submitted 4 May, 2017;
originally announced May 2017.
-
Central Limit Theorem for empirical transportation cost in general dimension
Authors:
Eustasio Del Barrio,
Jean-Michel Loubes
Abstract:
We consider the problem of optimal transportation with quadratic cost between an empirical measure and a general target probability on $\mathbb{R}^d$, with $d \ge 1$. We provide new results on the uniqueness and stability of the associated optimal transportation potentials, namely, the minimizers in the dual formulation of the optimal transportation problem. As a consequence, we show that a CLT holds for the empirical transportation cost under mild moment and smoothness requirements. The limiting distributions are Gaussian and admit a simple description in terms of the optimal transportation potentials.
Submitted 9 March, 2018; v1 submitted 3 May, 2017;
originally announced May 2017.
-
A data driven trimming procedure for robust classification
Authors:
Marina Antolín,
Eustasio Del Barrio,
Jean-Michel Loubes
Abstract:
Classification rules can be severely affected by the presence of disturbing observations in the training sample. Looking for an optimal classifier with such data may lead to unnecessarily complex rules. So, simpler effective classification rules could be achieved if we relax the goal of fitting a good rule for the whole training sample but only consider a fraction of the data. In this paper we introduce a new method based on trimming to produce classification rules with guaranteed performance on a significant fraction of the data. In particular, we provide an automatic way of determining the right trimming proportion and obtain in this setting oracle bounds for the classification error on the new data set.
Submitted 18 January, 2017;
originally announced January 2017.
-
Models for the assessment of treatment improvement: the ideal and the feasible
Authors:
P. C. Álvarez-Esteban,
E. del Barrio,
J. A. Cuesta-Albertos,
C. Matrán
Abstract:
Comparisons of different treatments or production processes are the goals of a significant fraction of applied research. Unsurprisingly, two-sample problems play a main role in Statistics through natural questions such as 'Is the new treatment significantly better than the old?'. However, this is only partially answered by some of the usual statistical tools for this task. More importantly, often practitioners are not aware of the real meaning behind these statistical procedures. We analyze these troubles from the point of view of the order between distributions, the stochastic order, showing evidence of the limitations of the usual approaches, paying special attention to the classical comparison of means under the normal model. We discuss the unfeasibility of statistically proving stochastic dominance, but show that it is possible, instead, to gather statistical evidence to conclude that slightly relaxed versions of stochastic dominance hold.
Submitted 18 April, 2017; v1 submitted 5 December, 2016;
originally announced December 2016.
-
Central limit theorem and bootstrap procedure for Wasserstein's variations with an application to structural relationships between distributions
Authors:
Eustasio Del Barrio,
Paula Gordaliza,
Hélène Lescornel,
Jean-Michel Loubes
Abstract:
Wasserstein barycenters and variance-like criteria based on the Wasserstein distance are used in many problems to analyze the homogeneity of collections of distributions and structural relationships between the observations. We propose the estimation of the quantiles of the empirical process of Wasserstein's variation using a bootstrap procedure. We then use these results for statistical inference on a distribution registration model for general deformation functions. The tests are based on the variance of the distributions with respect to their Wasserstein's barycenters for which we prove central limit theorems, including bootstrap versions.
Submitted 30 October, 2018; v1 submitted 14 November, 2016;
originally announced November 2016.
-
Robust clustering tools based on optimal transportation
Authors:
E. del Barrio,
J. A. Cuesta-Albertos,
C. Matrán,
A. Mayo-Íscar
Abstract:
A robust clustering method for probabilities in Wasserstein space is introduced. This new "trimmed $k$-barycenters" approach relies on recent results on barycenters in Wasserstein space that allow intensive computation, as required by clustering algorithms. The possibility of trimming the most discrepant distributions results in a gain in stability and robustness, highly convenient in this setting. As a remarkable application we consider a parallelized estimation setup in which each of $m$ units processes a portion of the data, producing an estimate of $k$-features, encoded as $k$ probabilities. We prove that the trimmed $k$-barycenter of the $m\times k$ estimates produces a consistent aggregation. We illustrate the methodology with simulated and real data examples. These include clustering populations by age distributions and analysis of cytometric data.
Submitted 23 November, 2016; v1 submitted 5 July, 2016;
originally announced July 2016.
-
Berry-Esseen bounds for weighted averages of Poisson avoidance functionals
Authors:
Eustasio del Barrio
Abstract:
We consider functionals which are weighted averages of the avoidance function of a Poisson process. Using the approach to Stein's method based on Malliavin calculus for Poisson functionals we provide explicit bounds for the Wasserstein distance between these standardized functionals and the standard normal distribution. Our approach relies on closed-form expressions for the action of some Malliavin type operators on avoidance functionals of Poisson processes. As a result we provide Berry-Esseen bounds in the CLT for the volume of the union of balls of a fixed radius around random Poisson centers or for the quantization error around points of a Poisson process. We also give Berry-Esseen bounds for avoidance functionals of empirical measures.
Submitted 14 December, 2015;
originally announced December 2015.
-
A fixed-point approach to barycenters in Wasserstein space
Authors:
Pedro C. Álvarez-Esteban,
E. del Barrio,
J. A. Cuesta-Albertos,
C. Matrán
Abstract:
Let $\mathcal{P}_{2,ac}$ be the set of Borel probabilities on $\mathbb{R}^d$ with finite second moment and absolutely continuous with respect to Lebesgue measure. We consider the problem of finding the barycenter (or Fréchet mean) of a finite set of probabilities $ν_1,\ldots,ν_k \in \mathcal{P}_{2,ac}$ with respect to the $L_2-$Wasserstein metric. For this task we introduce an operator on $\mathcal{P}_{2,ac}$ related to the optimal transport maps pushing forward any $μ\in \mathcal{P}_{2,ac}$ to $ν_1,\ldots,ν_k$. Under very general conditions we prove that the barycenter must be a fixed point for this operator and introduce an iterative procedure which consistently approximates the barycenter. The procedure allows effective computation of barycenters in any location-scatter family, including the Gaussian case. In such cases the barycenter must belong to the family, thus it is characterized by its mean and covariance matrix. While its mean is just the weighted mean of the means of the probabilities, the covariance matrix is characterized in terms of their covariance matrices $Σ_1,\dots,Σ_k$ through a nonlinear matrix equation. The performance of the iterative procedure in this case is illustrated through numerical simulations, which show fast convergence towards the barycenter.
Submitted 22 April, 2016; v1 submitted 17 November, 2015;
originally announced November 2015.
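For the Gaussian case of "A fixed-point approach to barycenters in Wasserstein space" above, the nonlinear matrix equation mentioned in the abstract is commonly written as follows. This is a sketch in notation that does not appear in the abstract: the weights $λ_i \geq 0$ with $\sum_i λ_i = 1$ and the symbol $S$ for the barycenter covariance are assumptions, and the paper should be consulted for the exact statement and its conditions.

$$S = \sum_{i=1}^{k} λ_i \left(S^{1/2}\, Σ_i\, S^{1/2}\right)^{1/2}$$

A fixed-point iteration consistent with this equation, started from a positive definite $S_0$, takes the form

$$S_{n+1} = S_n^{-1/2} \left( \sum_{i=1}^{k} λ_i \left(S_n^{1/2}\, Σ_i\, S_n^{1/2}\right)^{1/2} \right)^{2} S_n^{-1/2}.$$

In dimension one, with $Σ_i = σ_i^2$, the equation reduces to $\sqrt{S} = \sum_i λ_i σ_i$, so the barycenter standard deviation is the weighted mean of the standard deviations and the iteration reaches it in a single step.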
-
Wide Consensus for Parallelized Inference
Authors:
P. C. Álvarez-Esteban,
E. del Barrio,
J. A. Cuesta-Albertos,
C. Matrán
Abstract:
We develop a general theory to address a consensus-based combination of estimations in a parallelized or distributed estimation setting. Taking into account the possibility of very discrepant estimations, instead of a full consensus we consider a "wide consensus" procedure. The approach is based on the consideration of trimmed barycenters in the Wasserstein space of probability distributions on $\mathbb{R}^d$ with finite second order moments. We include general existence and consistency results as well as characterizations of barycenters of probabilities that belong to (not necessarily elliptical) location and scatter families. On these families, the effective computation of barycenters and distances can be addressed through a consistent iterative algorithm. Since, once a shape has been chosen, these computations just depend on the locations and scatters, the theory can be applied to cover with great generality a wide consensus approach for location and scatter estimation or for obtaining confidence regions.
Submitted 11 May, 2017; v1 submitted 17 November, 2015;
originally announced November 2015.
-
A statistical analysis of a deformation model with Wasserstein barycenters: estimation procedure and goodness of fit test
Authors:
Eustasio Del Barrio,
Hélène Lescornel,
Jean-Michel Loubes
Abstract:
We propose a study of a distribution registration model for general deformation functions. In this framework, we provide estimators of the deformations as well as a goodness of fit test of the model. For this, we consider a criterion based on the Fréchet mean (or barycenter) of the warped distributions, whose study enables inference on the model. In particular we obtain the asymptotic distribution and a bootstrap procedure for the Wasserstein variation.
Submitted 1 October, 2015; v1 submitted 26 August, 2015;
originally announced August 2015.
-
A contamination model for approximate stochastic order: extended version
Authors:
Pedro C. Alvarez-Esteban,
Eustasio del Barrio,
Juan A. Cuesta-Albertos,
Carlos Matran
Abstract:
Stochastic ordering among distributions has been considered in a variety of scenarios. Economic studies often involve research about the ordering of investment strategies or social welfare. However, as noted in the literature, stochastic ordering is often too strong an assumption, one which is not supported by the data even in cases in which the researcher tends to believe that a certain variable is somehow smaller than another. Instead of considering this rigid model of stochastic order we propose to look at a more flexible version in which two distributions are said to satisfy an approximate stochastic order relation if they are slightly contaminated versions of distributions which do satisfy the stochastic ordering. The minimal level of contamination that makes this approximate model hold can be used as a measure of the deviation of the original distributions from the exact stochastic order model. Our approach is based on the use of trimmings of probability measures. We discuss the connection between them and the approximate stochastic order model and provide theoretical support for its use in data analysis. We also provide simulation results.
Submitted 5 December, 2014;
originally announced December 2014.
-
The empirical cost of optimal incomplete transportation
Authors:
Eustasio del Barrio,
Carlos Matrán
Abstract:
We consider the problem of optimal incomplete transportation between the empirical measure on an i.i.d. uniform sample on the d-dimensional unit cube $[0,1]^d$ and the true measure. This is a family of problems lying in between classical optimal transportation and nearest neighbor problems. We show that the empirical cost of optimal incomplete transportation vanishes at rate $O_P(n^{-1/d})$, where n denotes the sample size. In dimension $d\geq3$ the rate is the same as in classical optimal transportation, but in low dimension it is (much) higher than the classical rate.
Submitted 3 October, 2013;
originally announced October 2013.
-
Similarity of samples and trimming
Authors:
Pedro C. Álvarez-Esteban,
Eustasio del Barrio,
Juan A. Cuesta-Albertos,
Carlos Matrán
Abstract:
We say that two probabilities are similar at level $α$ if they are contaminated versions (up to an $α$ fraction) of the same common probability. We show how this model is related to minimal distances between sets of trimmed probabilities. Empirical versions turn out to present an overfitting effect in the sense that trimming beyond the similarity level results in trimmed samples that are closer than expected to each other. We show how this can be combined with a bootstrap approach to assess similarity from two data samples.
Submitted 9 May, 2012;
originally announced May 2012.