-
A Distributed Approach for Persistent Homology Computation on a Large Scale
Authors:
Riccardo Ceccaroni,
Lorenzo Di Rocco,
Umberto Ferraro Petrillo,
Pierpaolo Brutti
Abstract:
Persistent homology (PH) is a powerful mathematical method to automatically extract relevant insights from images, such as those obtained by high-resolution imaging devices like electron microscopes or new-generation telescopes. However, applying this method comes at a very high computational cost, which is bound to grow further as new imaging devices generate an ever-growing amount of data. In this paper we present PixHomology, a novel algorithm for efficiently computing $0$-dimensional PH on 2D images, optimizing memory and processing time. By leveraging the Apache Spark framework, we also present a distributed version of our algorithm with several optimized variants, able to concurrently process large batches of astronomical images. Finally, we present the results of an experimental analysis showing that our algorithm and its distributed version are efficient in terms of required memory, execution time, and scalability, consistently outperforming existing state-of-the-art PH computation tools when used to process large datasets.
Submitted 12 April, 2024;
originally announced April 2024.
-
Energy Trees: Regression and Classification With Structured and Mixed-Type Covariates
Authors:
Riccardo Giubilei,
Tullia Padellini,
Pierpaolo Brutti
Abstract:
The increasing complexity of data requires methods and models that can effectively handle intricate structures, as simplifying them would result in loss of information. While several analytical tools have been developed to work with complex data objects in their original form, these tools are typically limited to single-type variables. In this work, we propose energy trees as a regression and classification model capable of accommodating structured covariates of various types. Energy trees leverage energy statistics to extend the capabilities of conditional inference trees, from which they inherit sound statistical foundations, interpretability, scale invariance, and freedom from distributional assumptions. We specifically focus on functional and graph-structured covariates, while also highlighting the model's flexibility in integrating other variable types. Extensive simulation studies demonstrate the model's competitive performance in terms of variable selection and robustness to overfitting. Finally, we assess the model's predictive ability through two empirical analyses involving human biological data. Energy trees are implemented in the R package etree.
Submitted 15 June, 2023; v1 submitted 10 July, 2022;
originally announced July 2022.
-
Reprogramming FairGANs with Variational Auto-Encoders: A New Transfer Learning Model
Authors:
Beatrice Nobile,
Gabriele Santin,
Bruno Lepri,
Pierpaolo Brutti
Abstract:
Fairness-aware GANs (FairGANs) exploit the mechanisms of Generative Adversarial Networks (GANs) to impose fairness on the generated data, freeing them from both disparate impact and disparate treatment. Given the model's advantages and performance, we introduce a novel learning framework to transfer a pre-trained FairGAN to other tasks. This reprogramming process has the goal of maintaining the FairGAN's main targets of data utility, classification utility, and data fairness, while widening its applicability and ease of use. In this paper we present the technical extensions required to adapt the original architecture to this new framework (and in particular the use of Variational Auto-Encoders), and discuss the benefits, trade-offs, and limitations of the new model.
Submitted 11 March, 2022;
originally announced March 2022.
-
Towards global monitoring: equating the Food Insecurity Experience Scale (FIES) and food insecurity scales in Latin America
Authors:
Federica Onori,
Sara Viviani,
Pierpaolo Brutti
Abstract:
In order to face food insecurity as a global phenomenon, it is essential to rely on measurement tools that guarantee comparability across countries. Although the official indicators adopted by the United Nations in the context of the Sustainable Development Goals (SDGs) and based on the Food Insecurity Experience Scale (FIES) already embed cross-country comparability, other experiential scales of food insecurity currently employ national thresholds, and issues of comparability thus arise. In this work we address comparability of food insecurity experience-based scales by presenting two different studies. The first one involves the FIES and three national scales (ELCSA, EMSA and EBIA) currently included in national surveys in Guatemala, Ecuador, Mexico and Brazil. The second study concerns the adult and children versions of these national scales. Different methods from the equating practice of the educational testing field are explored, both classical and based on Item Response Theory (IRT).
Submitted 19 February, 2021;
originally announced February 2021.
-
Supervised Learning with Indefinite Topological Kernels
Authors:
Tullia Padellini,
Pierpaolo Brutti
Abstract:
Topological Data Analysis (TDA) is a recent and growing branch of statistics devoted to the study of the shape of the data. In this work we investigate the predictive power of TDA in the context of supervised learning. Since topological summaries, most notably the Persistence Diagram, are typically defined in complex spaces, we adopt a kernel approach to translate them into more familiar vector spaces. We define a topological exponential kernel, we characterize it, and we show that, despite not being positive semi-definite, it can be successfully used in regression and classification tasks.
Submitted 20 September, 2017;
originally announced September 2017.
-
Persistence Flamelets: multiscale Persistent Homology for kernel density exploration
Authors:
Tullia Padellini,
Pierpaolo Brutti
Abstract:
In recent years there has been noticeable interest in the study of the "shape of data". Among the many ways a "shape" could be defined, topology is the most general one, as it describes an object in terms of its connectivity structure: connected components (topological features of dimension 0), cycles (features of dimension 1) and so on. There is a growing number of techniques, generally denoted as Topological Data Analysis, aimed at estimating topological invariants of a fixed object; when we allow this object to change, however, little has been done to investigate the evolution of its topology. In this work we define the Persistence Flamelets, a multiscale version of one of the most popular tools in TDA, the Persistence Landscape. We examine its theoretical properties and we show how it could be used to gain insight into the KDE bandwidth parameter.
Submitted 20 September, 2017;
originally announced September 2017.
-
A note on an Adaptive Goodness-of-Fit test with Finite Sample Validity for Random Design Regression Models
Authors:
Pierpaolo Brutti
Abstract:
Given an i.i.d. sample $\{(X_i,Y_i)\}_{i \in \{1 \ldots n\}}$ from the random design regression model $Y = f(X) + ε$ with $(X,Y) \in [0,1] \times [-M,M]$, in this paper we consider the problem of testing the (simple) null hypothesis $f = f_0$, against the alternative $f \neq f_0$ for a fixed $f_0 \in L^2([0,1],G_X)$, where $G_X(\cdot)$ denotes the marginal distribution of the design variable $X$. The procedure proposed is an adaptation to the regression setting of a multiple testing technique introduced by Fromont and Laurent (2005), and it amounts to considering a suitable collection of unbiased estimators of the $L^2$--distance $d_2(f,f_0) = \int {[f(x) - f_0 (x)]^2 d\,G_X (x)}$, rejecting the null hypothesis when at least one of them is greater than its $(1-u_α)$ quantile, with $u_α$ calibrated to obtain a level--$α$ test. To build these estimators, we will use the warped wavelet basis introduced by Picard and Kerkyacharian (2004). We do not assume that the errors are normally distributed, and we do not assume that $X$ and $ε$ are independent but, mainly for technical reasons, we will assume, as in most of the current literature in learning theory, that $|f(x) - y|$ is uniformly bounded (almost everywhere). We show that our test is adaptive over a particular collection of approximation spaces linked to the classical Besov spaces.
Submitted 18 February, 2015;
originally announced February 2015.
-
Warped Wavelet and Vertical Thresholding
Authors:
Pierpaolo Brutti
Abstract:
Let $\{(X_i,Y_i)\}_{i\in \{1,..., n\}}$ be an i.i.d. sample from the random design regression model $Y=f(X)+ε$ with $(X,Y)\in [0,1]\times [-M,M]$. In dealing with such a model, adaptation is naturally understood in terms of the $L^2([0,1],G_X)$ norm, where $G_X(\cdot)$ denotes the (known) marginal distribution of the design variable $X$. Recently much work has been devoted to the construction of estimators that adapt in this setting (see, for example, [5,24,25,32]), but only a few of them come with an easy-to-implement computational scheme. Here we propose a family of estimators based on the warped wavelet basis recently introduced by Picard and Kerkyacharian [36] and a tree-like thresholding rule that takes into account the hierarchical (across-scale) structure of the wavelet coefficients. We show that, if the regression function belongs to a certain class of approximation spaces defined in terms of $G_X(\cdot)$, then our procedure is adaptive and converges to the true regression function at an optimal rate. The results are stated in terms of excess probabilities as in [19].
Submitted 22 January, 2008;
originally announced January 2008.