
A finite-infinite shared atoms nested model for the Bayesian analysis of large grouped data

Authors: Laura D'Angelo, Francesco Denti

Abstract: The use of hierarchical mixture priors with shared atoms has recently flourished in the Bayesian literature for partially exchangeable data. Leveraging on nested levels of mixtures, these models allow the estimation of a two-layered data partition: across groups and across observations. This paper discusses and compares the properties of such modeling strategies when the mixing weights are assigned either a finite-dimensional Dirichlet distribution or a Dirichlet process prior. Based on these considerations, we introduce a novel hierarchical nonparametric prior based on a finite set of shared atoms, a specification that enhances the flexibility of the induced random measures and the availability of fast posterior inference. To support these findings, we analytically derive the induced prior correlation structure and partially exchangeable partition probability function. Additionally, we develop a novel mean-field variational algorithm for posterior inference to boost the applicability of our nested model to large multivariate data. We then assess and compare the performance of the different shared-atom specifications via simulation. We also show that our variational proposal is highly scalable and that the accuracy of the posterior density estimate and the estimated partition is comparable with state-of-the-art Gibbs sampler algorithms. Finally, we apply our model to a real dataset of Spotify's song features, simultaneously segmenting artists and songs with similar characteristics.

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2212.01865 [pdf, other]

Variational Inference for Semiparametric Bayesian Novelty Detection in Large Datasets

Authors: Luca Benedetti, Eric Boniardi, Leonardo Chiani, Jacopo Ghirri, Marta Mastropietro, Andrea Cappozzo, Francesco Denti

Abstract: After being trained on a fully-labeled training set, where the observations are grouped into a certain number of known classes, novelty detection methods aim to classify the instances of an unlabeled test set while allowing for the presence of previously unseen classes. These models are valuable in many areas, ranging from social network and food adulteration analyses to biology, where an evolving population may be present. In this paper, we focus on a two-stage Bayesian semiparametric novelty detector, also known as Brand, recently introduced in the literature. Leveraging on a model-based mixture representation, Brand allows clustering the test observations into known training terms or a single novelty term. Furthermore, the novelty term is modeled with a Dirichlet Process mixture model to flexibly capture any departure from the known patterns. Brand was originally estimated using MCMC schemes, which are prohibitively costly when applied to high-dimensional data. To scale up Brand applicability to large datasets, we propose to resort to a variational Bayes approach, providing an efficient algorithm for posterior approximation. We demonstrate a significant gain in efficiency and excellent classification performance with thorough simulation studies. Finally, to showcase its applicability, we perform a novelty detection analysis using the openly-available Statlog dataset, a large collection of satellite imaging spectra, to search for novel soil types.

Submitted 4 December, 2022; originally announced December 2022.

arXiv:2205.00930 [pdf, other]

Multiple hypothesis screening using mixtures of non-local distributions with applications to genomic studies

Authors: Francesco Denti, Stefano Peluso, Michele Guindani, Antonietta Mira

Abstract: The analysis of large-scale datasets, especially in biomedical contexts, frequently involves a principled screening of multiple hypotheses. The celebrated two-group model jointly models the distribution of the test statistics with mixtures of two competing densities, the null and the alternative distributions. We investigate the use of weighted densities and, in particular, non-local densities as working alternative distributions, to enforce separation from the null and thus refine the screening procedure. We show how these weighted alternatives improve various operating characteristics, such as the Bayesian False Discovery rate, of the resulting tests for a fixed mixture proportion with respect to a local, unweighted likelihood approach. Parametric and nonparametric model specifications are proposed, along with efficient samplers for posterior inference. By means of a simulation study, we exhibit how our model compares with both well-established and state-of-the-art alternatives in terms of various operating characteristics. Finally, to illustrate the versatility of our method, we conduct three differential expression analyses with publicly-available datasets from genomic studies of heterogeneous nature.

Submitted 9 March, 2023; v1 submitted 2 May, 2022; originally announced May 2022.

arXiv:2203.04165 [pdf, other]

On the intrinsic dimensionality of Covid-19 data: a global perspective

Authors: Abhishek Varghese, Edgar Santos-Fernandez, Francesco Denti, Antonietta Mira, Kerrie Mengersen

Abstract: This paper aims to develop a global perspective of the complexity of the relationship between the standardised per-capita growth rate of Covid-19 cases, deaths, and the OxCGRT Covid-19 Stringency Index, a measure describing a country's stringency of lockdown policies. To achieve our goal, we use a heterogeneous intrinsic dimension estimator implemented as a Bayesian mixture model, called Hidalgo. We identify that the Covid-19 dataset may project onto two low-dimensional manifolds without significant information loss. The low dimensionality suggests strong dependency among the standardised growth rates of cases and deaths per capita and the OxCGRT Covid-19 Stringency Index for a country over 2020-2021. Given the low dimensional structure, it may be feasible to model observable Covid-19 dynamics with few parameters. Importantly, we identify spatial autocorrelation in the intrinsic dimension distribution worldwide. Moreover, we highlight that high-income countries are more likely to lie on low-dimensional manifolds, likely arising from aging populations, comorbidities, and increased per capita mortality burden from Covid-19. Finally, we temporally stratify the dataset to examine the intrinsic dimension at a more granular level throughout the Covid-19 pandemic.

Submitted 8 March, 2022; originally announced March 2022.

MSC Class: 62P10

arXiv:2106.08281 [pdf, other]

A Horseshoe mixture model for Bayesian screening with an application to light sheet fluorescence microscopy in brain imaging

Authors: Francesco Denti, Ricardo Azevedo, Chelsie Lo, Damian Wheeler, Sunil P. Gandhi, Michele Guindani, Babak Shahbaba

Abstract: In this paper, we focus on identifying differentially activated brain regions using a light sheet fluorescence microscopy - a recently developed technique for whole-brain imaging. Most existing statistical methods solve this problem by partitioning the brain regions into two classes: significantly and non-significantly activated. However, for the brain imaging problem at the center of our study, such binary grouping may provide overly simplistic discoveries by filtering out weak but important signals, that are typically adulterated by the noise present in the data. To overcome this limitation, we introduce a new Bayesian approach that allows classifying the brain regions into several tiers with varying degrees of relevance. Our approach is based on a combination of shrinkage priors - widely used in regression and multiple hypothesis testing problems - and mixture models - commonly used in model-based clustering. In contrast to the existing regularizing prior distributions, which use either the spike-and-slab prior or continuous scale mixtures, our class of priors is based on a discrete mixture of continuous scale mixtures and devises a cluster-shrinkage version of the Horseshoe prior. As a result, our approach provides a more general setting for Bayesian sparse estimation, drastically reduces the number of shrinkage parameters needed, and creates a framework for sharing information across units of interest. We show that this approach leads to more biologically meaningful and interpretable results in our brain imaging problem, since it allows the discrimination between active and inactive regions, while at the same time ranking the discoveries into clusters representing tiers of similar importance.

Submitted 27 January, 2023; v1 submitted 15 June, 2021; originally announced June 2021.

arXiv:2104.13832 [pdf, other]

Distributional Results for Model-Based Intrinsic Dimension Estimators

Authors: Francesco Denti, Diego Doimo, Alessandro Laio, Antonietta Mira

Abstract: Modern datasets are characterized by a large number of features that may conceal complex dependency structures. To deal with this type of data, dimensionality reduction techniques are essential. Numerous dimensionality reduction methods rely on the concept of intrinsic dimension, a measure of the complexity of the dataset. In this article, we first review the TWO-NN model, a likelihood-based intrinsic dimension estimator recently introduced in the literature. The TWO-NN estimator is based on the statistical properties of the ratio of the distances between a point and its first two nearest neighbors, assuming that the points are a realization from an homogeneous Poisson point process. We extend the TWO-NN theoretical framework by providing novel distributional results of consecutive and generic ratios of distances. These distributional results are then employed to derive intrinsic dimension estimators, called Cride and Gride. These novel estimators are more robust to noisy measurements than the TWO-NN and allow the study of the evolution of the intrinsic dimension as a function of the scale used to analyze the dataset. We discuss the properties of the different estimators with the help of simulation scenarios.

Submitted 1 June, 2021; v1 submitted 28 April, 2021; originally announced April 2021.
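
The abstract above describes the TWO-NN estimator only in words. A minimal sketch of the idea follows, under assumptions that are not stated in this listing: in the TWO-NN literature the ratio mu = r2/r1 of the second to the first nearest-neighbor distance is modeled as Pareto(1, d), which gives the maximum-likelihood estimate d = n / sum(log mu). The function name `twonn_mle` is hypothetical, and this is neither the authors' code nor the Cride/Gride extension.

```python
import numpy as np


def twonn_mle(points):
    """Sketch of a TWO-NN-style intrinsic dimension estimate (hypothetical helper).

    Assumes mu_i = r2_i / r1_i follows a Pareto(1, d) law, so the
    maximum-likelihood estimate of d is n / sum(log(mu_i)).
    """
    x = np.asarray(points, dtype=float)
    # Dense pairwise Euclidean distances; fine for a few thousand points.
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    # After partitioning, columns 0 and 1 hold the two smallest distances per row.
    nearest = np.partition(dist, 1, axis=1)[:, :2]
    r1 = nearest.min(axis=1)
    r2 = nearest.max(axis=1)
    mu = r2 / r1
    return len(mu) / np.sum(np.log(mu))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # A 2-dimensional sheet embedded in 5 ambient dimensions.
    sheet = np.zeros((500, 5))
    sheet[:, :2] = rng.uniform(size=(500, 2))
    print(twonn_mle(sheet))  # should be near 2, the dimension of the sheet
```

The dense distance matrix keeps the sketch short; for large datasets a nearest-neighbor index would be the more practical choice. Duplicate points (r1 = 0) are not handled here.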

arXiv:2102.11425 [pdf, other]

intRinsic: an R Package for Model-Based Estimation of the Intrinsic Dimension of a Dataset

Authors: Francesco Denti

Abstract: This article illustrates intRinsic, an R package that implements novel state-of-the-art likelihood-based estimators of the intrinsic dimension of a dataset, an essential quantity for most dimensionality reduction techniques. In order to make these novel estimators easily accessible, the package contains a small number of high-level functions that rely on a broader set of efficient, low-level routines. Generally speaking, intRinsic encompasses models that fall into two categories: homogeneous and heterogeneous intrinsic dimension estimators. The first category contains the two nearest neighbors estimator, a method derived from the distributional properties of the ratios of the distances between each data point and its first two closest neighbors. The functions dedicated to this method carry out inference under both the frequentist and Bayesian frameworks. In the second category, we find the heterogeneous intrinsic dimension algorithm, a Bayesian mixture model for which an efficient Gibbs sampler is implemented. After presenting the theoretical background, we demonstrate the performance of the models on simulated datasets. This way, we can facilitate the exposition by immediately assessing the validity of the results. Then, we employ the package to study the intrinsic dimension of the Alon dataset, obtained from a famous microarray experiment. Finally, we show how the estimation of homogeneous and heterogeneous intrinsic dimensions allows us to gain valuable insights into the topological structure of a dataset.

Submitted 23 February, 2023; v1 submitted 22 February, 2021; originally announced February 2021.

arXiv:2008.07077 [pdf, other]

A Common Atom Model for the Bayesian Nonparametric Analysis of Nested Data

Authors: Francesco Denti, Federico Camerlenghi, Michele Guindani, Antonietta Mira

Abstract: The use of high-dimensional data for targeted therapeutic interventions requires new ways to characterize the heterogeneity observed across subgroups of a specific population. In particular, models for partially exchangeable data are needed for inference on nested datasets, where the observations are assumed to be organized in different units and some sharing of information is required to learn distinctive features of the units. In this manuscript, we propose a nested Common Atoms Model (CAM) that is particularly suited for the analysis of nested datasets where the distributions of the units are expected to differ only over a small fraction of the observations sampled from each unit. The proposed CAM allows a two-layered clustering at the distributional and observational level and is amenable to scalable posterior inference through the use of a computationally efficient nested slice-sampler algorithm. We further discuss how to extend the proposed modeling framework to handle discrete measurements, and we conduct posterior inference on a real microbiome dataset from a diet swap study to investigate how the alterations in intestinal microbiota composition are associated with different eating habits. We further investigate the performance of our model in capturing true distributional structures in the population by means of a simulation study.

Submitted 17 August, 2020; originally announced August 2020.

arXiv:2006.09012 [pdf, other]

doi 10.1007/s11222-021-10017-7

A Two-Stage Bayesian Semiparametric Model for Novelty Detection with Robust Prior Information

Authors: Francesco Denti, Andrea Cappozzo, Francesca Greselin

Abstract: Novelty detection methods aim at partitioning the test units into already observed and previously unseen patterns. However, two significant issues arise: there may be considerable interest in identifying specific structures within the novelty, and contamination in the known classes could completely blur the actual separation between manifest and new groups. Motivated by these problems, we propose a two-stage Bayesian semiparametric novelty detector, building upon prior information robustly extracted from a set of complete learning units. We devise a general-purpose multivariate methodology that we also extend to handle functional data objects. We provide insights on the model behavior by investigating the theoretical properties of the associated semiparametric prior. From the computational point of view, we propose a suitable $\boldsymbol{\xi}$-sequence to construct an independent slice-efficient sampler that takes into account the difference between manifest and novelty components. We showcase our model performance through an extensive simulation study and applications on both multivariate and functional datasets, in which diverse and distinctive unknown patterns are discovered.

Submitted 17 June, 2021; v1 submitted 16 June, 2020; originally announced June 2020.

arXiv:2002.04148 [pdf, other]

The role of intrinsic dimension in high-resolution player tracking data -- Insights in basketball

Authors: Edgar Santos-Fernandez, Francesco Denti, Kerrie Mengersen, Antonietta Mira

Abstract: A new range of statistical analysis has emerged in sports after the introduction of the high-resolution player tracking technology, specifically in basketball. However, this high dimensional data is often challenging for statistical inference and decision making. In this article, we employ Hidalgo, a state-of-the-art Bayesian mixture model that allows the estimation of heterogeneous intrinsic dimensions (ID) within a dataset and propose some theoretical enhancements. ID results can be interpreted as indicators of variability and complexity of basketball plays and games. This technique allows classification and clustering of NBA basketball player's movement and shot charts data. Analyzing movement data, Hidalgo identifies key stages of offensive actions such as creating space for passing, preparation/shooting and following through. We found that the ID value spikes reaching a peak between 4 and 8 seconds in the offensive part of the court after which it declines. In shot charts, we obtained groups of shots that produce substantially higher and lower successes. Overall, game-winners tend to have a larger intrinsic dimension which is an indication of more unpredictability and unique shot placements. Similarly, we found higher ID values in plays when the score margin is small compared to large margin ones. These outcomes could be exploited by coaches to obtain better offensive/defensive results.

Submitted 10 February, 2020; originally announced February 2020.

Comments: 21 pages, 16 figures, Codes + data + results can be found in https://github.com/EdgarSantos-Fernandez/id_basketball, Submitted

arXiv:1902.10459 [pdf, other]

Data segmentation based on the local intrinsic dimension

Authors: Michele Allegra, Elena Facco, Francesco Denti, Alessandro Laio, Antonietta Mira

Abstract: One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies within the same data set. This fact has been highlighted in technical discussions, but seldom exploited to analyze large data sets and obtain insight into their structure. Here we develop a robust approach to discriminate regions with different local IDs and segment the points accordingly. Our approach is computationally efficient and can be proficiently used even on large data sets. We find that many real-world data sets contain regions with widely heterogeneous dimensions. These regions host points differing in core properties: folded vs unfolded configurations in a protein molecular dynamics trajectory, active vs non-active regions in brain imaging data, and firms with different financial risk in company balance sheets. A simple topological feature, the local ID, is thus sufficient to achieve an unsupervised segmentation of high-dimensional data, complementary to the one given by clustering algorithms.

Submitted 13 July, 2020; v1 submitted 27 February, 2019; originally announced February 2019.

Comments: 11 pages, 6 figures + 9 pages Supplementary Information

Showing 1–11 of 11 results for author: Denti, F