Search | arXiv e-print repository

Accurate and fast anomaly detection in industrial processes and IoT environments

Authors: Simone Tonini, Andrea Vandin, Francesca Chiaromonte, Daniele Licari, Fernando Barsacchi

Abstract: We present a novel, simple and widely applicable semi-supervised procedure for anomaly detection in industrial and IoT environments, SAnD (Simple Anomaly Detection). SAnD comprises 5 steps, each leveraging well-known statistical tools, namely: smoothing filters, variance inflation factors, the Mahalanobis distance, threshold selection algorithms and feature importance techniques. To our knowledge, SAnD is the first procedure that integrates these tools to identify anomalies and help decipher their putative causes. We show how each step contributes to tackling technical challenges that practitioners face when detecting anomalies in industrial contexts, where signals can be highly multicollinear, have unknown distributions, and intertwine short-lived noise with the long(er)-lived actual anomalies. The development of SAnD was motivated by a concrete case study from our industrial partner, which we use here to show its effectiveness. We also evaluate the performance of SAnD by comparing it with a selection of semi-supervised methods on public datasets from the literature on anomaly detection. We conclude that SAnD is effective, broadly applicable, and outperforms existing approaches in both anomaly detection and runtime.

Submitted 27 April, 2024; originally announced April 2024.

arXiv:2312.16346 [pdf, other]

An efficient approach to characterize spatio-temporal dependence in cortical surface fMRI data

Authors: Huy Dang, Marzia Cremona, Nicole Lazar, Francesca Chiaromonte

Abstract: Functional magnetic resonance imaging (fMRI) is a neuroimaging technique known for its ability to capture brain activity non-invasively and at fine spatial resolution (2-3mm). Cortical surface fMRI (cs-fMRI) is a recent development of fMRI that focuses on signals from tissues that have neuronal activities, as opposed to the whole brain. cs-fMRI data is plagued with non-stationary spatial correlations and long temporal dependence which, if inadequately accounted for, can hinder downstream statistical analyses. We propose a fully integrated approach that captures both spatial non-stationarity and varying ranges of temporal dependence across regions of interest. More specifically, we impose non-stationary spatial priors on the latent activation fields and model temporal dependence via fractional Gaussian errors of varying Hurst parameters, which can be studied through a wavelet transformation and its coefficients' variances at different scales. We demonstrate the performance of our proposed approach through simulations and an application to a visual working memory task cs-fMRI dataset.

Submitted 26 December, 2023; originally announced December 2023.

arXiv:2307.09820 [pdf, other]

Contrasting pre-vaccine COVID-19 waves in Italy through Functional Data Analysis

Authors: Tobia Boschi, Jacopo Di Iorio, Lorenzo Testa, Marzia A. Cremona, Francesca Chiaromonte

Abstract: We use data from 107 Italian provinces to characterize and compare mortality patterns in the first two COVID-19 epidemic waves, which occurred prior to the introduction of vaccines. We also associate these patterns with mobility, timing of government restrictions, and socio-demographic, infrastructural, and environmental covariates. Notwithstanding limitations in the accuracy and reliability of publicly available data, we are able to exploit information in curves and shapes through Functional Data Analysis techniques. Specifically, we document differences in magnitude and variability between the two waves; while both were characterized by a co-occurrence of 'exponential' and 'mild' mortality patterns, the second spread much more broadly and asynchronously through the country. Moreover, we find evidence of a significant positive association between local mobility and mortality in both epidemic waves and corroborate the effectiveness of timely restrictions in curbing mortality. The techniques we describe could capture additional signals of interest if applied, for instance, to data on cases and positivity rates. However, we show that the quality of such data, at least in the case of Italian provinces, was too poor to support meaningful analyses.

Submitted 19 July, 2023; originally announced July 2023.

Comments: main: 12 pages, 5 figures; supplement: 8 pages, 11 figures

arXiv:2306.04254 [pdf, other]

funBIalign: a hierarchical algorithm for functional motif discovery based on mean squared residue scores

Authors: Jacopo Di Iorio, Marzia A. Cremona, Francesca Chiaromonte

Abstract: Motif discovery is gaining increasing attention in the domain of functional data analysis. Functional motifs are typical "shapes" or "patterns" that recur multiple times in different portions of a single curve and/or in misaligned portions of multiple curves. In this paper, we define functional motifs using an additive model and we propose funBIalign for their discovery and evaluation. Inspired by clustering and biclustering techniques, funBIalign is a multi-step procedure which uses agglomerative hierarchical clustering with complete linkage and a functional distance based on mean squared residue scores to discover functional motifs, both in a single curve (e.g., time series) and in a set of curves. We assess its performance and compare it to other recent methods through extensive simulations. Moreover, we use funBIalign for discovering motifs in two real-data case studies: one on food price inflation and one on temperature changes.

Submitted 7 June, 2023; originally announced June 2023.

arXiv:2303.14801 [pdf, other]

FAStEN: an efficient adaptive method for feature selection and estimation in high-dimensional functional regressions

Authors: Tobia Boschi, Lorenzo Testa, Francesca Chiaromonte, Matthew Reimherr

Abstract: Functional regression analysis is an established tool for many contemporary scientific applications. Regression problems involving large and complex data sets are ubiquitous, and feature selection is crucial for avoiding overfitting and achieving accurate predictions. We propose a new, flexible and ultra-efficient approach to perform feature selection in a sparse high dimensional function-on-function regression problem, and we show how to extend it to the scalar-on-function framework. Our method, called FAStEN, combines functional data, optimization, and machine learning techniques to perform feature selection and parameter estimation simultaneously. We exploit the properties of Functional Principal Components and the sparsity inherent to the Dual Augmented Lagrangian problem to significantly reduce computational cost, and we introduce an adaptive scheme to improve selection accuracy. In addition, we derive asymptotic oracle properties, which guarantee estimation and selection consistency for the proposed FAStEN estimator. Through an extensive simulation study, we benchmark our approach to the best existing competitors and demonstrate a massive gain in terms of CPU time and selection performance, without sacrificing the quality of the coefficients' estimation. The theoretical derivations and the simulation study provide a strong motivation for our approach. Finally, we present an application to brain fMRI data from the AOMIC PIOP1 study.

Submitted 4 September, 2023; v1 submitted 26 March, 2023; originally announced March 2023.

arXiv:2206.05718 [pdf, other]

smoothEM: a new approach for the simultaneous assessment of smooth patterns and spikes

Authors: Huy Dang, Marzia Cremona, Francesca Chiaromonte

Abstract: We consider functional data where an underlying smooth curve is composed not just with errors, but also with irregular spikes. We propose an approach that, combining regularized spline smoothing and an Expectation-Maximization algorithm, allows one to both identify spikes and estimate the smooth component. Imposing some assumptions on the error distribution, we prove consistency of EM estimates. Next, we demonstrate the performance of our proposal on finite samples and its robustness to assumptions violations through simulations. Finally, we apply our proposal to data on the annual heatwaves index in the US and on weekly electricity consumption in Ireland. In both datasets, we are able to characterize underlying smooth trends and to pinpoint irregular/extreme behaviors.

Submitted 16 July, 2023; v1 submitted 12 June, 2022; originally announced June 2022.

arXiv:2202.12859 [pdf, other]

doi 10.1007/s41109-022-00482-y

Venture Capital investments through the lens of Network and Functional Data Analysis

Authors: Christian Esposito, Marco Gortan, Lorenzo Testa, Francesca Chiaromonte, Giorgio Fagiolo, Andrea Mina, Giulio Rossetti

Abstract: In this paper we characterize the performance of venture capital-backed firms based on their ability to attract investment. The aim of the study is to identify relevant predictors of success built from the network structure of firms' and investors' relations. Focusing on deal-level data for the health sector, we first create a bipartite network among firms and investors, and then apply functional data analysis (FDA) to derive progressively more refined indicators of success captured by a binary, a scalar and a functional outcome. More specifically, we use different network centrality measures to capture the role of early investments for the success of the firm. Our results, which are robust to different specifications, suggest that success has a strong positive association with centrality measures of the firm and of its large investors, and a weaker but still detectable association with centrality measures of small investors and features describing firms as knowledge bridges. Finally, based on our analyses, success is not associated with firms' and investors' spreading power (harmonic centrality), nor with the tightness of investors' community (clustering coefficient) and spreading ability (VoteRank).

Submitted 10 August, 2022; v1 submitted 25 February, 2022; originally announced February 2022.

Comments: 17 pages, 9 figures, supplementary material attached

Journal ref: Applied Network Science 7, 42 (2022)

arXiv:2111.06371 [pdf, ps, other]

doi 10.1007/978-3-030-93409-5_61

Can you always reap what you sow? Network and functional data analysis of VC investments in health-tech companies

Authors: Christian Esposito, Marco Gortan, Lorenzo Testa, Francesca Chiaromonte, Giorgio Fagiolo, Andrea Mina, Giulio Rossetti

Abstract: "Success" of firms in venture capital markets is hard to define, and its determinants are still poorly understood. We build a bipartite network of investors and firms in the healthcare sector, describing its structure and its communities. Then, we characterize "success" introducing progressively more refined definitions, and we find a positive association between such definitions and the centrality of a company. In particular, we are able to cluster funding trajectories of firms into two groups capturing different "success" regimes and to link the probability of belonging to one or the other to their network features (in particular their centrality and the one of their investors). We further investigate this positive association by introducing scalar as well as functional "success" outcomes, confirming our findings and their robustness.

Submitted 9 November, 2021; originally announced November 2021.

Comments: 12 pages, 4 figures, accepted for publication in the proceedings of the 10th International Conference on Complex Networks and Their Applications

Journal ref: Proceedings of 10th International Conference of Complex Networks and their applications 2021

arXiv:2106.11941 [pdf, ps, other]

Doubly Robust Feature Selection with Mean and Variance Outlier Detection and Oracle Properties

Authors: Luca Insolia, Francesca Chiaromonte, Runze Li, Marco Riani

Abstract: We propose a general approach to handle data contaminations that might disrupt the performance of feature selection and estimation procedures for high-dimensional linear models. Specifically, we consider the co-occurrence of mean-shift and variance-inflation outliers, which can be modeled as additional fixed and random components, respectively, and evaluated independently. Our proposal performs feature selection while detecting and down-weighting variance-inflation outliers, detecting and excluding mean-shift outliers, and retaining non-outlying cases with full weights. Feature selection and mean-shift outlier detection are performed through a robust class of nonconcave penalization methods. Variance-inflation outlier detection is based on the penalization of the restricted posterior mode. The resulting approach satisfies a robust oracle property for feature selection in the presence of data contamination -- which allows the number of features to exponentially increase with the sample size -- and detects truly outlying cases of each type with asymptotic probability one. This provides an optimal trade-off between a high breakdown point and efficiency. Computationally efficient heuristic procedures are also presented. We illustrate the finite-sample performance of our proposal through an extensive simulation study and a real-world application.

Submitted 22 June, 2021; originally announced June 2021.

Comments: 35 pages, 9 figures (including supplementary material)

arXiv:2104.09452 [pdf, other]

Epsilon Consistent Mixup: Structural Regularization with an Adaptive Consistency-Interpolation Tradeoff

Authors: Vincent Pisztora, Yanglan Ou, Xiaolei Huang, Francesca Chiaromonte, Jia Li

Abstract: In this paper we propose $ε$-Consistent Mixup ($ε$mu). $ε$mu is a data-based structural regularization technique that combines Mixup's linear interpolation with consistency regularization in the Mixup direction, by compelling a simple adaptive tradeoff between the two. This learnable combination of consistency and interpolation induces a more flexible structure on the evolution of the response across the feature space and is shown to improve semi-supervised classification accuracy on the SVHN and CIFAR10 benchmark datasets, yielding the largest gains in the most challenging low label-availability scenarios. Empirical studies comparing $ε$mu and Mixup are presented and provide insight into the mechanisms behind $ε$mu's effectiveness. In particular, $ε$mu is found to produce more accurate synthetic labels and more confident predictions than Mixup.

Submitted 29 September, 2021; v1 submitted 19 April, 2021; originally announced April 2021.

arXiv:2008.04700 [pdf, other]

doi 10.1038/s41598-021-95866-y

The shapes of an epidemic: using Functional Data Analysis to characterize COVID-19 in Italy

Authors: Tobia Boschi, Jacopo Di Iorio, Lorenzo Testa, Marzia A. Cremona, Francesca Chiaromonte

Abstract: We investigate patterns of COVID-19 mortality across 20 Italian regions and their association with mobility, positivity, and socio-demographic, infrastructural and environmental covariates. Notwithstanding limitations in accuracy and resolution of the data available from public sources, we pinpoint significant trends exploiting information in curves and shapes with Functional Data Analysis techniques. These depict two starkly different epidemics: an "exponential" one unfolding in Lombardia and the worst hit areas of the north, and a milder, "flat(tened)" one in the rest of the country -- including Veneto, where cases appeared concurrently with Lombardia but aggressive testing was implemented early on. We find that mobility and positivity can predict COVID-19 mortality, also when controlling for relevant covariates. Among the latter, primary care appears to mitigate mortality, and contacts in hospitals, schools and work places to aggravate it. The techniques we describe could capture additional and potentially sharper signals if applied to richer data.

Submitted 11 August, 2020; originally announced August 2020.

MSC Class: 62P10; 62R10

ACM Class: J.3

Journal ref: Scientific Reports volume 11, Article number: 17054 (2021)

arXiv:2007.06114 [pdf, ps, other]

Simultaneous Feature Selection and Outlier Detection with Optimality Guarantees

Authors: Luca Insolia, Ana Kenney, Francesca Chiaromonte, Giovanni Felici

Abstract: Sparse estimation methods capable of tolerating outliers have been broadly investigated in the last decade. We contribute to this research considering high-dimensional regression problems contaminated by multiple mean-shift outliers which affect both the response and the design matrix. We develop a general framework for this class of problems and propose the use of mixed-integer programming to simultaneously perform feature selection and outlier detection with provably optimal guarantees. We characterize the theoretical properties of our approach, i.e. a necessary and sufficient condition for the robustly strong oracle property, which allows the number of features to exponentially increase with the sample size; the optimal estimation of the parameters; and the breakdown point of the resulting estimates. Moreover, we provide computationally efficient procedures to tune integer constraints and to warm-start the algorithm. We show the superior performance of our proposal compared to existing heuristic methods through numerical simulations and an application investigating the relationships between the human microbiome and childhood obesity.

Submitted 12 July, 2020; originally announced July 2020.

arXiv:2006.03970 [pdf, other]

An Efficient Semi-smooth Newton Augmented Lagrangian Method for Elastic Net

Authors: Tobia Boschi, Matthew Reimherr, Francesca Chiaromonte

Abstract: Feature selection is an important and active research area in statistics and machine learning. The Elastic Net is often used to perform selection when the features present non-negligible collinearity or practitioners wish to incorporate additional known structure. In this article, we propose a new Semi-smooth Newton Augmented Lagrangian Method to efficiently solve the Elastic Net in ultra-high dimensional settings. Our new algorithm exploits both the sparsity induced by the Elastic Net penalty and the sparsity due to the second order information of the augmented Lagrangian. This greatly reduces the computational cost of the problem. Using simulations on both synthetic and real datasets, we demonstrate that our approach outperforms its best competitors by at least an order of magnitude in terms of CPU time. We also apply our approach to a Genome Wide Association Study on childhood obesity.

Submitted 6 June, 2020; originally announced June 2020.

MSC Class: 62J07

ACM Class: G.3

arXiv:2006.03141 [pdf, other]

The relationship between human mobility and viral transmissibility during the COVID-19 epidemics in Italy

Authors: Paolo Cintia, Luca Pappalardo, Salvatore Rinzivillo, Daniele Fadda, Tobia Boschi, Fosca Giannotti, Francesca Chiaromonte, Pietro Bonato, Francesco Fabbri, Francesco Penone, Marcello Savarese, Francesco Calabrese, Giorgio Guzzetta, Flavia Riccardo, Valentina Marziano, Piero Poletti, Filippo Trentini, Antonino Bella, Xanthi Andrianou, Martina Del Manso, Massimo Fabiani, Stefania Bellino, Stefano Boros, Alberto Mateo Urdiales, Maria Fenicia Vescio, et al. (7 additional authors not shown)

Abstract: In 2020, countries affected by the COVID-19 pandemic implemented various non-pharmaceutical interventions to contrast the spread of the virus and its impact on their healthcare systems and economies. Using Italian data at different geographic scales, we investigate the relationship between human mobility, which subsumes many facets of the population's response to the changing situation, and the spread of COVID-19. Leveraging mobile phone data from February through September 2020, we find a striking relationship between the decrease in mobility flows and the net reproduction number. We find that the time needed to switch off mobility and bring the net reproduction number below the critical threshold of 1 is about one week. Moreover, we observe a strong relationship between the number of days spent above such threshold before the lockdown-induced drop in mobility flows and the total number of infections per 100k inhabitants. Estimating the statistical effect of mobility flows on the net reproduction number over time, we document a 2-week lag positive association, strong in March and April, and weaker but still significant in June. Our study demonstrates the value of big mobility data to monitor the epidemic and inform control interventions during its unfolding.

Submitted 1 April, 2021; v1 submitted 4 June, 2020; originally announced June 2020.

arXiv:1907.11142 [pdf, other]

doi 10.1093/bioinformatics/btaa060

On the bias of H-scores for comparing biclusters, and how to correct it

Authors: Jacopo Di Iorio, Francesca Chiaromonte, Marzia A. Cremona

Abstract: In the last two decades several biclustering methods have been developed as new unsupervised learning techniques to simultaneously cluster rows and columns of a data matrix. These algorithms play a central role in contemporary machine learning and in many applications, e.g. to computational biology and bioinformatics. The H-score is the evaluation score underlying the seminal biclustering algorithm by Cheng and Church, as well as many other subsequent biclustering methods. In this paper, we characterize a potentially troublesome bias in this score, that can distort biclustering results. We prove, both analytically and by simulation, that the average H-score increases with the number of rows/columns in a bicluster. This makes the H-score, and hence all algorithms based on it, biased towards small clusters. Based on our analytical proof, we are able to provide a straightforward way to correct this bias, allowing users to accurately compare biclusters.

Submitted 24 July, 2019; originally announced July 2019.

Comments: 12 pages, 3 figures

Journal ref: Bioinformatics 2020, 36(1): 2955-2957
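As a rough illustration of the score discussed in the entry above, the sketch below computes the H-score of Cheng and Church, i.e. the mean squared residue of a bicluster after removing row means and column means and adding back the overall mean. The function name `h_score` and the toy matrices are assumptions made for this example; it is not the authors' code and does not include the bias correction their paper proposes.

```python
def h_score(matrix):
    """Mean squared residue (Cheng and Church H-score) of a bicluster.

    `matrix` is a non-empty list of equal-length rows of numbers
    (a hypothetical input format chosen for this sketch).
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0])

    # Row means, column means, and the overall mean of the bicluster.
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [
        sum(matrix[i][j] for i in range(n_rows)) / n_rows
        for j in range(n_cols)
    ]
    overall_mean = sum(row_means) / n_rows

    # Residue of each entry with respect to an additive row + column model.
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = matrix[i][j] - row_means[i] - col_means[j] + overall_mean
            total += residue ** 2
    return total / (n_rows * n_cols)


# A perfectly additive bicluster (each row is a shifted copy of the first).
additive = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [4.0, 5.0, 6.0]]
# A small bicluster that deviates from the additive pattern.
checkerboard = [[1.0, 0.0], [0.0, 1.0]]

print(h_score(additive))
print(h_score(checkerboard))
```

A perfectly additive bicluster has an H-score of zero up to floating-point rounding, while departures from the additive pattern give a positive score.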

arXiv:1808.04773 [pdf, other]

doi 10.1080/10618600.2022.2156522

Probabilistic $K$-mean with local alignment for clustering and motif discovery in functional data

Authors: Marzia A. Cremona, Francesca Chiaromonte

Abstract: We develop a new method to locally cluster curves and discover functional motifs, i.e. typical "shapes" that may recur several times along and across the curves capturing important local characteristics. In order to identify these shared curve portions, our method leverages ideas from functional data analysis (joint clustering and alignment of curves), bioinformatics (local alignment through the extension of high similarity seeds) and fuzzy clustering (curves belonging to more than one cluster, if they contain more than one typical "shape"). It can employ various dissimilarity measures and incorporate derivatives in the discovery process, thus exploiting complex facets of shapes. We demonstrate the performance of our method with an extensive simulation study, and show how it generalizes other clustering methods for functional data. Finally, we provide real data applications to Berkeley growth data, Italian Covid-19 death curves and "Omics" data related to mutagenesis.

Submitted 7 July, 2020; v1 submitted 14 August, 2018; originally announced August 2018.

Comments: 22 pages, 6 figures. This work has been presented at various conferences

Journal ref: Journal of Computational and Graphical Statistics 2022

arXiv:1808.02526 [pdf, other]

MIP-BOOST: Efficient and Effective $L_0$ Feature Selection for Linear Regression

Authors: Ana Kenney, Francesca Chiaromonte, Giovanni Felici

Abstract: Recent advances in mathematical programming have made Mixed Integer Optimization a competitive alternative to popular regularization methods for selecting features in regression problems. The approach exhibits unquestionable foundational appeal and versatility, but also poses important challenges. Here we propose MIP-BOOST, a revision of standard Mixed Integer Programming feature selection that reduces the computational burden of tuning the critical sparsity bound parameter and improves performance in the presence of feature collinearity and of signals that vary in nature and strength. The final outcome is a more efficient and effective $L_0$ Feature Selection method for applications of realistic size and complexity, grounded on rigorous cross-validation tuning and exact optimization of the associated Mixed Integer Program. Computational viability and improved performance in realistic scenarios is achieved through three independent but synergistic proposals.

Submitted 30 September, 2019; v1 submitted 7 August, 2018; originally announced August 2018.

Comments: This work has been presented at JSM 2018 (Vancouver, Canada), ISNPS 2018 (Salerno, Italy), and various other conferences

arXiv:1506.08278 [pdf, other]

Composite likelihood inference in a discrete latent variable model for two-way "clustering-by-segmentation" problems

Authors: Francesco Bartolucci, Francesca Chiaromonte, Prabhani Kuruppumullage Don, Bruce George Lindsay

Abstract: We consider a discrete latent variable model for two-way data arrays, which allows one to simultaneously produce clusters along one of the data dimensions (e.g. exchangeable observational units or features) and contiguous groups, or segments, along the other (e.g. consecutively ordered times or locations). The model relies on a hidden Markov structure but, given its complexity, cannot be estimated by full maximum likelihood. We therefore introduce composite likelihood methodology based on considering different subsets of the data. The proposed approach is illustrated by simulation, and with an application to genomic data.

Submitted 27 June, 2015; originally announced June 2015.

arXiv:1401.5506 [pdf, other]

An attraction-repulsion point process model for respiratory syncytial virus infections

Authors: Joshua Goldstein, Murali Haran, Ivan Simeonov, John Fricks, Francesca Chiaromonte

Abstract: How is the progression of a virus influenced by properties intrinsic to individual cells? We address this question by studying the susceptibility of cells infected with two strains of the human respiratory syncytial virus (RSV-A and RSV-B) in an in vitro experiment. Spatial patterns of infected cells give us insight into how local conditions influence susceptibility to the virus. We observe a complicated attraction and repulsion behavior, a tendency for infected cells to lump together or remain apart. We develop a new spatial point process model to describe this behavior. Inference on spatial point processes is difficult because the likelihood functions of these models contain intractable normalizing constants; we adapt an MCMC algorithm called double Metropolis-Hastings to overcome this computational challenge. Our methods are computationally efficient even for large point patterns consisting of over 10,000 points. We illustrate the application of our model and inferential approach to simulated data examples and fit our model to various RSV experiments. Because our model parameters are easy to interpret, we are able to draw meaningful scientific conclusions from the fitted models.

Submitted 13 July, 2014; v1 submitted 21 January, 2014; originally announced January 2014.

Showing 1–19 of 19 results for author: Chiaromonte, F