Search | arXiv e-print repository

arXiv:2404.17886 [pdf, other]

Feature graphs for interpretable unsupervised tree ensembles: centrality, interaction, and application in disease subtyping

Authors: Christel Sirocchi, Martin Urschler, Bastian Pfeifer

Abstract: Interpretable machine learning has emerged as central in leveraging artificial intelligence within high-stakes domains such as healthcare, where understanding the rationale behind model predictions is as critical as achieving high predictive accuracy. In this context, feature selection assumes a pivotal role in enhancing model interpretability by identifying the most important input features in black-box models. While random forests are frequently used in biomedicine for their remarkable performance on tabular datasets, the accuracy gained from aggregating decision trees comes at the expense of interpretability. Consequently, feature selection for enhancing interpretability in random forests has been extensively explored in supervised settings. However, its investigation in the unsupervised regime remains notably limited. To address this gap, the study introduces novel methods to construct feature graphs from unsupervised random forests and feature selection strategies to derive effective feature combinations from these graphs. Feature graphs are constructed for the entire dataset as well as individual clusters leveraging the parent-child node splits within the trees, such that feature centrality captures their relevance to the clustering task, while edge weights reflect the discriminating power of feature pairs. Graph-based feature selection methods are extensively evaluated on synthetic and benchmark datasets both in terms of their ability to reduce dimensionality while improving clustering performance, as well as to enhance model interpretability. An application on omics data for disease subtyping identifies the top features for each cluster, showcasing the potential of the proposed approach to enhance interpretability in clustering analyses and its utility in a real-world biomedical application.

Submitted 27 April, 2024; originally announced April 2024.

ACM Class: I.2.1; I.5.3; J.3

arXiv:2401.16094 [pdf, other]

Federated unsupervised random forest for privacy-preserving patient stratification

Authors: Bastian Pfeifer, Christel Sirocchi, Marcus D. Bloice, Markus Kreuzthaler, Martin Urschler

Abstract: In the realm of precision medicine, effective patient stratification and disease subtyping demand innovative methodologies tailored for multi-omics data. Clustering techniques applied to multi-omics data have become instrumental in identifying distinct subgroups of patients, enabling a finer-grained understanding of disease variability. This work establishes a powerful framework for advancing precision medicine through unsupervised random-forest-based clustering and federated computing. We introduce a novel multi-omics clustering approach utilizing unsupervised random-forests. The unsupervised nature of the random forest enables the determination of cluster-specific feature importance, unraveling key molecular contributors to distinct patient groups. Moreover, our methodology is designed for federated execution, a crucial aspect in the medical domain where privacy concerns are paramount. We have validated our approach on machine learning benchmark data sets as well as on cancer data from The Cancer Genome Atlas (TCGA). Our method is competitive with the state-of-the-art in terms of disease subtyping, but at the same time substantially improves the cluster interpretability. Experiments indicate that local clustering performance can be improved through federated computing.

Submitted 29 January, 2024; originally announced January 2024.

arXiv:2309.01144 [pdf, other]

Distributed averaging for accuracy prediction in networked systems

Authors: Christel Sirocchi, Alessandro Bogliolo

Abstract: Distributed averaging is among the most relevant cooperative control problems, with applications in sensor and robotic networks, distributed signal processing, data fusion, and load balancing. Consensus and gossip algorithms have been investigated and successfully deployed in multi-agent systems to perform distributed averaging in synchronous and asynchronous settings. This study proposes a heuristic approach to estimate the convergence rate of averaging algorithms in a distributed manner, relying on the computation and propagation of local graph metrics while entailing simple data elaboration and small message passing. The protocol enables nodes to predict the time (or the number of interactions) needed to estimate the global average with the desired accuracy. Consequently, nodes can make informed decisions on their use of measured and estimated data while gaining awareness of the global structure of the network, as well as their role in it. The study presents relevant applications to outliers identification and performance evaluation in switching topologies.

Submitted 3 September, 2023; originally announced September 2023.

ACM Class: C.2.4; C.4

arXiv:2205.14740 [pdf, other]

Investigating Participation Mechanisms in EU Code Week

Authors: Christel Sirocchi, Annika Ostergren Pofantis, Alessandro Bogliolo

Abstract: Digital competence (DC) is a broad set of skills, attitudes, and knowledge for confident, critical and responsible use of digital technologies in every aspect of life. DC is fundamental to all people in conducting a productive and fulfilling life in an increasingly digital world. However, prejudices, misconceptions, and lack of awareness reduce the diffusion of DC, hindering digital transformation and preventing countries and people from realising their full potential. Teaching Informatics in the curriculum is increasingly supported by the institutions but faces serious challenges, such as teacher upskilling and support, and will require several years to observe sizeable outcomes. In response, grassroots movements promoting computing literacy in an informal setting have grown, including EU Code Week, whose vision is to develop computing skills while promoting diversity and raising awareness of the importance of digital skills. Code Week participation is a form of public engagement that could be affected by socio-economic and demographic factors, as any other form of participation. The aim of the manuscript is twofold: first, to offer a detailed and comprehensive statistical description of Code Week's participation in the EU Member States in terms of penetration, retention, demographic composition, and spatial distribution in order to inform more effective awareness-raising campaigns; second, to investigate the impact of socio-economic factors on Code Week involvement. The study identifies a strong negative correlation between participation and income at different geographical scales. It also suggests underlying mechanisms driving participation that are coherent with the "psychosocial" and the "resource" views, i.e. the two most widely accepted explanations of the effect of income on public engagement.

Submitted 29 May, 2022; originally announced May 2022.

Comments: 39 pages, 12 figures

ACM Class: K.3.2

Showing 1–4 of 4 results for author: Sirocchi, C