-
Federated Learning for Non-factorizable Models using Deep Generative Prior Approximations
Authors:
Conor Hassan,
Joshua J Bon,
Elizaveta Semenova,
Antonietta Mira,
Kerrie Mengersen
Abstract:
Federated learning (FL) allows for collaborative model training across decentralized clients while preserving privacy by avoiding data sharing. However, current FL methods assume conditional independence between client models, limiting the use of priors that capture dependence, such as Gaussian processes (GPs). We introduce the Structured Independence via deep Generative Model Approximation (SIGMA) prior which enables FL for non-factorizable models across clients, expanding the applicability of FL to fields such as spatial statistics, epidemiology, environmental science, and other domains where modeling dependencies is crucial. The SIGMA prior is a pre-trained deep generative model that approximates the desired prior and induces a specified conditional independence structure in the latent variables, creating an approximate model suitable for FL settings. We demonstrate the SIGMA prior's effectiveness on synthetic data and showcase its utility in a real-world example of FL for spatial data, using a conditional autoregressive prior to model spatial dependence across Australia. Our work enables new FL applications in domains where modeling dependent data is essential for accurate predictions and decision-making.
Submitted 25 May, 2024;
originally announced May 2024.
-
Scalable Vertical Federated Learning via Data Augmentation and Amortized Inference
Authors:
Conor Hassan,
Matthew Sutton,
Antonietta Mira,
Kerrie Mengersen
Abstract:
Vertical federated learning (VFL) has emerged as a paradigm for collaborative model estimation across multiple clients, each holding a distinct set of covariates. This paper introduces the first comprehensive framework for fitting Bayesian models in the VFL setting. We propose a novel approach that leverages data augmentation techniques to transform VFL problems into a form compatible with existing Bayesian federated learning algorithms. We present an innovative model formulation for specific VFL scenarios where the joint likelihood factorizes into a product of client-specific likelihoods. To mitigate the dimensionality challenge posed by data augmentation, which scales with the number of observations and clients, we develop a factorized amortized variational approximation that achieves scalability independent of the number of observations. We showcase the efficacy of our framework through extensive numerical experiments on logistic regression, multilevel regression, and a novel hierarchical Bayesian split neural net model. Our work paves the way for privacy-preserving, decentralized Bayesian inference in vertically partitioned data scenarios, opening up new avenues for research and applications in various domains.
Submitted 7 May, 2024;
originally announced May 2024.
-
REAL-Colon: A dataset for developing real-world AI applications in colonoscopy
Authors:
Carlo Biffi,
Giulio Antonelli,
Sebastian Bernhofer,
Cesare Hassan,
Daizen Hirata,
Mineo Iwatate,
Andreas Maieron,
Pietro Salvagnini,
Andrea Cherubini
Abstract:
Detection and diagnosis of colon polyps are key to preventing colorectal cancer. Recent evidence suggests that AI-based computer-aided detection (CADe) and computer-aided diagnosis (CADx) systems can enhance endoscopists' performance and boost colonoscopy effectiveness. However, most available public datasets primarily consist of still images or video clips, often at a down-sampled resolution, and do not accurately represent real-world colonoscopy procedures. We introduce the REAL-Colon (Real-world multi-center Endoscopy Annotated video Library) dataset: a compilation of 2.7M native video frames from sixty full-resolution, real-world colonoscopy recordings across multiple centers. The dataset contains 350k bounding-box annotations, each created under the supervision of expert gastroenterologists. Comprehensive patient clinical data, colonoscopy acquisition information, and polyp histopathological information are also included in each video. With its unprecedented size, quality, and heterogeneity, the REAL-Colon dataset is a unique resource for researchers and developers aiming to advance AI research in colonoscopy. Its openness and transparency facilitate rigorous and reproducible research, fostering the development and benchmarking of more accurate and reliable colonoscopy-related algorithms and models.
Submitted 4 March, 2024;
originally announced March 2024.
-
Bayesian Cluster Geographically Weighted Regression for Spatial Heterogeneous Data
Authors:
Wala Draidi Areed,
Aiden Price,
Helen Thompson,
Conor Hassan,
Reid Malseed,
Kerrie Mengersen
Abstract:
Spatial statistical models are commonly used in geographical scenarios to ensure spatial variation is captured effectively. However, spatial models and cluster algorithms can be complicated and expensive. This paper pursues three main objectives. First, it introduces covariate effect clustering by integrating a Bayesian Geographically Weighted Regression (BGWR) with a Gaussian mixture model and the Dirichlet process mixture model. Second, this paper examines situations in which a particular covariate holds significant importance in one region but not in another in the Bayesian framework. Lastly, it addresses computational challenges present in existing BGWR, leading to notable enhancements in Markov chain Monte Carlo estimation suitable for large spatial datasets. The efficacy of the proposed method is demonstrated using simulated data and is further validated in a case study examining children's development domains in Queensland, Australia, using data provided by Children's Health Queensland and Australia's Early Development Census.
Submitted 20 November, 2023;
originally announced November 2023.
-
Deep Generative Models, Synthetic Tabular Data, and Differential Privacy: An Overview and Synthesis
Authors:
Conor Hassan,
Robert Salomone,
Kerrie Mengersen
Abstract:
This article provides a comprehensive synthesis of the recent developments in synthetic data generation via deep generative models, focusing on tabular datasets. We specifically outline the importance of synthetic data generation in the context of privacy-sensitive data. Additionally, we highlight the advantages of using deep generative models over other methods and provide a detailed explanation of the underlying concepts, including unsupervised learning, neural networks, and generative models. The paper covers the challenges and considerations involved in using deep generative models for tabular datasets, such as data normalization, privacy concerns, and model evaluation. This review provides a valuable resource for researchers and practitioners interested in synthetic data generation and its applications.
Submitted 27 August, 2023; v1 submitted 28 July, 2023;
originally announced July 2023.
-
Federated Variational Inference Methods for Structured Latent Variable Models
Authors:
Conor Hassan,
Robert Salomone,
Kerrie Mengersen
Abstract:
Federated learning methods enable model training across distributed data sources without data leaving their original locations and have gained increasing interest in various fields. However, existing approaches are limited, excluding many structured probabilistic models. We present a general and elegant solution based on structured variational inference, widely used in Bayesian machine learning, adapted for the federated setting. Additionally, we provide a communication-efficient variant analogous to the canonical FedAvg algorithm. The proposed algorithms' effectiveness is demonstrated, and their performance is compared with hierarchical Bayesian neural networks and topic models.
Submitted 7 July, 2023; v1 submitted 7 February, 2023;
originally announced February 2023.
-
Being Bayesian in the 2020s: opportunities and challenges in the practice of modern applied Bayesian statistics
Authors:
Joshua J. Bon,
Adam Bretherton,
Katie Buchhorn,
Susanna Cramb,
Christopher Drovandi,
Conor Hassan,
Adrianne L. Jenner,
Helen J. Mayfield,
James M. McGree,
Kerrie Mengersen,
Aiden Price,
Robert Salomone,
Edgar Santos-Fernandez,
Julie Vercelloni,
Xiaoyu Wang
Abstract:
Building on a strong foundation of philosophy, theory, methods and computation over the past three decades, Bayesian approaches are now an integral part of the toolkit for most statisticians and data scientists. Whether they are dedicated Bayesians or opportunistic users, applied professionals can now reap many of the benefits afforded by the Bayesian paradigm. In this paper, we touch on six modern opportunities and challenges in applied Bayesian statistics: intelligent data collection, new data sources, federated analysis, inference for implicit models, model transfer and purposeful software products.
Submitted 17 January, 2023; v1 submitted 17 November, 2022;
originally announced November 2022.