Search | arXiv e-print repository

LaTable: Towards Large Tabular Models

Authors: Boris van Breugel, Jonathan Crabbé, Rob Davis, Mihaela van der Schaar

Abstract: Tabular data is one of the most ubiquitous modalities, yet the literature on tabular generative foundation models is lagging far behind its text and vision counterparts. Creating such a model is hard, due to the heterogeneous feature spaces of different tabular datasets, tabular metadata (e.g. dataset description and feature headers), and tables lacking prior knowledge (e.g. feature order). In thi… ▽ More Tabular data is one of the most ubiquitous modalities, yet the literature on tabular generative foundation models is lagging far behind its text and vision counterparts. Creating such a model is hard, due to the heterogeneous feature spaces of different tabular datasets, tabular metadata (e.g. dataset description and feature headers), and tables lacking prior knowledge (e.g. feature order). In this work we propose LaTable: a novel tabular diffusion model that addresses these challenges and can be trained across different datasets. Through extensive experiments we find that LaTable outperforms baselines on in-distribution generation, and that finetuning LaTable can generate out-of-distribution datasets better with fewer samples. On the other hand, we explore the poor zero-shot performance of LaTable, and what it may teach us about building generative tabular foundation models with better zero- and few-shot generation capabilities. △ Less

Submitted 25 June, 2024; originally announced June 2024.

arXiv:2405.01147 [pdf, other]

Why Tabular Foundation Models Should Be a Research Priority

Authors: Boris van Breugel, Mihaela van der Schaar

Abstract: Recent text and image foundation models are incredibly impressive, and these models are attracting an ever-increasing portion of research resources. In this position piece we aim to shift the ML research community's priorities ever so slightly to a different modality: tabular data. Tabular data is the dominant modality in many fields, yet it is given hardly any research attention and significantly… ▽ More Recent text and image foundation models are incredibly impressive, and these models are attracting an ever-increasing portion of research resources. In this position piece we aim to shift the ML research community's priorities ever so slightly to a different modality: tabular data. Tabular data is the dominant modality in many fields, yet it is given hardly any research attention and significantly lags behind in terms of scale and power. We believe the time is now to start develo** tabular foundation models, or what we coin a Large Tabular Model (LTM). LTMs could revolutionise the way science and ML use tabular data: not as single datasets that are analyzed in a vacuum, but contextualized with respect to related datasets. The potential impact is far-reaching: from few-shot tabular models to automating data science; from out-of-distribution synthetic data to empowering multidisciplinary scientific discovery. We intend to excite reflections on the modalities we study, and convince some researchers to study large tabular models. △ Less

Submitted 2 June, 2024; v1 submitted 2 May, 2024; originally announced May 2024.

Comments: Accepted at International Conference on Machine Learning (ICML 2024)

arXiv:2312.12865 [pdf, other]

RadEdit: stress-testing biomedical vision models via diffusion image editing

Authors: Fernando Pérez-García, Sam Bond-Taylor, Pedro P. Sanchez, Boris van Breugel, Daniel C. Castro, Harshita Sharma, Valentina Salvatelli, Maria T. A. Wetscherek, Hannah Richardson, Matthew P. Lungren, Aditya Nori, Javier Alvarez-Valle, Ozan Oktay, Maximilian Ilse

Abstract: Biomedical imaging datasets are often small and biased, meaning that real-world performance of predictive models can be substantially lower than expected from internal testing. This work proposes using generative image editing to simulate dataset shifts and diagnose failure modes of biomedical vision models; this can be used in advance of deployment to assess readiness, potentially reducing cost a… ▽ More Biomedical imaging datasets are often small and biased, meaning that real-world performance of predictive models can be substantially lower than expected from internal testing. This work proposes using generative image editing to simulate dataset shifts and diagnose failure modes of biomedical vision models; this can be used in advance of deployment to assess readiness, potentially reducing cost and patient harm. Existing editing methods can produce undesirable changes, with spurious correlations learned due to the co-occurrence of disease and treatment interventions, limiting practical applicability. To address this, we train a text-to-image diffusion model on multiple chest X-ray datasets and introduce a new editing method RadEdit that uses multiple masks, if present, to constrain changes and ensure consistency in the edited images. We consider three types of dataset shifts: acquisition shift, manifestation shift, and population shift, and demonstrate that our approach can diagnose failures and quantify model robustness without additional data collection, complementing more qualitative tools for explainable AI. △ Less

Submitted 3 April, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

arXiv:2312.12112 [pdf, other]

Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes

Authors: Nabeel Seedat, Nicolas Huynh, Boris van Breugel, Mihaela van der Schaar

Abstract: Machine Learning (ML) in low-data settings remains an underappreciated yet crucial problem. Hence, data augmentation methods to increase the sample size of datasets needed for ML are key to unlocking the transformative potential of ML in data-deprived regions and domains. Unfortunately, the limited training set constrains traditional tabular synthetic data generators in their ability to generate a… ▽ More Machine Learning (ML) in low-data settings remains an underappreciated yet crucial problem. Hence, data augmentation methods to increase the sample size of datasets needed for ML are key to unlocking the transformative potential of ML in data-deprived regions and domains. Unfortunately, the limited training set constrains traditional tabular synthetic data generators in their ability to generate a large and diverse augmented dataset needed for ML tasks. To address this challenge, we introduce CLLM, which leverages the prior knowledge of Large Language Models (LLMs) for data augmentation in the low-data regime. However, not all the data generated by LLMs will improve downstream utility, as for any generative model. Consequently, we introduce a principled curation mechanism, leveraging learning dynamics, coupled with confidence and uncertainty metrics, to obtain a high-quality dataset. Empirically, on multiple real-world datasets, we demonstrate the superior performance of CLLM in the low-data regime compared to conventional generators. Additionally, we provide insights into the LLM generation and curation mechanism, shedding light on the features that enable them to output high-quality augmented datasets. △ Less

Submitted 30 June, 2024; v1 submitted 19 December, 2023; originally announced December 2023.

Comments: Presented at the 41st International Conference on Machine Learning (ICML) 2024. *Seedat & Huynh contributed equally

arXiv:2310.16524 [pdf, other]

Can You Rely on Your Model Evaluation? Improving Model Evaluation with Synthetic Test Data

Authors: Boris van Breugel, Nabeel Seedat, Fergus Imrie, Mihaela van der Schaar

Abstract: Evaluating the performance of machine learning models on diverse and underrepresented subgroups is essential for ensuring fairness and reliability in real-world applications. However, accurately assessing model performance becomes challenging due to two main issues: (1) a scarcity of test data, especially for small subgroups, and (2) possible distributional shifts in the model's deployment setting… ▽ More Evaluating the performance of machine learning models on diverse and underrepresented subgroups is essential for ensuring fairness and reliability in real-world applications. However, accurately assessing model performance becomes challenging due to two main issues: (1) a scarcity of test data, especially for small subgroups, and (2) possible distributional shifts in the model's deployment setting, which may not align with the available test data. In this work, we introduce 3S Testing, a deep generative modeling framework to facilitate model evaluation by generating synthetic test sets for small subgroups and simulating distributional shifts. Our experiments demonstrate that 3S Testing outperforms traditional baselines -- including real test data alone -- in estimating model performance on minority subgroups and under plausible distributional shifts. In addition, 3S offers intervals around its performance estimates, exhibiting superior coverage of the ground truth compared to existing approaches. Overall, these results raise the question of whether we need a paradigm shift away from limited real test data towards synthetic test data. △ Less

Submitted 25 October, 2023; originally announced October 2023.

Comments: Advances in Neural Information Processing Systems 36 (NeurIPS 2023). Van Breugel & Seedat contributed equally

arXiv:2309.14068 [pdf, other]

Soft Mixture Denoising: Beyond the Expressive Bottleneck of Diffusion Models

Authors: Yangming Li, Boris van Breugel, Mihaela van der Schaar

Abstract: Because diffusion models have shown impressive performances in a number of tasks, such as image synthesis, there is a trend in recent works to prove (with certain assumptions) that these models have strong approximation capabilities. In this paper, we show that current diffusion models actually have an expressive bottleneck in backward denoising and some assumption made by existing theoretical gua… ▽ More Because diffusion models have shown impressive performances in a number of tasks, such as image synthesis, there is a trend in recent works to prove (with certain assumptions) that these models have strong approximation capabilities. In this paper, we show that current diffusion models actually have an expressive bottleneck in backward denoising and some assumption made by existing theoretical guarantees is too strong. Based on this finding, we prove that diffusion models have unbounded errors in both local and global denoising. In light of our theoretical studies, we introduce soft mixture denoising (SMD), an expressive and efficient model for backward denoising. SMD not only permits diffusion models to well approximate any Gaussian mixture distributions in theory, but also is simple and efficient for implementation. Our experiments on multiple image datasets show that SMD significantly improves different types of diffusion models (e.g., DDPM), espeically in the situation of few backward iterations. △ Less

Submitted 18 January, 2024; v1 submitted 25 September, 2023; originally announced September 2023.

Comments: Accepted by ICLR-2024

arXiv:2307.05359 [pdf, other]

The Spherical Grasshopper Problem

Authors: Boris van Breugel

Abstract: The aim of this essay is to better understand the Grasshopper Problem on the surface of the unit sphere. The problem is motivated by analysing Bell inequalities, but can be formulated as a geometric puzzle as follows. Given a white sphere and a bucket of black paint, one is asked to paint half of the sphere, such that antipodal pairs of points are oppositely coloured. A grasshopper lands on the sp… ▽ More The aim of this essay is to better understand the Grasshopper Problem on the surface of the unit sphere. The problem is motivated by analysing Bell inequalities, but can be formulated as a geometric puzzle as follows. Given a white sphere and a bucket of black paint, one is asked to paint half of the sphere, such that antipodal pairs of points are oppositely coloured. A grasshopper lands on the sphere, and jumps a fixed distance in a random direction. How should the sphere be coloured such that the probability of the grasshopper landing on the same colour is maximized? Goulko and Kent have explored this problem on the plane without an antipodality constraint. This essay gives clear indication that the spherical problem with the antipodality constraint yields colourings with similar shapes as the planar problem does. This research has discretised the problem and used a simulated annealing algorithm to search for the optimal solution. Results are consistent with the planar results of \cite{goulkokent}. For $0.10π\leqθ\leq0.44π$ cogwheel solutions are found to be optimal, with odd integer of cogs $n_o$ such that $n_o$ is close to $\frac{2π}θ$. For $0.45π\leqθ\leq 0.55π$ critical solutions are found, in which domains of identical colour decrease in size towards $0.5π$ (moving from either side). For $θ\geq 0.55$ colourings are found consisting of stripes with cogs. Towards $θ=π$ colourings are generated that display just stripes that scale in width with $π-θ$. △ Less

Submitted 8 July, 2023; originally announced July 2023.

arXiv:2305.09235 [pdf, other]

Synthetic data, real errors: how (not) to publish and use synthetic data

Authors: Boris van Breugel, Zhaozhi Qian, Mihaela van der Schaar

Abstract: Generating synthetic data through generative models is gaining interest in the ML community and beyond, promising a future where datasets can be tailored to individual needs. Unfortunately, synthetic data is usually not perfect, resulting in potential errors in downstream tasks. In this work we explore how the generative process affects the downstream ML task. We show that the naive synthetic data… ▽ More Generating synthetic data through generative models is gaining interest in the ML community and beyond, promising a future where datasets can be tailored to individual needs. Unfortunately, synthetic data is usually not perfect, resulting in potential errors in downstream tasks. In this work we explore how the generative process affects the downstream ML task. We show that the naive synthetic data approach -- using synthetic data as if it is real -- leads to downstream models and analyses that do not generalize well to real data. As a first step towards better ML in the synthetic data regime, we introduce Deep Generative Ensemble (DGE) -- a framework inspired by Deep Ensembles that aims to implicitly approximate the posterior distribution over the generative process model parameters. DGE improves downstream model training, evaluation, and uncertainty quantification, vastly outperforming the naive approach on average. The largest improvements are achieved for minority classes and low-density regions of the original data, for which the generative uncertainty is largest. △ Less

Submitted 8 July, 2023; v1 submitted 16 May, 2023; originally announced May 2023.

Comments: Proceedings of the 40th International Conference on Machine Learning (ICML 2023)

arXiv:2304.03722 [pdf, other]

Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data

Authors: Boris van Breugel, Mihaela van der Schaar

Abstract: Generating synthetic data through generative models is gaining interest in the ML community and beyond. In the past, synthetic data was often regarded as a means to private data release, but a surge of recent papers explore how its potential reaches much further than this -- from creating more fair data to data augmentation, and from simulation to text generated by ChatGPT. In this perspective we… ▽ More Generating synthetic data through generative models is gaining interest in the ML community and beyond. In the past, synthetic data was often regarded as a means to private data release, but a surge of recent papers explore how its potential reaches much further than this -- from creating more fair data to data augmentation, and from simulation to text generated by ChatGPT. In this perspective we explore whether, and how, synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs. Just as importantly, we discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data -- the most important of which is quantifying how much we can trust any finding or prediction drawn from synthetic data. △ Less

Submitted 7 April, 2023; originally announced April 2023.

arXiv:2302.12580 [pdf, other]

Membership Inference Attacks against Synthetic Data through Overfitting Detection

Authors: Boris van Breugel, Hao Sun, Zhaozhi Qian, Mihaela van der Schaar

Abstract: Data is the foundation of most science. Unfortunately, sharing data can be obstructed by the risk of violating data privacy, impeding research in fields like healthcare. Synthetic data is a potential solution. It aims to generate data that has the same distribution as the original data, but that does not disclose information about individuals. Membership Inference Attacks (MIAs) are a common priva… ▽ More Data is the foundation of most science. Unfortunately, sharing data can be obstructed by the risk of violating data privacy, impeding research in fields like healthcare. Synthetic data is a potential solution. It aims to generate data that has the same distribution as the original data, but that does not disclose information about individuals. Membership Inference Attacks (MIAs) are a common privacy attack, in which the attacker attempts to determine whether a particular real sample was used for training of the model. Previous works that propose MIAs against generative models either display low performance -- giving the false impression that data is highly private -- or need to assume access to internal generative model parameters -- a relatively low-risk scenario, as the data publisher often only releases synthetic data, not the model. In this work we argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution. We propose DOMIAS, a density-based MIA model that aims to infer membership by targeting local overfitting of the generative model. Experimentally we show that DOMIAS is significantly more successful at MIA than previous work, especially at attacking uncommon samples. The latter is disconcerting since these samples may correspond to underrepresented groups. We also demonstrate how DOMIAS' MIA performance score provides an interpretable metric for privacy, giving data publishers a new tool for achieving the desired privacy-utility trade-off in their synthetic data. △ Less

Submitted 24 February, 2023; originally announced February 2023.

arXiv:2211.06138 [pdf, other]

Practical Approaches for Fair Learning with Multitype and Multivariate Sensitive Attributes

Authors: Tennison Liu, Alex J. Chan, Boris van Breugel, Mihaela van der Schaar

Abstract: It is important to guarantee that machine learning algorithms deployed in the real world do not result in unfairness or unintended social consequences. Fair ML has largely focused on the protection of single attributes in the simpler setting where both attributes and target outcomes are binary. However, the practical application in many a real-world problem entails the simultaneous protection of m… ▽ More It is important to guarantee that machine learning algorithms deployed in the real world do not result in unfairness or unintended social consequences. Fair ML has largely focused on the protection of single attributes in the simpler setting where both attributes and target outcomes are binary. However, the practical application in many a real-world problem entails the simultaneous protection of multiple sensitive attributes, which are often not simply binary, but continuous or categorical. To address this more challenging task, we introduce FairCOCCO, a fairness measure built on cross-covariance operators on reproducing kernel Hilbert Spaces. This leads to two practical tools: first, the FairCOCCO Score, a normalised metric that can quantify fairness in settings with single or multiple sensitive attributes of arbitrary type; and second, a subsequent regularisation term that can be incorporated into arbitrary learning objectives to obtain fair predictors. These contributions address crucial gaps in the algorithmic fairness literature, and we empirically demonstrate consistent improvements against state-of-the-art techniques in balancing predictive power and fairness on real-world datasets. △ Less

Submitted 11 November, 2022; originally announced November 2022.

arXiv:2207.05161 [pdf, other]

What is Flagged in Uncertainty Quantification? Latent Density Models for Uncertainty Categorization

Authors: Hao Sun, Boris van Breugel, Jonathan Crabbe, Nabeel Seedat, Mihaela van der Schaar

Abstract: Uncertainty Quantification (UQ) is essential for creating trustworthy machine learning models. Recent years have seen a steep rise in UQ methods that can flag suspicious examples, however, it is often unclear what exactly these methods identify. In this work, we propose a framework for categorizing uncertain examples flagged by UQ methods in classification tasks. We introduce the confusion density… ▽ More Uncertainty Quantification (UQ) is essential for creating trustworthy machine learning models. Recent years have seen a steep rise in UQ methods that can flag suspicious examples, however, it is often unclear what exactly these methods identify. In this work, we propose a framework for categorizing uncertain examples flagged by UQ methods in classification tasks. We introduce the confusion density matrix -- a kernel-based approximation of the misclassification density -- and use this to categorize suspicious examples identified by a given uncertainty method into three classes: out-of-distribution (OOD) examples, boundary (Bnd) examples, and examples in regions of high in-distribution misclassification (IDM). Through extensive experiments, we show that our framework provides a new and distinct perspective for assessing differences between uncertainty quantification methods, thereby forming a valuable assessment benchmark. △ Less

Submitted 27 October, 2023; v1 submitted 11 July, 2022; originally announced July 2022.

arXiv:2110.12884 [pdf, other]

DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative Networks

Authors: Boris van Breugel, Trent Kyono, Jeroen Berrevoets, Mihaela van der Schaar

Abstract: Machine learning models have been criticized for reflecting unfair biases in the training data. Instead of solving for this by introducing fair learning algorithms directly, we focus on generating fair synthetic data, such that any downstream learner is fair. Generating fair synthetic data from unfair data - while remaining truthful to the underlying data-generating process (DGP) - is non-trivial.… ▽ More Machine learning models have been criticized for reflecting unfair biases in the training data. Instead of solving for this by introducing fair learning algorithms directly, we focus on generating fair synthetic data, such that any downstream learner is fair. Generating fair synthetic data from unfair data - while remaining truthful to the underlying data-generating process (DGP) - is non-trivial. In this paper, we introduce DECAF: a GAN-based fair synthetic data generator for tabular data. With DECAF we embed the DGP explicitly as a structural causal model in the input layers of the generator, allowing each variable to be reconstructed conditioned on its causal parents. This procedure enables inference time debiasing, where biased edges can be strategically removed for satisfying user-defined fairness requirements. The DECAF framework is versatile and compatible with several popular definitions of fairness. In our experiments, we show that DECAF successfully removes undesired bias and - in contrast to existing methods - is capable of generating high-quality synthetic data. Furthermore, we provide theoretical guarantees on the generator's convergence and the fairness of downstream models. △ Less

Submitted 4 November, 2021; v1 submitted 25 October, 2021; originally announced October 2021.

arXiv:2102.08921 [pdf, other]

How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models

Authors: Ahmed M. Alaa, Boris van Breugel, Evgeny Saveliev, Mihaela van der Schaar

Abstract: Devising domain- and model-agnostic evaluation metrics for generative models is an important and as yet unresolved problem. Most existing metrics, which were tailored solely to the image synthesis setup, exhibit a limited capacity for diagnosing the different modes of failure of generative models across broader application domains. In this paper, we introduce a 3-dimensional evaluation metric, (… ▽ More Devising domain- and model-agnostic evaluation metrics for generative models is an important and as yet unresolved problem. Most existing metrics, which were tailored solely to the image synthesis setup, exhibit a limited capacity for diagnosing the different modes of failure of generative models across broader application domains. In this paper, we introduce a 3-dimensional evaluation metric, ($α$-Precision, $β$-Recall, Authenticity), that characterizes the fidelity, diversity and generalization performance of any generative model in a domain-agnostic fashion. Our metric unifies statistical divergence measures with precision-recall analysis, enabling sample- and distribution-level diagnoses of model fidelity and diversity. We introduce generalization as an additional, independent dimension (to the fidelity-diversity trade-off) that quantifies the extent to which a model copies training data -- a crucial performance indicator when modeling sensitive data with requirements on privacy. The three metric components correspond to (interpretable) probabilistic quantities, and are estimated via sample-level binary classification. The sample-level nature of our metric inspires a novel use case which we call model auditing, wherein we judge the quality of individual samples generated by a (black-box) model, discarding low-quality samples and hence improving the overall model performance in a post-hoc manner. △ Less

Submitted 13 July, 2022; v1 submitted 17 February, 2021; originally announced February 2021.

arXiv:2101.09688 [pdf, other]

Stereotype and Skew: Quantifying Gender Bias in Pre-trained and Fine-tuned Language Models

Authors: Daniel de Vassimon Manela, David Errington, Thomas Fisher, Boris van Breugel, Pasquale Minervini

Abstract: This paper proposes two intuitive metrics, skew and stereotype, that quantify and analyse the gender bias present in contextual language models when tackling the WinoBias pronoun resolution task. We find evidence that gender stereotype correlates approximately negatively with gender skew in out-of-the-box models, suggesting that there is a trade-off between these two forms of bias. We investigate… ▽ More This paper proposes two intuitive metrics, skew and stereotype, that quantify and analyse the gender bias present in contextual language models when tackling the WinoBias pronoun resolution task. We find evidence that gender stereotype correlates approximately negatively with gender skew in out-of-the-box models, suggesting that there is a trade-off between these two forms of bias. We investigate two methods to mitigate bias. The first approach is an online method which is effective at removing skew at the expense of stereotype. The second, inspired by previous work on ELMo, involves the fine-tuning of BERT using an augmented gender-balanced dataset. We show that this reduces both skew and stereotype relative to its unaugmented fine-tuned counterpart. However, we find that existing gender bias benchmarks do not fully probe professional bias as pronoun resolution may be obfuscated by cross-correlations from other manifestations of gender prejudice. Our code is available online, at https://github.com/12kleingordon34/NLP_masters_project. △ Less

Submitted 16 February, 2021; v1 submitted 24 January, 2021; originally announced January 2021.

Comments: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021)

Showing 1–15 of 15 results for author: van Breugel, B