Showing 1–50 of 288 results for author: van der Schaar, M

  1. arXiv:2406.17673  [pdf, other]

    cs.LG

    LaTable: Towards Large Tabular Models

    Authors: Boris van Breugel, Jonathan Crabbé, Rob Davis, Mihaela van der Schaar

    Abstract: Tabular data is one of the most ubiquitous modalities, yet the literature on tabular generative foundation models is lagging far behind its text and vision counterparts. Creating such a model is hard, due to the heterogeneous feature spaces of different tabular datasets, tabular metadata (e.g. dataset description and feature headers), and tables lacking prior knowledge (e.g. feature order). In thi…

    Submitted 25 June, 2024; originally announced June 2024.

  2. arXiv:2406.13733  [pdf, other]

    cs.LG cs.AI

    You can't handle the (dirty) truth: Data-centric insights improve pseudo-labeling

    Authors: Nabeel Seedat, Nicolas Huynh, Fergus Imrie, Mihaela van der Schaar

    Abstract: Pseudo-labeling is a popular semi-supervised learning technique to leverage unlabeled data when labeled samples are scarce. The generation and selection of pseudo-labels heavily rely on labeled data. Existing approaches implicitly assume that the labeled data is gold standard and 'perfect'. However, this can be violated in reality with issues such as mislabeling or ambiguity. We address this overl…

    Submitted 19 June, 2024; originally announced June 2024.

    Comments: Published in the Journal of Data-centric Machine Learning Research (DMLR) *Seedat & Huynh contributed equally

  3. arXiv:2406.08414  [pdf, other]

    cs.LG

    Discovering Preference Optimization Algorithms with and for Large Language Models

    Authors: Chris Lu, Samuel Holt, Claudio Fanconi, Alex J. Chan, Jakob Foerster, Mihaela van der Schaar, Robert Tjarko Lange

    Abstract: Offline preference optimization is a key method for enhancing and controlling the quality of Large Language Model (LLM) outputs. Typically, preference optimization is approached as an offline supervised learning task using manually-crafted convex loss functions. While these methods are based on theoretical insights, they are inherently constrained by human creativity, so the large search space of…

    Submitted 12 June, 2024; originally announced June 2024.

  4. arXiv:2406.03258  [pdf, other]

    stat.ML cs.LG

    Relaxed Quantile Regression: Prediction Intervals for Asymmetric Noise

    Authors: Thomas Pouplin, Alan Jeffares, Nabeel Seedat, Mihaela van der Schaar

    Abstract: Constructing valid prediction intervals rather than point estimates is a well-established approach for uncertainty quantification in the regression setting. Models equipped with this capacity output an interval of values in which the ground truth target will fall with some prespecified probability. This is an essential requirement in many real-world applications where simple point predictions' ina…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: Accepted at International Conference on Machine Learning (ICML) 2024

  5. arXiv:2406.02464  [pdf, other]

    cs.LG cs.AI stat.ML

    Meta-Learners for Partially-Identified Treatment Effects Across Multiple Environments

    Authors: Jonas Schweisthal, Dennis Frauen, Mihaela van der Schaar, Stefan Feuerriegel

    Abstract: Estimating the conditional average treatment effect (CATE) from observational data is relevant for many applications such as personalized medicine. Here, we focus on the widespread setting where the observational data come from multiple environments, such as different hospitals, physicians, or countries. Furthermore, we allow for violations of standard causal assumptions, namely, overlap within th…

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: Accepted at ICML 2024

  6. arXiv:2405.15624  [pdf, other]

    cs.LG cs.AI

    Inverse-RLignment: Inverse Reinforcement Learning from Demonstrations for LLM Alignment

    Authors: Hao Sun, Mihaela van der Schaar

    Abstract: Aligning Large Language Models (LLMs) is crucial for enhancing their safety and utility. However, existing methods, primarily based on preference datasets, face challenges such as noisy labels, high annotation costs, and privacy concerns. In this work, we introduce Alignment from Demonstrations (AfD), a novel approach leveraging high-quality demonstration data to overcome these challenges. We form…

    Submitted 24 May, 2024; originally announced May 2024.

  7. arXiv:2405.14021  [pdf, other]

    cs.LG

    A Study of Posterior Stability for Time-Series Latent Diffusion

    Authors: Yangming Li, Mihaela van der Schaar

    Abstract: Latent diffusion has shown promising results in image generation and permits efficient sampling. However, this framework might suffer from the problem of posterior collapse when applied to time series. In this paper, we conduct an impact analysis of this problem. With a theoretical insight, we first explain that posterior collapse reduces latent diffusion to a VAE, making it less expressive. Then,…

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: Paper under review

  8. arXiv:2405.01147  [pdf, other]

    cs.LG

    Why Tabular Foundation Models Should Be a Research Priority

    Authors: Boris van Breugel, Mihaela van der Schaar

    Abstract: Recent text and image foundation models are incredibly impressive, and these models are attracting an ever-increasing portion of research resources. In this position piece we aim to shift the ML research community's priorities ever so slightly to a different modality: tabular data. Tabular data is the dominant modality in many fields, yet it is given hardly any research attention and significantly…

    Submitted 2 June, 2024; v1 submitted 2 May, 2024; originally announced May 2024.

    Comments: Accepted at International Conference on Machine Learning (ICML 2024)

  9. arXiv:2404.09788  [pdf, other]

    cs.LG stat.ML

    Shape Arithmetic Expressions: Advancing Scientific Discovery Beyond Closed-Form Equations

    Authors: Krzysztof Kacprzyk, Mihaela van der Schaar

    Abstract: Symbolic regression has excelled in uncovering equations from physics, chemistry, biology, and related disciplines. However, its effectiveness becomes less certain when applied to experimental data lacking inherent closed-form expressions. Empirically derived relationships, such as entire stress-strain curves, may defy concise closed-form representation, compelling us to explore more adaptive mode…

    Submitted 15 April, 2024; originally announced April 2024.

    Comments: To appear in the Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS) 2024, Valencia, Spain. PMLR: Volume 238

  10. arXiv:2403.10766  [pdf, other]

    cs.LG stat.ME

    ODE Discovery for Longitudinal Heterogeneous Treatment Effects Inference

    Authors: Krzysztof Kacprzyk, Samuel Holt, Jeroen Berrevoets, Zhaozhi Qian, Mihaela van der Schaar

    Abstract: Inferring unbiased treatment effects has received widespread attention in the machine learning community. In recent years, our community has proposed numerous solutions in standard settings, high-dimensional treatment settings, and even longitudinal settings. While very diverse, the solution has mostly relied on neural networks for inference and simultaneous correction of assignment bias. New appr…

    Submitted 15 March, 2024; originally announced March 2024.

    Comments: Published in The Twelfth International Conference on Learning Representations (ICLR). Copyright 2024 by the author(s)

  11. arXiv:2403.04551  [pdf, other]

    cs.LG

    Dissecting Sample Hardness: A Fine-Grained Analysis of Hardness Characterization Methods for Data-Centric AI

    Authors: Nabeel Seedat, Fergus Imrie, Mihaela van der Schaar

    Abstract: Characterizing samples that are difficult to learn from is crucial to developing highly performant ML models. This has led to numerous Hardness Characterization Methods (HCMs) that aim to identify "hard" samples. However, there is a lack of consensus regarding the definition and evaluation of "hardness". Unfortunately, current HCMs have only been evaluated on specific types of hardness and often o…

    Submitted 7 March, 2024; originally announced March 2024.

    Comments: Published at International Conference on Learning Representations (ICLR) 2024

  12. arXiv:2403.00694  [pdf, other]

    stat.ML cs.AI cs.LG stat.ME

    Defining Expertise: Applications to Treatment Effect Estimation

    Authors: Alihan Hüyük, Qiyao Wei, Alicia Curth, Mihaela van der Schaar

    Abstract: Decision-makers are often experts of their domain and take actions based on their domain knowledge. Doctors, for instance, may prescribe treatments by predicting the likely outcome of each available treatment. Actions of an expert thus naturally encode part of their domain knowledge, and can help make inferences within the same domain: Knowing doctors try to prescribe the best treatment for their…

    Submitted 1 March, 2024; originally announced March 2024.

    Comments: The 12th International Conference on Learning Representations (ICLR 2024)

  13. arXiv:2402.17599  [pdf, other]

    cs.LG cs.AI stat.ML

    DAGnosis: Localized Identification of Data Inconsistencies using Structures

    Authors: Nicolas Huynh, Jeroen Berrevoets, Nabeel Seedat, Jonathan Crabbé, Zhaozhi Qian, Mihaela van der Schaar

    Abstract: Identification and appropriate handling of inconsistencies in data at deployment time is crucial to reliably use machine learning models. While recent data-centric methods are able to identify such inconsistencies with respect to the training set, they suffer from two key limitations: (1) suboptimality in settings where features exhibit statistical independencies, due to their usage of compressive…

    Submitted 28 February, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

    Comments: AISTATS 2024; added correspondence email

  14. arXiv:2402.16105  [pdf, other]

    cs.LG

    Informed Meta-Learning

    Authors: Katarzyna Kobalczyk, Mihaela van der Schaar

    Abstract: In noisy and low-data regimes prevalent in real-world applications, a key challenge of machine learning lies in effectively incorporating inductive biases that promote data efficiency and robustness. Meta-learning and informed ML stand out as two approaches for incorporating prior knowledge into ML pipelines. While the former relies on a purely data-driven source of priors, the latter is guided by…

    Submitted 24 May, 2024; v1 submitted 25 February, 2024; originally announced February 2024.

  15. arXiv:2402.07812  [pdf, other]

    cs.CL cs.AI cs.IR cs.LG

    Retrieval-Augmented Thought Process as Sequential Decision Making

    Authors: Thomas Pouplin, Hao Sun, Samuel Holt, Mihaela van der Schaar

    Abstract: Large Language Models (LLMs) have demonstrated their strong ability to assist people and show "sparks of intelligence". However, several open challenges hinder their wider application: such as concerns over privacy, tendencies to produce hallucinations, and difficulties in handling long contexts. In this work, we address those challenges by introducing the Retrieval-Augmented Thought Process (RATP…

    Submitted 12 February, 2024; originally announced February 2024.

    Comments: 17 pages, 18 figures

    ACM Class: H.3.3; I.2.6; I.2.7; I.2.8

  16. arXiv:2402.05933  [pdf, other]

    cs.LG cs.AI

    Time Series Diffusion in the Frequency Domain

    Authors: Jonathan Crabbé, Nicolas Huynh, Jan Stanczuk, Mihaela van der Schaar

    Abstract: Fourier analysis has been an instrumental tool in the development of signal processing. This leads us to wonder whether this framework could similarly benefit generative modelling. In this paper, we explore this question through the scope of time series diffusion models. More specifically, we analyze whether representing time series in the frequency domain is a useful inductive bias for score-base…

    Submitted 8 February, 2024; originally announced February 2024.

    Comments: 27 pages, 12 figures

  17. arXiv:2402.03921  [pdf, other]

    cs.LG cs.AI

    Large Language Models to Enhance Bayesian Optimization

    Authors: Tennison Liu, Nicolás Astorga, Nabeel Seedat, Mihaela van der Schaar

    Abstract: Bayesian optimization (BO) is a powerful approach for optimizing complex and expensive-to-evaluate black-box functions. Its importance is underscored in many applications, notably including hyperparameter tuning, but its efficacy depends on efficiently balancing exploration and exploitation. While there has been substantial progress in BO methods, striking this balance remains a delicate process.…

    Submitted 8 March, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

    Comments: Accepted as Poster at ICLR2024

  18. arXiv:2402.02081  [pdf, other]

    cs.LG

    Risk-Sensitive Diffusion for Perturbation-Robust Optimization

    Authors: Yangming Li, Max Ruiz Luyten, Mihaela van der Schaar

    Abstract: The essence of score-based generative models (SGM) is to optimize a score-based model towards the score function. However, we show that noisy samples incur another objective function, rather than the one with score function, which will wrongly optimize the model. To address this problem, we first consider a new setting where every noisy sample is paired with a risk vector, indicating the data qual…

    Submitted 5 April, 2024; v1 submitted 3 February, 2024; originally announced February 2024.

    Comments: Under review paper

  19. arXiv:2402.01502  [pdf, other]

    stat.ML cs.LG

    Why do Random Forests Work? Understanding Tree Ensembles as Self-Regularizing Adaptive Smoothers

    Authors: Alicia Curth, Alan Jeffares, Mihaela van der Schaar

    Abstract: Despite their remarkable effectiveness and broad application, the drivers of success underlying ensembles of trees are still not fully understood. In this paper, we highlight how interpreting tree ensembles as adaptive and self-regularizing smoothers can provide new intuition and deeper insight to this topic. We use this perspective to show that, when studied as smoothers, randomized tree ensemble…

    Submitted 2 February, 2024; originally announced February 2024.

  20. arXiv:2402.00782  [pdf, other]

    cs.LG

    Dense Reward for Free in Reinforcement Learning from Human Feedback

    Authors: Alex J. Chan, Hao Sun, Samuel Holt, Mihaela van der Schaar

    Abstract: Reinforcement Learning from Human Feedback (RLHF) has been credited as the key advance that has allowed Large Language Models (LLMs) to effectively follow instructions and produce useful assistance. Classically, this involves generating completions from the LLM in response to a query before using a separate reward model to assign a score to the full completion. As an auto-regressive process, the L…

    Submitted 1 February, 2024; originally announced February 2024.

  21. arXiv:2401.17205  [pdf, other]

    stat.ML cs.LG

    Adaptive Experiment Design with Synthetic Controls

    Authors: Alihan Hüyük, Zhaozhi Qian, Mihaela van der Schaar

    Abstract: Clinical trials are typically run in order to understand the effects of a new treatment on a given population of patients. However, patients in large populations rarely respond the same way to the same treatment. This heterogeneity in patient responses necessitates trials that investigate effects on multiple subpopulations - especially when a treatment has marginal or no benefit for the overall po…

    Submitted 9 February, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

    Comments: Proceedings of the 27th International Conference on Artificial Intelligence and Statistics

  22. arXiv:2401.00282  [pdf, other]

    cs.LG

    Deep Generative Symbolic Regression

    Authors: Samuel Holt, Zhaozhi Qian, Mihaela van der Schaar

    Abstract: Symbolic regression (SR) aims to discover concise closed-form mathematical equations from data, a task fundamental to scientific discovery. However, the problem is highly challenging because closed-form equations lie in a complex combinatorial search space. Existing methods, ranging from heuristic search to reinforcement learning, fail to scale with the number of input variables. We make the obser…

    Submitted 30 December, 2023; originally announced January 2024.

    Comments: In the proceedings of the Eleventh International Conference on Learning Representations (ICLR 2023). https://iclr.cc/virtual/2023/poster/11782

    ACM Class: I.2.6; I.2.5

    Journal ref: International Conference on Learning Representations (ICLR), 2023

  23. arXiv:2312.12112  [pdf, other]

    cs.LG cs.AI

    Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes

    Authors: Nabeel Seedat, Nicolas Huynh, Boris van Breugel, Mihaela van der Schaar

    Abstract: Machine Learning (ML) in low-data settings remains an underappreciated yet crucial problem. Hence, data augmentation methods to increase the sample size of datasets needed for ML are key to unlocking the transformative potential of ML in data-deprived regions and domains. Unfortunately, the limited training set constrains traditional tabular synthetic data generators in their ability to generate a…

    Submitted 30 June, 2024; v1 submitted 19 December, 2023; originally announced December 2023.

    Comments: Presented at the 41st International Conference on Machine Learning (ICML) 2024. *Seedat & Huynh contributed equally

  24. arXiv:2312.03666  [pdf, other]

    cs.SD cs.LG eess.AS

    Towards small and accurate convolutional neural networks for acoustic biodiversity monitoring

    Authors: Serge Zaugg, Mike van der Schaar, Florence Erbs, Antonio Sanchez, Joan V. Castell, Emiliano Ramallo, Michel André

    Abstract: Automated classification of animal sounds is a prerequisite for large-scale monitoring of biodiversity. Convolutional Neural Networks (CNNs) are among the most promising algorithms but they are slow, often achieve poor classification in the field and typically require large training data sets. Our objective was to design CNNs that are fast at inference time and achieve good classification performa…

    Submitted 6 December, 2023; originally announced December 2023.

  25. arXiv:2311.16195  [pdf, other]

    cs.LG cs.AI

    A Foundational Framework and Methodology for Personalized Early and Timely Diagnosis

    Authors: Tim Schubert, Richard W Peck, Alexander Gimson, Camelia Davtyan, Mihaela van der Schaar

    Abstract: Early diagnosis of diseases holds the potential for deep transformation in healthcare by enabling better treatment options, improving long-term survival and quality of life, and reducing overall cost. With the advent of medical big data, advances in diagnostic tests as well as in machine learning and statistics, early or timely diagnosis seems within reach. Early diagnosis research often neglects…

    Submitted 26 November, 2023; originally announced November 2023.

    Comments: 10 pages, 2 figures

  26. arXiv:2311.16026  [pdf, other]

    cs.LG stat.ML

    A Neural Framework for Generalized Causal Sensitivity Analysis

    Authors: Dennis Frauen, Fergus Imrie, Alicia Curth, Valentyn Melnychuk, Stefan Feuerriegel, Mihaela van der Schaar

    Abstract: Unobserved confounding is common in many applications, making causal inference from observational data challenging. As a remedy, causal sensitivity analysis is an important tool to draw causal conclusions under unobserved confounding with mathematical guarantees. In this paper, we propose NeuralCSA, a neural framework for generalized causal sensitivity analysis. Unlike previous work, our framework…

    Submitted 9 April, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

    Comments: Accepted at ICLR 2024

  27. arXiv:2311.14110  [pdf, other]

    cs.LG cs.AI

    When is Off-Policy Evaluation Useful? A Data-Centric Perspective

    Authors: Hao Sun, Alex J. Chan, Nabeel Seedat, Alihan Hüyük, Mihaela van der Schaar

    Abstract: Evaluating the value of a hypothetical target policy with only a logged dataset is important but challenging. On the one hand, it brings opportunities for safe policy improvement under high-stakes scenarios like clinical guidelines. On the other hand, such opportunities raise a need for precise off-policy evaluation (OPE). While previous work on OPE focused on improving the algorithm in value esti…

    Submitted 23 November, 2023; originally announced November 2023.

    Comments: Off-Policy Evaluation, Data-Centric AI, Data-Centric Reinforcement Learning, Reinforcement Learning

  28. arXiv:2311.13028  [pdf, other]

    cs.LG cs.AI cs.DC eess.SP

    DMLR: Data-centric Machine Learning Research -- Past, Present and Future

    Authors: Luis Oala, Manil Maskey, Lilith Bat-Leah, Alicia Parrish, Nezihe Merve Gürel, Tzu-Sheng Kuo, Yang Liu, Rotem Dror, Danilo Brajovic, Xiaozhe Yao, Max Bartolo, William A Gaviria Rojas, Ryan Hileman, Rainier Aliment, Michael W. Mahoney, Meg Risdal, Matthew Lease, Wojciech Samek, Debojyoti Dutta, Curtis G Northcutt, Cody Coleman, Braden Hancock, Bernard Koch, Girmaw Abebe Tadesse, Bojan Karlaš , et al. (13 additional authors not shown)

    Abstract: Drawing from discussions at the inaugural DMLR workshop at ICML 2023 and meetings prior, in this report we outline the relevance of community engagement and infrastructure development for the creation of next-generation public datasets that will advance machine learning science. We chart a path forward as a collective effort to sustain the creation and maintenance of these datasets and methods tow…

    Submitted 1 June, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

    Comments: Published in the Journal of Data-centric Machine Learning Research (DMLR) at https://data.mlr.press/assets/pdf/v01-5.pdf

  29. arXiv:2311.10051  [pdf, other]

    cs.LG

    Tabular Few-Shot Generalization Across Heterogeneous Feature Spaces

    Authors: Max Zhu, Katarzyna Kobalczyk, Andrija Petrovic, Mladen Nikolic, Mihaela van der Schaar, Boris Delibasic, Petro Lio

    Abstract: Despite the prevalence of tabular datasets, few-shot learning remains under-explored within this domain. Existing few-shot methods are not directly applicable to tabular datasets due to varying column relationships, meanings, and permutational invariance. To address these challenges, we propose FLAT-a novel approach to tabular few-shot learning, encompassing knowledge sharing between datasets with…

    Submitted 16 November, 2023; originally announced November 2023.

    Comments: Tabular learning, Deep learning, Few shot learning

  30. arXiv:2311.07426  [pdf, other]

    cs.LG cs.CV cs.HC

    Optimising Human-AI Collaboration by Learning Convincing Explanations

    Authors: Alex J. Chan, Alihan Hüyük, Mihaela van der Schaar

    Abstract: Machine learning models are being increasingly deployed to take, or assist in taking, complicated and high-impact decisions, from quasi-autonomous vehicles to clinical decision support systems. This poses challenges, particularly when models have hard-to-detect failure modes and are able to take actions without oversight. In order to handle this challenge, we propose a method for a collaborative s…

    Submitted 13 November, 2023; originally announced November 2023.

  31. arXiv:2311.01489  [pdf, other]

    stat.ML cs.LG

    Invariant Causal Imitation Learning for Generalizable Policies

    Authors: Ioana Bica, Daniel Jarrett, Mihaela van der Schaar

    Abstract: Consider learning an imitation policy on the basis of demonstrated behavior from multiple environments, with an eye towards deployment in an unseen environment. Since the observable features from each setting may be different, directly learning individual policies as mappings from features to actions is prone to spurious correlations -- and may not generalize well. However, the expert's policy is…

    Submitted 2 November, 2023; originally announced November 2023.

    Journal ref: In Proc. 35th International Conference on Neural Information Processing Systems (NeurIPS 2021)

  32. arXiv:2311.01388  [pdf, other]

    stat.ML cs.LG

    Time-series Generation by Contrastive Imitation

    Authors: Daniel Jarrett, Ioana Bica, Mihaela van der Schaar

    Abstract: Consider learning a generative model for time-series data. The sequential setting poses a unique challenge: Not only should the generator capture the conditional dynamics of (stepwise) transitions, but its open-loop rollouts should also preserve the joint distribution of (multi-step) trajectories. On one hand, autoregressive models trained by MLE allow learning and computing explicit transition di…

    Submitted 2 November, 2023; originally announced November 2023.

    Journal ref: In Proc. 35th International Conference on Neural Information Processing Systems (NeurIPS 2021)

  33. arXiv:2310.19831  [pdf, other]

    stat.ML cs.LG

    Explaining by Imitating: Understanding Decisions by Interpretable Policy Learning

    Authors: Alihan Hüyük, Daniel Jarrett, Mihaela van der Schaar

    Abstract: Understanding human behavior from observed data is critical for transparency and accountability in decision-making. Consider real-world settings such as healthcare, in which modeling a decision-maker's policy is challenging -- with no access to underlying states, no knowledge of environment dynamics, and no allowance for live experimentation. We desire learning a data-driven representation of deci…

    Submitted 28 October, 2023; originally announced October 2023.

    Journal ref: In Proc. 9th International Conference on Learning Representations (ICLR 2021)

  34. arXiv:2310.18988  [pdf, other]

    stat.ML cs.LG

    A U-turn on Double Descent: Rethinking Parameter Counting in Statistical Learning

    Authors: Alicia Curth, Alan Jeffares, Mihaela van der Schaar

    Abstract: Conventional statistical wisdom established a well-understood relationship between model complexity and prediction error, typically presented as a U-shaped curve reflecting a transition between under- and overfitting regimes. However, motivated by the success of overparametrized neural networks, recent influential work has suggested this theory to be generally incomplete, introducing an additional…

    Submitted 29 October, 2023; originally announced October 2023.

    Comments: To appear in the Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

  35. arXiv:2310.18970  [pdf, other]

    cs.LG

    TRIAGE: Characterizing and auditing training data for improved regression

    Authors: Nabeel Seedat, Jonathan Crabbé, Zhaozhi Qian, Mihaela van der Schaar

    Abstract: Data quality is crucial for robust machine learning algorithms, with the recent interest in data-centric AI emphasizing the importance of training data characterization. However, current data characterization methods are largely focused on classification settings, with regression settings largely understudied. To address this, we introduce TRIAGE, a novel data characterization framework tailored t…

    Submitted 29 October, 2023; originally announced October 2023.

    Comments: Presented at NeurIPS 2023

  36. arXiv:2310.18688  [pdf, other]

    cs.LG

    Clairvoyance: A Pipeline Toolkit for Medical Time Series

    Authors: Daniel Jarrett, Jinsung Yoon, Ioana Bica, Zhaozhi Qian, Ari Ercole, Mihaela van der Schaar

    Abstract: Time-series learning is the bread and butter of data-driven *clinical decision support*, and the recent explosion in ML research has demonstrated great potential in various healthcare settings. At the same time, medical time-series problems in the wild are challenging due to their highly *composite* nature: They entail design choices and interactions among components that preprocess data, impute m…

    Submitted 28 October, 2023; originally announced October 2023.

    Journal ref: In Proc. 9th International Conference on Learning Representations (ICLR 2021)

  37. arXiv:2310.18601  [pdf, other]

    stat.ML cs.LG

    Online Decision Mediation

    Authors: Daniel Jarrett, Alihan Hüyük, Mihaela van der Schaar

    Abstract: Consider learning a decision support assistant to serve as an intermediary between (oracle) expert behavior and (imperfect) human behavior: At each time, the algorithm observes an action chosen by a fallible agent, and decides whether to *accept* that agent's decision, *intervene* with an alternative, or *request* the expert's opinion. For instance, in clinical diagnosis, fully-autonomous machine…

    Submitted 28 October, 2023; originally announced October 2023.

    Journal ref: In Proc. 36th International Conference on Neural Information Processing Systems (NeurIPS 2022)

  38. arXiv:2310.18591  [pdf, other]

    stat.ML cs.LG

    Inverse Decision Modeling: Learning Interpretable Representations of Behavior

    Authors: Daniel Jarrett, Alihan Hüyük, Mihaela van der Schaar

    Abstract: Decision analysis deals with modeling and enhancing decision processes. A principal challenge in improving behavior is in obtaining a transparent description of existing behavior in the first place. In this paper, we develop an expressive, unifying perspective on inverse decision modeling: a framework for learning parameterized representations of sequential decision behavior. First, we formalize t…

    Submitted 28 October, 2023; originally announced October 2023.

    Journal ref: In Proc. 38th International Conference on Machine Learning (ICML 2021)

  39. arXiv:2310.16981  [pdf, other]

    cs.LG

    Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark

    Authors: Lasse Hansen, Nabeel Seedat, Mihaela van der Schaar, Andrija Petrovic

    Abstract: Synthetic data serves as an alternative in training machine learning models, particularly when real-world data is limited or inaccessible. However, ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task. This paper addresses this issue by exploring the potential of integrating data-centric AI techniques which profile the data to guide the synthetic data g…

    Submitted 25 October, 2023; originally announced October 2023.

    Comments: Presented at NeurIPS 2023 (Datasets & Benchmarks). *Hansen & Seedat contributed equally

  40. arXiv:2310.16524  [pdf, other]

    cs.LG

    Can You Rely on Your Model Evaluation? Improving Model Evaluation with Synthetic Test Data

    Authors: Boris van Breugel, Nabeel Seedat, Fergus Imrie, Mihaela van der Schaar

    Abstract: Evaluating the performance of machine learning models on diverse and underrepresented subgroups is essential for ensuring fairness and reliability in real-world applications. However, accurately assessing model performance becomes challenging due to two main issues: (1) a scarcity of test data, especially for small subgroups, and (2) possible distributional shifts in the model's deployment setting…

    Submitted 25 October, 2023; originally announced October 2023.

    Comments: Advances in Neural Information Processing Systems 36 (NeurIPS 2023). Van Breugel & Seedat contributed equally

  41. arXiv:2310.07747  [pdf, other]

    cs.LG cs.AI cs.RO eess.SY

    Accountability in Offline Reinforcement Learning: Explaining Decisions with a Corpus of Examples

    Authors: Hao Sun, Alihan Hüyük, Daniel Jarrett, Mihaela van der Schaar

    Abstract: Learning controllers with offline data in decision-making systems is an essential area of research due to its potential to reduce the risk of applications in real-world systems. However, in responsibility-sensitive settings such as healthcare, decision accountability is of paramount importance, yet has not been adequately addressed by the literature. This paper introduces the Accountable Offline C…

    Submitted 27 October, 2023; v1 submitted 11 October, 2023; originally announced October 2023.

  42. arXiv:2310.03560  [pdf, other]

    cs.CL

    Redefining Digital Health Interfaces with Large Language Models

    Authors: Fergus Imrie, Paulius Rauba, Mihaela van der Schaar

    Abstract: Digital health tools have the potential to significantly improve the delivery of healthcare services. However, their adoption remains comparatively limited due, in part, to challenges surrounding usability and trust. Large Language Models (LLMs) have emerged as general-purpose models with the ability to process complex information and produce human-quality text, presenting a wealth of potential ap…

    Submitted 29 February, 2024; v1 submitted 5 October, 2023; originally announced October 2023.

  43. arXiv:2310.02003  [pdf, other]

    cs.SE cs.AI cs.LG cs.PL

    L2MAC: Large Language Model Automatic Computer for Extensive Code Generation

    Authors: Samuel Holt, Max Ruiz Luyten, Mihaela van der Schaar

    Abstract: Transformer-based large language models (LLMs) are constrained by the fixed context window of the underlying transformer architecture, hindering their ability to produce long and coherent outputs. Memory-augmented LLMs are a promising solution, but current approaches cannot handle long output generation tasks since they (1) only focus on reading memory and reduce its evolution to the concatenation…

    Submitted 10 April, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

    Comments: Published in The Twelfth International Conference on Learning Representations (ICLR), 2024. Copyright 2023 by the author(s)

    ACM Class: I.2.7; I.2.6; I.2.5; D.2.2; D.2.3; D.3.4

  44. arXiv:2309.14068  [pdf, other]

    cs.LG cs.CV

    Soft Mixture Denoising: Beyond the Expressive Bottleneck of Diffusion Models

    Authors: Yangming Li, Boris van Breugel, Mihaela van der Schaar

    Abstract: Because diffusion models have shown impressive performances in a number of tasks, such as image synthesis, there is a trend in recent works to prove (with certain assumptions) that these models have strong approximation capabilities. In this paper, we show that current diffusion models actually have an expressive bottleneck in backward denoising and some assumption made by existing theoretical gua…

    Submitted 18 January, 2024; v1 submitted 25 September, 2023; originally announced September 2023.

    Comments: Accepted by ICLR-2024

  45. arXiv:2309.06553  [pdf, other]

    cs.CL cs.AI cs.LG

    Query-Dependent Prompt Evaluation and Optimization with Offline Inverse RL

    Authors: Hao Sun, Alihan Hüyük, Mihaela van der Schaar

    Abstract: In this study, we aim to enhance the arithmetic reasoning ability of Large Language Models (LLMs) through zero-shot prompt optimization. We identify a previously overlooked objective of query dependency in such optimization and elucidate two ensuing challenges that impede the successful and economical design of prompt optimization techniques. One primary issue is the absence of an effective method…

    Submitted 7 March, 2024; v1 submitted 12 September, 2023; originally announced September 2023.

  46. arXiv:2308.05021  [pdf, other]

    cs.LG cs.CV

    On Error Propagation of Diffusion Models

    Authors: Yangming Li, Mihaela van der Schaar

    Abstract: Although diffusion models (DMs) have shown promising performances in a number of tasks (e.g., speech synthesis and image generation), they might suffer from error propagation because of their sequential structure. However, this is not certain because some sequential models, such as Conditional Random Field (CRF), are free from this problem. To address this issue, we develop a theoretical framework…

    Submitted 18 January, 2024; v1 submitted 9 August, 2023; originally announced August 2023.

    Comments: Accepted by ICLR-2024

  47. arXiv:2306.05052  [pdf, other]

    cs.LG cs.AI cs.CL

    Interpretable Medical Diagnostics with Structured Data Extraction by Large Language Models

    Authors: Aleksa Bisercic, Mladen Nikolic, Mihaela van der Schaar, Boris Delibasic, Pietro Lio, Andrija Petrovic

    Abstract: Tabular data is often hidden in text, particularly in medical diagnostic reports. Traditional machine learning (ML) models designed to work with tabular data, cannot effectively process information in such form. On the other hand, large language models (LLMs) which excel at textual tasks, are probably not the best tool for modeling tabular data. Therefore, we propose a novel, simple, and effective…

    Submitted 8 June, 2023; originally announced June 2023.

  48. arXiv:2306.04663  [pdf, ps, other]

    eess.SP cs.LG

    U-PASS: an Uncertainty-guided deep learning Pipeline for Automated Sleep Staging

    Authors: Elisabeth R. M. Heremans, Nabeel Seedat, Bertien Buyse, Dries Testelmans, Mihaela van der Schaar, Maarten De Vos

    Abstract: As machine learning becomes increasingly prevalent in critical fields such as healthcare, ensuring the safety and reliability of machine learning systems becomes paramount. A key component of reliability is the ability to estimate uncertainty, which enables the identification of areas of high and low confidence and helps to minimize the risk of error. In this study, we propose a machine learning p…

    Submitted 7 June, 2023; originally announced June 2023.

  49. arXiv:2306.04255  [pdf, other]

    stat.ML cs.LG

    Accounting For Informative Sampling When Learning to Forecast Treatment Outcomes Over Time

    Authors: Toon Vanderschueren, Alicia Curth, Wouter Verbeke, Mihaela van der Schaar

    Abstract: Machine learning (ML) holds great potential for accurately forecasting treatment outcomes over time, which could ultimately enable the adoption of more individualized treatment strategies in many practical applications. However, a significant challenge that has been largely overlooked by the ML literature on this topic is the presence of informative sampling in observational data. When instances a…

    Submitted 7 June, 2023; originally announced June 2023.

    Comments: To appear in the Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023

  50. arXiv:2305.19726  [pdf, other]

    cs.LG

    Learning Representations without Compositional Assumptions

    Authors: Tennison Liu, Jeroen Berrevoets, Zhaozhi Qian, Mihaela van der Schaar

    Abstract: This paper addresses unsupervised representation learning on tabular data containing multiple views generated by distinct sources of measurement. Traditional methods, which tackle this problem using the multi-view framework, are constrained by predefined assumptions that assume feature sets share the same information and representations should learn globally shared factors. However, this assumptio…

    Submitted 31 May, 2023; originally announced May 2023.