
Showing 1–6 of 6 results for author: Wilde, H

  1. arXiv:2203.01363  [pdf, other]

    cs.LG stat.AP

    Faking feature importance: A cautionary tale on the use of differentially-private synthetic data

    Authors: Oscar Giles, Kasra Hosseini, Grigorios Mingas, Oliver Strickson, Louise Bowler, Camila Rangel Smith, Harrison Wilde, Jen Ning Lim, Bilal Mateen, Kasun Amarasinghe, Rayid Ghani, Alison Heppenstall, Nik Lomax, Nick Malleson, Martin O'Reilly, Sebastian Vollmer

    Abstract: Synthetic datasets are often presented as a silver-bullet solution to the problem of privacy-preserving data publishing. However, for many applications, synthetic data has been shown to have limited utility when used to train predictive models. One promising potential application of these data is in the exploratory phase of the machine learning workflow, which involves understanding, engineering a…

    Submitted 2 March, 2022; originally announced March 2022.

    Comments: 27 pages, 8 figures

  2. arXiv:2108.10934  [pdf, other]

    stat.ML cs.CR cs.LG

    Mitigating Statistical Bias within Differentially Private Synthetic Data

    Authors: Sahra Ghalebikesabi, Harrison Wilde, Jack Jewson, Arnaud Doucet, Sebastian Vollmer, Chris Holmes

    Abstract: Increasing interest in privacy-preserving machine learning has led to new and evolved approaches for generating private synthetic data from undisclosed real data. However, mechanisms of privacy preservation can significantly reduce the utility of synthetic data, which in turn impacts downstream tasks such as learning predictive models or inference. We propose several re-weighting strategies using…

    Submitted 19 May, 2022; v1 submitted 24 August, 2021; originally announced August 2021.

  3. arXiv:2011.08299  [pdf, other]

    cs.LG stat.AP stat.ME stat.ML

    Foundations of Bayesian Learning from Synthetic Data

    Authors: Harrison Wilde, Jack Jewson, Sebastian Vollmer, Chris Holmes

    Abstract: There is significant growth and interest in the use of synthetic data as an enabler for machine learning in environments where the release of real data is restricted due to privacy or availability constraints. Despite a large number of methods for synthetic data generation, there are comparatively few results on the statistical properties of models learnt on synthetic data, and fewer still for sit…

    Submitted 24 November, 2020; v1 submitted 16 November, 2020; originally announced November 2020.

    Comments: 43 pages (10 main text, 33 supplement), 32 figures (4 main text, 28 supplement)

  4. arXiv:2008.04295  [pdf, other]

    stat.AP stat.ML

    Segmentation analysis and the recovery of queuing parameters via the Wasserstein distance: a study of administrative data for patients with chronic obstructive pulmonary disease

    Authors: Henry Wilde, Vincent Knight, Jonathan Gillard, Kendal Smith

    Abstract: This work uses a data-driven approach to analyse how the resource requirements of patients with chronic obstructive pulmonary disease (COPD) may change, quantifying how those changes impact the hospital system with which the patients interact. This approach is composed of a novel combination of often distinct modes of analysis: segmentation, operational queuing theory, and the recovery of paramete…

    Submitted 14 August, 2020; v1 submitted 10 August, 2020; originally announced August 2020.

    Comments: 24 pages, 11 figures (19 including subfigures)

  5. arXiv:2007.15326  [pdf, other]

    stat.AP stat.ML

    A Recommendation and Risk Classification System for Connecting Rough Sleepers to Essential Outreach Services

    Authors: Harrison Wilde, Lucia Lushi Chen, Austin Nguyen, Zoe Kimpel, Joshua Sidgwick, Adolfo De Unanue, Davide Veronese, Bilal Mateen, Rayid Ghani, Sebastian Vollmer

    Abstract: Rough sleeping is a chronic problem faced by some of the most disadvantaged people in modern society. This paper describes work carried out in partnership with Homeless Link, a UK-based charity, in developing a data-driven approach to assess the quality of incoming alerts from members of the public aimed at connecting people sleeping rough on the streets with outreach service providers. Alerts are…

    Submitted 30 July, 2020; originally announced July 2020.

    Comments: 10 pages, 5 figures, 5 tables

  6. arXiv:2002.02701  [pdf, other]

    cs.LG cs.GT stat.ML

    A novel initialisation based on hospital-resident assignment for the k-modes algorithm

    Authors: Henry Wilde, Vincent Knight, Jonathan Gillard

    Abstract: This paper presents a new way of selecting an initial solution for the k-modes algorithm that allows for a notion of mathematical fairness and a leverage of the data that the common initialisations from literature do not. The method, which utilises the Hospital-Resident Assignment Problem to find the set of initial cluster centroids, is compared with the current initialisations on both benchmark d…

    Submitted 7 February, 2020; originally announced February 2020.

    Comments: 23 pages, 11 figures (31 panels)