-
Joint Selection: Adaptively Incorporating Public Information for Private Synthetic Data
Authors:
Miguel Fuentes,
Brett Mullins,
Ryan McKenna,
Gerome Miklau,
Daniel Sheldon
Abstract:
Mechanisms for generating differentially private synthetic data based on marginals and graphical models have been successful in a wide range of settings. However, one limitation of these methods is their inability to incorporate public data. Initializing a data-generating model by pre-training on public data has been shown to improve the quality of synthetic data, but this technique is not applicable when model structure is not determined a priori. We develop the mechanism jam-pgm, which expands the adaptive measurements framework to jointly select between measuring public data and private data. This technique allows for public data to be included in a graphical-model-based mechanism. We show that jam-pgm is able to outperform both publicly assisted and non-publicly-assisted synthetic data generation mechanisms even when the public data distribution is biased.
Submitted 12 March, 2024;
originally announced March 2024.
-
The Shape of Explanations: A Topological Account of Rule-Based Explanations in Machine Learning
Authors:
Brett Mullins
Abstract:
Rule-based explanations provide simple reasons explaining the behavior of machine learning classifiers at given points in the feature space. Several recent methods (Anchors, LORE, etc.) purport to generate rule-based explanations for arbitrary or black-box classifiers. But what makes these methods work in general? We introduce a topological framework for rule-based explanation methods and provide a characterization of explainability in terms of the definability of a classifier relative to an explanation scheme. We employ this framework to consider various explanation schemes and argue that the preferred scheme depends on how much the user knows about the domain and the probability measure over the feature space.
Submitted 21 January, 2023;
originally announced January 2023.
-
AIM: An Adaptive and Iterative Mechanism for Differentially Private Synthetic Data
Authors:
Ryan McKenna,
Brett Mullins,
Daniel Sheldon,
Gerome Miklau
Abstract:
We propose AIM, a new algorithm for differentially private synthetic data generation. AIM is a workload-adaptive algorithm within the paradigm of algorithms that first select a set of queries, then privately measure those queries, and finally generate synthetic data from the noisy measurements. It uses a set of innovative features to iteratively select the most useful measurements, reflecting both their relevance to the workload and their value in approximating the input data. We also provide analytic expressions to bound per-query error with high probability, which can be used to construct confidence intervals and inform users about the accuracy of generated data. We show empirically that AIM consistently outperforms a wide variety of existing mechanisms across a range of experimental settings.
Submitted 12 June, 2024; v1 submitted 29 January, 2022;
originally announced January 2022.
-
Identifying the Most Explainable Classifier
Authors:
Brett Mullins
Abstract:
We introduce the notion of pointwise coverage to measure the explainability properties of machine learning classifiers. An explanation for a prediction is a definably simple region of the feature space sharing the same label as the prediction, and the coverage of an explanation measures its size or generalizability. With this notion of explanation, we investigate whether or not there is a natural characterization of the most explainable classifier. In accordance with our intuitions, we prove that the binary linear classifier is uniquely the most explainable classifier up to negligible sets.
Submitted 22 October, 2019; v1 submitted 18 October, 2019;
originally announced October 2019.
-
Unsupervised Time Series Extraction from Controller Area Network Payloads
Authors:
Brent J. Stone,
Scott Graham,
Barry Mullins,
Christine Schubert Kabban
Abstract:
This paper introduces a method for unsupervised tokenization of Controller Area Network (CAN) data payloads using bit-level transition analysis and a greedy grouping strategy. The primary goal of this proposal is to extract individual time series which have been concatenated together before transmission onto a vehicle's CAN bus. This process is necessary because the documentation for how to properly extract data from a network may not always be available; passenger vehicle CAN configurations are protected as trade secrets. At least one major manufacturer has also been found to deliberately misconfigure their documented extraction methods. Thus, this proposal serves as a critical enabler for robust third-party security auditing and intrusion detection systems which do not rely on manufacturers sharing confidential information.
Submitted 5 April, 2019;
originally announced April 2019.