-
Keyed Non-Parametric Hypothesis Tests
Authors:
Yao Cheng,
Cheng-Kang Chu,
Hsiao-Ying Lin,
Marius Lombard-Platet,
David Naccache
Abstract:
The recent popularity of machine learning calls for a deeper understanding of AI security. Amongst the numerous AI threats published so far, poisoning attacks currently attract considerable attention. In a poisoning attack, the opponent tampers with part of the dataset used for learning in order to mislead the classifier during the testing phase.
This paper proposes a new protection strategy against poisoning attacks. The technique relies on a new primitive, keyed non-parametric hypothesis tests, which allow one to evaluate, under adversarial conditions, the training input's conformance with a previously learned distribution $\mathfrak{D}$. To do so, we use a secret key $\kappa$ unknown to the opponent.
Keyed non-parametric hypothesis tests differ from classical tests in that the secrecy of $\kappa$ prevents the opponent from misleading the keyed test into concluding that a (significantly) tampered dataset belongs to $\mathfrak{D}$.
Submitted 25 May, 2020;
originally announced May 2020.
-
Approaching Optimal Duplicate Detection in a Sliding Window
Authors:
Rémi Géraud-Stewart,
Marius Lombard-Platet,
David Naccache
Abstract:
Duplicate detection is the problem of identifying whether a given item has previously appeared in a (possibly infinite) stream of data, when only a limited amount of memory is available.
Unfortunately the infinite stream setting is ill-posed, and error rates of duplicate detection filters turn out to be heavily constrained: consequently they appear to provide no advantage, asymptotically, over a biased coin toss [8].
In this paper we formalize the sliding window setting introduced by [13,16], and show that a perfect (zero error) solution can be used up to a maximal window size $w_\text{max}$. Above this threshold we show that some existing duplicate detection filters (designed for the $\textit{non-windowed}$ setting) perform better than those targeting the windowed problem. Finally, we introduce a "queuing construction" that improves on the performance of some duplicate detection filters in the windowed setting.
We also analyse the security of our filters in an adversarial setting.
Submitted 10 May, 2020;
originally announced May 2020.
-
Quotient Hash Tables - Efficiently Detecting Duplicates in Streaming Data
Authors:
Rémi Géraud,
Marius Lombard-Platet,
David Naccache
Abstract:
This article presents the Quotient Hash Table (QHT), a new data structure for duplicate detection in unbounded streams. QHTs stem from a corrected analysis of streaming quotient filters (SQFs), resulting in a 33\% reduction in memory usage for equal performance. We provide a new and thorough analysis of both algorithms, with results of interest to other existing constructions.
We also introduce an optimised version of our new data structure dubbed Queued QHT with Duplicates (QQHTD).
Finally, we discuss the effect of adversarial inputs on hash-based duplicate filters similar to QHT.
Submitted 14 January, 2019;
originally announced January 2019.