-
A Strong Direct Sum Theorem for Distributional Query Complexity
Authors:
Guy Blanc,
Caleb Koch,
Carmen Strassle,
Li-Yang Tan
Abstract:
Consider the expected query complexity of computing the $k$-fold direct product $f^{\otimes k}$ of a function $f$ to error $\varepsilon$ with respect to a distribution $μ^k$. One strategy is to sequentially compute each of the $k$ copies to error $\varepsilon/k$ with respect to $μ$ and apply the union bound. We prove a strong direct sum theorem showing that this naive strategy is essentially optim…
▽ More
Consider the expected query complexity of computing the $k$-fold direct product $f^{\otimes k}$ of a function $f$ to error $\varepsilon$ with respect to a distribution $μ^k$. One strategy is to sequentially compute each of the $k$ copies to error $\varepsilon/k$ with respect to $μ$ and apply the union bound. We prove a strong direct sum theorem showing that this naive strategy is essentially optimal. In particular, computing a direct product necessitates a blowup in both query complexity and error.
Strong direct sum theorems contrast with results that only show a blowup in query complexity or error but not both. There has been a long line of such results for distributional query complexity, dating back to (Impagliazzo, Raz, Wigderson 1994) and (Nisan, Rudich, Saks 1994), but a strong direct sum theorem had been elusive.
A key idea in our work is the first use of the Hardcore Theorem (Impagliazzo 1995) in the context of query complexity. We prove a new "resilience lemma" that accompanies it, showing that the hardcore of $f^{\otimes k}$ is likely to remain dense under arbitrary partitions of the input space.
△ Less
Submitted 25 May, 2024;
originally announced May 2024.
-
Harnessing the Power of Choices in Decision Tree Learning
Authors:
Guy Blanc,
Jane Lange,
Chirag Pabbaraju,
Colin Sullivan,
Li-Yang Tan,
Mo Tiwari
Abstract:
We propose a simple generalization of standard and empirically successful decision tree learning algorithms such as ID3, C4.5, and CART. These algorithms, which have been central to machine learning for decades, are greedy in nature: they grow a decision tree by iteratively splitting on the best attribute. Our algorithm, Top-$k$, considers the $k$ best attributes as possible splits instead of just…
▽ More
We propose a simple generalization of standard and empirically successful decision tree learning algorithms such as ID3, C4.5, and CART. These algorithms, which have been central to machine learning for decades, are greedy in nature: they grow a decision tree by iteratively splitting on the best attribute. Our algorithm, Top-$k$, considers the $k$ best attributes as possible splits instead of just the single best attribute. We demonstrate, theoretically and empirically, the power of this simple generalization. We first prove a {\sl greediness hierarchy theorem} showing that for every $k \in \mathbb{N}$, Top-$(k+1)$ can be dramatically more powerful than Top-$k$: there are data distributions for which the former achieves accuracy $1-\varepsilon$, whereas the latter only achieves accuracy $\frac1{2}+\varepsilon$. We then show, through extensive experiments, that Top-$k$ outperforms the two main approaches to decision tree learning: classic greedy algorithms and more recent "optimal decision tree" algorithms. On one hand, Top-$k$ consistently enjoys significant accuracy gains over greedy algorithms across a wide range of benchmarks. On the other hand, Top-$k$ is markedly more scalable than optimal decision tree algorithms and is able to handle dataset and feature set sizes that remain far beyond the reach of these algorithms.
△ Less
Submitted 25 October, 2023; v1 submitted 2 October, 2023;
originally announced October 2023.
-
A Strong Composition Theorem for Junta Complexity and the Boosting of Property Testers
Authors:
Guy Blanc,
Caleb Koch,
Carmen Strassle,
Li-Yang Tan
Abstract:
We prove a strong composition theorem for junta complexity and show how such theorems can be used to generically boost the performance of property testers.
The $\varepsilon$-approximate junta complexity of a function $f$ is the smallest integer $r$ such that $f$ is $\varepsilon$-close to a function that depends only on $r$ variables. A strong composition theorem states that if $f$ has large…
▽ More
We prove a strong composition theorem for junta complexity and show how such theorems can be used to generically boost the performance of property testers.
The $\varepsilon$-approximate junta complexity of a function $f$ is the smallest integer $r$ such that $f$ is $\varepsilon$-close to a function that depends only on $r$ variables. A strong composition theorem states that if $f$ has large $\varepsilon$-approximate junta complexity, then $g \circ f$ has even larger $\varepsilon'$-approximate junta complexity, even for $\varepsilon' \gg \varepsilon$. We develop a fairly complete understanding of this behavior, proving that the junta complexity of $g \circ f$ is characterized by that of $f$ along with the multivariate noise sensitivity of $g$. For the important case of symmetric functions $g$, we relate their multivariate noise sensitivity to the simpler and well-studied case of univariate noise sensitivity.
We then show how strong composition theorems yield boosting algorithms for property testers: with a strong composition theorem for any class of functions, a large-distance tester for that class is immediately upgraded into one for small distances. Combining our contributions yields a booster for junta testers, and with it new implications for junta testing. This is the first boosting-type result in property testing, and we hope that the connection to composition theorems adds compelling motivation to the study of both topics.
△ Less
Submitted 8 July, 2023;
originally announced July 2023.
-
Lifting uniform learners via distributional decomposition
Authors:
Guy Blanc,
Jane Lange,
Ali Malik,
Li-Yang Tan
Abstract:
We show how any PAC learning algorithm that works under the uniform distribution can be transformed, in a blackbox fashion, into one that works under an arbitrary and unknown distribution $\mathcal{D}$. The efficiency of our transformation scales with the inherent complexity of $\mathcal{D}$, running in $\mathrm{poly}(n, (md)^d)$ time for distributions over $\{\pm 1\}^n$ whose pmfs are computed by…
▽ More
We show how any PAC learning algorithm that works under the uniform distribution can be transformed, in a blackbox fashion, into one that works under an arbitrary and unknown distribution $\mathcal{D}$. The efficiency of our transformation scales with the inherent complexity of $\mathcal{D}$, running in $\mathrm{poly}(n, (md)^d)$ time for distributions over $\{\pm 1\}^n$ whose pmfs are computed by depth-$d$ decision trees, where $m$ is the sample complexity of the original algorithm. For monotone distributions our transformation uses only samples from $\mathcal{D}$, and for general ones it uses subcube conditioning samples.
A key technical ingredient is an algorithm which, given the aforementioned access to $\mathcal{D}$, produces an optimal decision tree decomposition of $\mathcal{D}$: an approximation of $\mathcal{D}$ as a mixture of uniform distributions over disjoint subcubes. With this decomposition in hand, we run the uniform-distribution learner on each subcube and combine the hypotheses using the decision tree. This algorithmic decomposition lemma also yields new algorithms for learning decision tree distributions with runtimes that exponentially improve on the prior state of the art -- results of independent interest in distribution learning.
△ Less
Submitted 29 March, 2023; v1 submitted 27 March, 2023;
originally announced March 2023.
-
Subsampling Suffices for Adaptive Data Analysis
Authors:
Guy Blanc
Abstract:
Ensuring that analyses performed on a dataset are representative of the entire population is one of the central problems in statistics. Most classical techniques assume that the dataset is independent of the analyst's query and break down in the common setting where a dataset is reused for multiple, adaptively chosen, queries. This problem of \emph{adaptive data analysis} was formalized in the sem…
▽ More
Ensuring that analyses performed on a dataset are representative of the entire population is one of the central problems in statistics. Most classical techniques assume that the dataset is independent of the analyst's query and break down in the common setting where a dataset is reused for multiple, adaptively chosen, queries. This problem of \emph{adaptive data analysis} was formalized in the seminal works of Dwork et al. (STOC, 2015) and Hardt and Ullman (FOCS, 2014).
We identify a remarkably simple set of assumptions under which the queries will continue to be representative even when chosen adaptively: The only requirements are that each query takes as input a random subsample and outputs few bits. This result shows that the noise inherent in subsampling is sufficient to guarantee that query responses generalize. The simplicity of this subsampling-based framework allows it to model a variety of real-world scenarios not covered by prior work.
In addition to its simplicity, we demonstrate the utility of this framework by designing mechanisms for two foundational tasks, statistical queries and median finding. In particular, our mechanism for answering the broadly applicable class of statistical queries is both extremely simple and state of the art in many parameter regimes.
△ Less
Submitted 20 September, 2023; v1 submitted 16 February, 2023;
originally announced February 2023.
-
Certification with an NP Oracle
Authors:
Guy Blanc,
Caleb Koch,
Jane Lange,
Carmen Strassle,
Li-Yang Tan
Abstract:
In the certification problem, the algorithm is given a function $f$ with certificate complexity $k$ and an input $x^\star$, and the goal is to find a certificate of size $\le \text{poly}(k)$ for $f$'s value at $x^\star$. This problem is in $\mathsf{NP}^{\mathsf{NP}}$, and assuming $\mathsf{P} \ne \mathsf{NP}$, is not in $\mathsf{P}$. Prior works, dating back to Valiant in 1984, have therefore soug…
▽ More
In the certification problem, the algorithm is given a function $f$ with certificate complexity $k$ and an input $x^\star$, and the goal is to find a certificate of size $\le \text{poly}(k)$ for $f$'s value at $x^\star$. This problem is in $\mathsf{NP}^{\mathsf{NP}}$, and assuming $\mathsf{P} \ne \mathsf{NP}$, is not in $\mathsf{P}$. Prior works, dating back to Valiant in 1984, have therefore sought to design efficient algorithms by imposing assumptions on $f$ such as monotonicity.
Our first result is a $\mathsf{BPP}^{\mathsf{NP}}$ algorithm for the general problem. The key ingredient is a new notion of the balanced influence of variables, a natural variant of influence that corrects for the bias of the function. Balanced influences can be accurately estimated via uniform generation, and classic $\mathsf{BPP}^{\mathsf{NP}}$ algorithms are known for the latter task.
We then consider certification with stricter instance-wise guarantees: for each $x^\star$, find a certificate whose size scales with that of the smallest certificate for $x^\star$. In sharp contrast with our first result, we show that this problem is $\mathsf{NP}^{\mathsf{NP}}$-hard even to approximate. We obtain an optimal inapproximability ratio, adding to a small handful of problems in the higher levels of the polynomial hierarchy for which optimal inapproximability is known. Our proof involves the novel use of bit-fixing dispersers for gap amplification.
△ Less
Submitted 4 November, 2022;
originally announced November 2022.
-
Multitask Learning via Shared Features: Algorithms and Hardness
Authors:
Konstantina Bairaktari,
Guy Blanc,
Li-Yang Tan,
Jonathan Ullman,
Lydia Zakynthinou
Abstract:
We investigate the computational efficiency of multitask learning of Boolean functions over the $d$-dimensional hypercube, that are related by means of a feature representation of size $k \ll d$ shared across all tasks. We present a polynomial time multitask learning algorithm for the concept class of halfspaces with margin $γ$, which is based on a simultaneous boosting technique and requires only…
▽ More
We investigate the computational efficiency of multitask learning of Boolean functions over the $d$-dimensional hypercube, that are related by means of a feature representation of size $k \ll d$ shared across all tasks. We present a polynomial time multitask learning algorithm for the concept class of halfspaces with margin $γ$, which is based on a simultaneous boosting technique and requires only $\textrm{poly}(k/γ)$ samples-per-task and $\textrm{poly}(k\log(d)/γ)$ samples in total.
In addition, we prove a computational separation, showing that assuming there exists a concept class that cannot be learned in the attribute-efficient model, we can construct another concept class such that can be learned in the attribute-efficient model, but cannot be multitask learned efficiently -- multitask learning this concept class either requires super-polynomial time complexity or a much larger total number of samples.
△ Less
Submitted 7 September, 2022;
originally announced September 2022.
-
A Query-Optimal Algorithm for Finding Counterfactuals
Authors:
Guy Blanc,
Caleb Koch,
Jane Lange,
Li-Yang Tan
Abstract:
We design an algorithm for finding counterfactuals with strong theoretical guarantees on its performance. For any monotone model $f : X^d \to \{0,1\}$ and instance $x^\star$, our algorithm makes \[ {S(f)^{O(Δ_f(x^\star))}\cdot \log d}\] queries to $f$ and returns {an {\sl optimal}} counterfactual for $x^\star$: a nearest instance $x'$ to $x^\star$ for which $f(x')\ne f(x^\star)$. Here $S(f)$ is th…
▽ More
We design an algorithm for finding counterfactuals with strong theoretical guarantees on its performance. For any monotone model $f : X^d \to \{0,1\}$ and instance $x^\star$, our algorithm makes \[ {S(f)^{O(Δ_f(x^\star))}\cdot \log d}\] queries to $f$ and returns {an {\sl optimal}} counterfactual for $x^\star$: a nearest instance $x'$ to $x^\star$ for which $f(x')\ne f(x^\star)$. Here $S(f)$ is the sensitivity of $f$, a discrete analogue of the Lipschitz constant, and $Δ_f(x^\star)$ is the distance from $x^\star$ to its nearest counterfactuals. The previous best known query complexity was $d^{\,O(Δ_f(x^\star))}$, achievable by brute-force local search. We further prove a lower bound of $S(f)^{Ω(Δ_f(x^\star))} + Ω(\log d)$ on the query complexity of any algorithm, thereby showing that the guarantees of our algorithm are essentially optimal.
△ Less
Submitted 14 July, 2022;
originally announced July 2022.
-
Open Problem: Properly learning decision trees in polynomial time?
Authors:
Guy Blanc,
Jane Lange,
Mingda Qiao,
Li-Yang Tan
Abstract:
The authors recently gave an $n^{O(\log\log n)}$ time membership query algorithm for properly learning decision trees under the uniform distribution (Blanc et al., 2021). The previous fastest algorithm for this problem ran in $n^{O(\log n)}$ time, a consequence of Ehrenfeucht and Haussler (1989)'s classic algorithm for the distribution-free setting. In this article we highlight the natural open pr…
▽ More
The authors recently gave an $n^{O(\log\log n)}$ time membership query algorithm for properly learning decision trees under the uniform distribution (Blanc et al., 2021). The previous fastest algorithm for this problem ran in $n^{O(\log n)}$ time, a consequence of Ehrenfeucht and Haussler (1989)'s classic algorithm for the distribution-free setting. In this article we highlight the natural open problem of obtaining a polynomial-time algorithm, discuss possible avenues towards obtaining it, and state intermediate milestones that we believe are of independent interest.
△ Less
Submitted 29 June, 2022;
originally announced June 2022.
-
Popular decision tree algorithms are provably noise tolerant
Authors:
Guy Blanc,
Jane Lange,
Ali Malik,
Li-Yang Tan
Abstract:
Using the framework of boosting, we prove that all impurity-based decision tree learning algorithms, including the classic ID3, C4.5, and CART, are highly noise tolerant. Our guarantees hold under the strongest noise model of nasty noise, and we provide near-matching upper and lower bounds on the allowable noise rate. We further show that these algorithms, which are simple and have long been centr…
▽ More
Using the framework of boosting, we prove that all impurity-based decision tree learning algorithms, including the classic ID3, C4.5, and CART, are highly noise tolerant. Our guarantees hold under the strongest noise model of nasty noise, and we provide near-matching upper and lower bounds on the allowable noise rate. We further show that these algorithms, which are simple and have long been central to everyday machine learning, enjoy provable guarantees in the noisy setting that are unmatched by existing algorithms in the theoretical literature on decision tree learning. Taken together, our results add to an ongoing line of research that seeks to place the empirical success of these practical decision tree algorithms on firm theoretical footing.
△ Less
Submitted 17 June, 2022;
originally announced June 2022.
-
The Query Complexity of Certification
Authors:
Guy Blanc,
Caleb Koch,
Jane Lange,
Li-Yang Tan
Abstract:
We study the problem of {\sl certification}: given queries to a function $f : \{0,1\}^n \to \{0,1\}$ with certificate complexity $\le k$ and an input $x^\star$, output a size-$k$ certificate for $f$'s value on $x^\star$. This abstractly models a central problem in explainable machine learning, where we think of $f$ as a blackbox model that we seek to explain the predictions of.
For monotone func…
▽ More
We study the problem of {\sl certification}: given queries to a function $f : \{0,1\}^n \to \{0,1\}$ with certificate complexity $\le k$ and an input $x^\star$, output a size-$k$ certificate for $f$'s value on $x^\star$. This abstractly models a central problem in explainable machine learning, where we think of $f$ as a blackbox model that we seek to explain the predictions of.
For monotone functions, a classic local search algorithm of Angluin accomplishes this task with $n$ queries, which we show is optimal for local search algorithms. Our main result is a new algorithm for certifying monotone functions with $O(k^8 \log n)$ queries, which comes close to matching the information-theoretic lower bound of $Ω(k \log n)$. The design and analysis of our algorithm are based on a new connection to threshold phenomena in monotone functions.
We further prove exponential-in-$k$ lower bounds when $f$ is non-monotone, and when $f$ is monotone but the algorithm is only given random examples of $f$. These lower bounds show that assumptions on the structure of $f$ and query access to it are both necessary for the polynomial dependence on $k$ that we achieve.
△ Less
Submitted 6 April, 2022; v1 submitted 19 January, 2022;
originally announced January 2022.
-
On the power of adaptivity in statistical adversaries
Authors:
Guy Blanc,
Jane Lange,
Ali Malik,
Li-Yang Tan
Abstract:
We study a fundamental question concerning adversarial noise models in statistical problems where the algorithm receives i.i.d. draws from a distribution $\mathcal{D}$. The definitions of these adversaries specify the type of allowable corruptions (noise model) as well as when these corruptions can be made (adaptivity); the latter differentiates between oblivious adversaries that can only corrupt…
▽ More
We study a fundamental question concerning adversarial noise models in statistical problems where the algorithm receives i.i.d. draws from a distribution $\mathcal{D}$. The definitions of these adversaries specify the type of allowable corruptions (noise model) as well as when these corruptions can be made (adaptivity); the latter differentiates between oblivious adversaries that can only corrupt the distribution $\mathcal{D}$ and adaptive adversaries that can have their corruptions depend on the specific sample $S$ that is drawn from $\mathcal{D}$.
In this work, we investigate whether oblivious adversaries are effectively equivalent to adaptive adversaries, across all noise models studied in the literature. Specifically, can the behavior of an algorithm $\mathcal{A}$ in the presence of oblivious adversaries always be well-approximated by that of an algorithm $\mathcal{A}'$ in the presence of adaptive adversaries? Our first result shows that this is indeed the case for the broad class of statistical query algorithms, under all reasonable noise models. We then show that in the specific case of additive noise, this equivalence holds for all algorithms. Finally, we map out an approach towards proving this statement in its fullest generality, for all algorithms and under all reasonable noise models.
△ Less
Submitted 29 June, 2022; v1 submitted 19 November, 2021;
originally announced November 2021.
-
Provably efficient, succinct, and precise explanations
Authors:
Guy Blanc,
Jane Lange,
Li-Yang Tan
Abstract:
We consider the problem of explaining the predictions of an arbitrary blackbox model $f$: given query access to $f$ and an instance $x$, output a small set of $x$'s features that in conjunction essentially determines $f(x)$. We design an efficient algorithm with provable guarantees on the succinctness and precision of the explanations that it returns. Prior algorithms were either efficient but lac…
▽ More
We consider the problem of explaining the predictions of an arbitrary blackbox model $f$: given query access to $f$ and an instance $x$, output a small set of $x$'s features that in conjunction essentially determines $f(x)$. We design an efficient algorithm with provable guarantees on the succinctness and precision of the explanations that it returns. Prior algorithms were either efficient but lacked such guarantees, or achieved such guarantees but were inefficient.
We obtain our algorithm via a connection to the problem of {\sl implicitly} learning decision trees. The implicit nature of this learning task allows for efficient algorithms even when the complexity of $f$ necessitates an intractably large surrogate decision tree. We solve the implicit learning problem by bringing together techniques from learning theory, local computation algorithms, and complexity theory.
Our approach of "explaining by implicit learning" shares elements of two previously disparate methods for post-hoc explanations, global and local explanations, and we make the case that it enjoys advantages of both.
△ Less
Submitted 1 November, 2021;
originally announced November 2021.
-
Properly learning decision trees in almost polynomial time
Authors:
Guy Blanc,
Jane Lange,
Mingda Qiao,
Li-Yang Tan
Abstract:
We give an $n^{O(\log\log n)}$-time membership query algorithm for properly and agnostically learning decision trees under the uniform distribution over $\{\pm 1\}^n$. Even in the realizable setting, the previous fastest runtime was $n^{O(\log n)}$, a consequence of a classic algorithm of Ehrenfeucht and Haussler. Our algorithm shares similarities with practical heuristics for learning decision tr…
▽ More
We give an $n^{O(\log\log n)}$-time membership query algorithm for properly and agnostically learning decision trees under the uniform distribution over $\{\pm 1\}^n$. Even in the realizable setting, the previous fastest runtime was $n^{O(\log n)}$, a consequence of a classic algorithm of Ehrenfeucht and Haussler. Our algorithm shares similarities with practical heuristics for learning decision trees, which we augment with additional ideas to circumvent known lower bounds against these heuristics. To analyze our algorithm, we prove a new structural result for decision trees that strengthens a theorem of O'Donnell, Saks, Schramm, and Servedio. While the OSSS theorem says that every decision tree has an influential variable, we show how every decision tree can be "pruned" so that every variable in the resulting tree is influential.
△ Less
Submitted 1 November, 2021; v1 submitted 1 September, 2021;
originally announced September 2021.
-
Decision tree heuristics can fail, even in the smoothed setting
Authors:
Guy Blanc,
Jane Lange,
Mingda Qiao,
Li-Yang Tan
Abstract:
Greedy decision tree learning heuristics are mainstays of machine learning practice, but theoretical justification for their empirical success remains elusive. In fact, it has long been known that there are simple target functions for which they fail badly (Kearns and Mansour, STOC 1996).
Recent work of Brutzkus, Daniely, and Malach (COLT 2020) considered the smoothed analysis model as a possibl…
▽ More
Greedy decision tree learning heuristics are mainstays of machine learning practice, but theoretical justification for their empirical success remains elusive. In fact, it has long been known that there are simple target functions for which they fail badly (Kearns and Mansour, STOC 1996).
Recent work of Brutzkus, Daniely, and Malach (COLT 2020) considered the smoothed analysis model as a possible avenue towards resolving this disconnect. Within the smoothed setting and for targets $f$ that are $k$-juntas, they showed that these heuristics successfully learn $f$ with depth-$k$ decision tree hypotheses. They conjectured that the same guarantee holds more generally for targets that are depth-$k$ decision trees.
We provide a counterexample to this conjecture: we construct targets that are depth-$k$ decision trees and show that even in the smoothed setting, these heuristics build trees of depth $2^{Ω(k)}$ before achieving high accuracy. We also show that the guarantees of Brutzkus et al. cannot extend to the agnostic setting: there are targets that are very close to $k$-juntas, for which these heuristics build trees of depth $2^{Ω(k)}$ before achieving high accuracy.
△ Less
Submitted 2 July, 2021;
originally announced July 2021.
-
Multiway Online Correlated Selection
Authors:
Guy Blanc,
Moses Charikar
Abstract:
We give a $0.5368$-competitive algorithm for edge-weighted online bipartite matching. Prior to our work, the best competitive ratio was $0.5086$ due to Fahrbach, Huang, Tao, and Zadimoghaddam (FOCS 2020). They achieved their breakthrough result by develo** a subroutine called \emph{online correlated selection} (OCS) which takes as input a sequence of pairs and selects one item from each pair. Im…
▽ More
We give a $0.5368$-competitive algorithm for edge-weighted online bipartite matching. Prior to our work, the best competitive ratio was $0.5086$ due to Fahrbach, Huang, Tao, and Zadimoghaddam (FOCS 2020). They achieved their breakthrough result by develo** a subroutine called \emph{online correlated selection} (OCS) which takes as input a sequence of pairs and selects one item from each pair. Importantly, the selections the OCS makes are negatively correlated.
We achieve our result by defining \emph{multiway} OCSes which receive arbitrarily many elements at each step, rather than just two. In addition to better competitive ratios, our formulation allows for a simpler reduction from edge-weighted online bipartite matching to OCSes. While Fahrbach et al. used a factor-revealing linear program to optimize the competitive ratio, our analysis directly connects the competitive ratio to the parameters of the multiway OCS. Finally, we show that the formulation of Farhbach et al. can achieve a competitive ratio of at most $0.5239$, confirming that multiway OCSes are strictly more powerful.
△ Less
Submitted 20 September, 2021; v1 submitted 10 June, 2021;
originally announced June 2021.
-
Learning stochastic decision trees
Authors:
Guy Blanc,
Jane Lange,
Li-Yang Tan
Abstract:
We give a quasipolynomial-time algorithm for learning stochastic decision trees that is optimally resilient to adversarial noise. Given an $η$-corrupted set of uniform random samples labeled by a size-$s$ stochastic decision tree, our algorithm runs in time $n^{O(\log(s/\varepsilon)/\varepsilon^2)}$ and returns a hypothesis with error within an additive $2η+ \varepsilon$ of the Bayes optimal. An a…
▽ More
We give a quasipolynomial-time algorithm for learning stochastic decision trees that is optimally resilient to adversarial noise. Given an $η$-corrupted set of uniform random samples labeled by a size-$s$ stochastic decision tree, our algorithm runs in time $n^{O(\log(s/\varepsilon)/\varepsilon^2)}$ and returns a hypothesis with error within an additive $2η+ \varepsilon$ of the Bayes optimal. An additive $2η$ is the information-theoretic minimum.
Previously no non-trivial algorithm with a guarantee of $O(η) + \varepsilon$ was known, even for weaker noise models. Our algorithm is furthermore proper, returning a hypothesis that is itself a decision tree; previously no such algorithm was known even in the noiseless setting.
△ Less
Submitted 8 May, 2021;
originally announced May 2021.
-
Reconstructing decision trees
Authors:
Guy Blanc,
Jane Lange,
Li-Yang Tan
Abstract:
We give the first {\sl reconstruction algorithm} for decision trees: given queries to a function $f$ that is $\mathrm{opt}$-close to a size-$s$ decision tree, our algorithm provides query access to a decision tree $T$ where:
$\circ$ $T$ has size $S = s^{O((\log s)^2/\varepsilon^3)}$;
$\circ$ $\mathrm{dist}(f,T)\le O(\mathrm{opt})+\varepsilon$;
$\circ$ Every query to $T$ is answered with…
▽ More
We give the first {\sl reconstruction algorithm} for decision trees: given queries to a function $f$ that is $\mathrm{opt}$-close to a size-$s$ decision tree, our algorithm provides query access to a decision tree $T$ where:
$\circ$ $T$ has size $S = s^{O((\log s)^2/\varepsilon^3)}$;
$\circ$ $\mathrm{dist}(f,T)\le O(\mathrm{opt})+\varepsilon$;
$\circ$ Every query to $T$ is answered with $\mathrm{poly}((\log s)/\varepsilon)\cdot \log n$ queries to $f$ and in $\mathrm{poly}((\log s)/\varepsilon)\cdot n\log n$ time.
This yields a {\sl tolerant tester} that distinguishes functions that are close to size-$s$ decision trees from those that are far from size-$S$ decision trees. The polylogarithmic dependence on $s$ in the efficiency of our tester is exponentially smaller than that of existing testers.
Since decision tree complexity is well known to be related to numerous other boolean function properties, our results also provide a new algorithms for reconstructing and testing these properties.
△ Less
Submitted 22 May, 2022; v1 submitted 15 December, 2020;
originally announced December 2020.
-
Estimating decision tree learnability with polylogarithmic sample complexity
Authors:
Guy Blanc,
Neha Gupta,
Jane Lange,
Li-Yang Tan
Abstract:
We show that top-down decision tree learning heuristics are amenable to highly efficient learnability estimation: for monotone target functions, the error of the decision tree hypothesis constructed by these heuristics can be estimated with polylogarithmically many labeled examples, exponentially smaller than the number necessary to run these heuristics, and indeed, exponentially smaller than info…
▽ More
We show that top-down decision tree learning heuristics are amenable to highly efficient learnability estimation: for monotone target functions, the error of the decision tree hypothesis constructed by these heuristics can be estimated with polylogarithmically many labeled examples, exponentially smaller than the number necessary to run these heuristics, and indeed, exponentially smaller than information-theoretic minimum required to learn a good decision tree. This adds to a small but growing list of fundamental learning algorithms that have been shown to be amenable to learnability estimation.
En route to this result, we design and analyze sample-efficient minibatch versions of top-down decision tree learning heuristics and show that they achieve the same provable guarantees as the full-batch versions. We further give "active local" versions of these heuristics: given a test point $x^\star$, we show how the label $T(x^\star)$ of the decision tree hypothesis $T$ can be computed with polylogarithmically many labeled examples, exponentially smaller than the number necessary to learn $T$.
△ Less
Submitted 3 November, 2020;
originally announced November 2020.
-
Query strategies for priced information, revisited
Authors:
Guy Blanc,
Jane Lange,
Li-Yang Tan
Abstract:
We consider the problem of designing query strategies for priced information, introduced by Charikar et al. In this problem the algorithm designer is given a function $f : \{0,1\}^n \to \{-1,1\}$ and a price associated with each of the $n$ coordinates. The goal is to design a query strategy for determining $f$'s value on unknown inputs for minimum cost.
Prior works on this problem have focused o…
▽ More
We consider the problem of designing query strategies for priced information, introduced by Charikar et al. In this problem the algorithm designer is given a function $f : \{0,1\}^n \to \{-1,1\}$ and a price associated with each of the $n$ coordinates. The goal is to design a query strategy for determining $f$'s value on unknown inputs for minimum cost.
Prior works on this problem have focused on specific classes of functions. We analyze a simple and natural strategy that applies to all functions $f$, and show that its performance relative to the optimal strategy can be expressed in terms of a basic complexity measure of $f$, its influence. For $\varepsilon \in (0,\frac1{2})$, writing $\mathsf{opt}$ to denote the expected cost of the optimal strategy that errs on at most an $\varepsilon$-fraction of inputs, our strategy has expected cost $\mathsf{opt} \cdot \mathrm{Inf}(f)/\varepsilon^2$ and also errs on at most an $O(\varepsilon)$-fraction of inputs. This connection yields new guarantees that complement existing ones for a number of function classes that have been studied in this context, as well as new guarantees for new classes.
Finally, we show that improving on the parameters that we achieve will require making progress on the longstanding open problem of properly learning decision trees.
△ Less
Submitted 21 October, 2020;
originally announced October 2020.
-
Universal guarantees for decision tree induction via a higher-order splitting criterion
Authors:
Guy Blanc,
Neha Gupta,
Jane Lange,
Li-Yang Tan
Abstract:
We propose a simple extension of top-down decision tree learning heuristics such as ID3, C4.5, and CART. Our algorithm achieves provable guarantees for all target functions $f: \{-1,1\}^n \to \{-1,1\}$ with respect to the uniform distribution, circumventing impossibility results showing that existing heuristics fare poorly even for simple target functions. The crux of our extension is a new splitt…
▽ More
We propose a simple extension of top-down decision tree learning heuristics such as ID3, C4.5, and CART. Our algorithm achieves provable guarantees for all target functions $f: \{-1,1\}^n \to \{-1,1\}$ with respect to the uniform distribution, circumventing impossibility results showing that existing heuristics fare poorly even for simple target functions. The crux of our extension is a new splitting criterion that takes into account the correlations between $f$ and small subsets of its attributes. The splitting criteria of existing heuristics (e.g. Gini impurity and information gain), in contrast, are based solely on the correlations between $f$ and its individual attributes.
Our algorithm satisfies the following guarantee: for all target functions $f : \{-1,1\}^n \to \{-1,1\}$, sizes $s\in \mathbb{N}$, and error parameters $ε$, it constructs a decision tree of size $s^{\tilde{O}((\log s)^2/ε^2)}$ that achieves error $\le O(\mathsf{opt}_s) + ε$, where $\mathsf{opt}_s$ denotes the error of the optimal size $s$ decision tree. A key technical notion that drives our analysis is the noise stability of $f$, a well-studied smoothness measure.
△ Less
Submitted 16 October, 2020;
originally announced October 2020.
-
Efficient hyperparameter optimization by way of PAC-Bayes bound minimization
Authors:
John J. Cherian,
Andrew G. Taube,
Robert T. McGibbon,
Panagiotis Angelikopoulos,
Guy Blanc,
Michael Snarski,
Daniel D. Richman,
John L. Klepeis,
David E. Shaw
Abstract:
Identifying optimal values for a high-dimensional set of hyperparameters is a problem that has received growing attention given its importance to large-scale machine learning applications such as neural architecture search. Recently developed optimization methods can be used to select thousands or even millions of hyperparameters. Such methods often yield overfit models, however, leading to poor p…
▽ More
Identifying optimal values for a high-dimensional set of hyperparameters is a problem that has received growing attention given its importance to large-scale machine learning applications such as neural architecture search. Recently developed optimization methods can be used to select thousands or even millions of hyperparameters. Such methods often yield overfit models, however, leading to poor performance on unseen data. We argue that this overfitting results from using the standard hyperparameter optimization objective function. Here we present an alternative objective that is equivalent to a Probably Approximately Correct-Bayes (PAC-Bayes) bound on the expected out-of-sample error. We then devise an efficient gradient-based algorithm to minimize this objective; the proposed method has asymptotic space and time complexity equal to or better than other gradient-based hyperparameter optimization methods. We show that this new method significantly reduces out-of-sample error when applied to hyperparameter optimization problems known to be prone to overfitting.
△ Less
Submitted 14 August, 2020;
originally announced August 2020.
-
Provable guarantees for decision tree induction: the agnostic setting
Authors:
Guy Blanc,
Jane Lange,
Li-Yang Tan
Abstract:
We give strengthened provable guarantees on the performance of widely employed and empirically successful {\sl top-down decision tree learning heuristics}. While prior works have focused on the realizable setting, we consider the more realistic and challenging {\sl agnostic} setting. We show that for all monotone functions~$f$ and parameters $s\in \mathbb{N}$, these heuristics construct a decision…
▽ More
We give strengthened provable guarantees on the performance of widely employed and empirically successful {\sl top-down decision tree learning heuristics}. While prior works have focused on the realizable setting, we consider the more realistic and challenging {\sl agnostic} setting. We show that for all monotone functions~$f$ and parameters $s\in \mathbb{N}$, these heuristics construct a decision tree of size $s^{\tilde{O}((\log s)/\varepsilon^2)}$ that achieves error $\le \mathsf{opt}_s + \varepsilon$, where $\mathsf{opt}_s$ denotes the error of the optimal size-$s$ decision tree for $f$. Previously, such a guarantee was not known to be achievable by any algorithm, even one that is not based on top-down heuristics. We complement our algorithmic guarantee with a near-matching $s^{\tildeΩ(\log s)}$ lower bound.
△ Less
Submitted 1 June, 2020;
originally announced June 2020.
-
Anomalous Communications Detection in IoT Networks Using Sparse Autoencoders
Authors:
Mustafizur Rahman Shahid,
Gregory Blanc,
Zonghua Zhang,
Hervé Debar
Abstract:
Nowadays, IoT devices have been widely deployed for enabling various smart services, such as, smart home or e-healthcare. However, security remains as one of the paramount concern as many IoT devices are vulnerable. Moreover, IoT malware are constantly evolving and getting more sophisticated. IoT devices are intended to perform very specific tasks, so their networking behavior is expected to be re…
▽ More
Nowadays, IoT devices have been widely deployed for enabling various smart services, such as, smart home or e-healthcare. However, security remains as one of the paramount concern as many IoT devices are vulnerable. Moreover, IoT malware are constantly evolving and getting more sophisticated. IoT devices are intended to perform very specific tasks, so their networking behavior is expected to be reasonably stable and predictable. Any significant behavioral deviation from the normal patterns would indicate anomalous events. In this paper, we present a method to detect anomalous network communications in IoT networks using a set of sparse autoencoders. The proposed approach allows us to differentiate malicious communications from legitimate ones. So that, if a device is compromised only malicious communications can be dropped while the service provided by the device is not totally interrupted. To characterize network behavior, bidirectional TCP flows are extracted and described using statistics on the size of the first N packets sent and received, along with statistics on the corresponding inter-arrival times between packets. A set of sparse autoencoders is then trained to learn the profile of the legitimate communications generated by an experimental smart home network. Depending on the value of N, the developed model achieves attack detection rates ranging from 86.9% to 91.2%, and false positive rates ranging from 0.1% to 0.5%.
△ Less
Submitted 26 December, 2019;
originally announced December 2019.
-
Constructive derandomization of query algorithms
Authors:
Guy Blanc,
Jane Lange,
Li-Yang Tan
Abstract:
We give efficient deterministic algorithms for converting randomized query algorithms into deterministic ones. We first give an algorithm that takes as input a randomized $q$-query algorithm $R$ with description length $N$ and a parameter $\varepsilon$, runs in time $\mathrm{poly}(N) \cdot 2^{O(q/\varepsilon)}$, and returns a deterministic $O(q/\varepsilon)$-query algorithm $D$ that $\varepsilon$-…
▽ More
We give efficient deterministic algorithms for converting randomized query algorithms into deterministic ones. We first give an algorithm that takes as input a randomized $q$-query algorithm $R$ with description length $N$ and a parameter $\varepsilon$, runs in time $\mathrm{poly}(N) \cdot 2^{O(q/\varepsilon)}$, and returns a deterministic $O(q/\varepsilon)$-query algorithm $D$ that $\varepsilon$-approximates the acceptance probabilities of $R$. These parameters are near-optimal: runtime $N + 2^{Ω(q/\varepsilon)}$ and query complexity $Ω(q/\varepsilon)$ are necessary.
Next, we give algorithms for instance-optimal and online versions of the problem:
$\circ$ Instance optimal: Construct a deterministic $q^\star_R$-query algorithm $D$, where $q^\star_R$ is minimum query complexity of any deterministic algorithm that $\varepsilon$-approximates $R$.
$\circ$ Online: Deterministically approximate the acceptance probability of $R$ for a specific input $\underline{x}$ in time $\mathrm{poly}(N,q,1/\varepsilon)$, without constructing $D$ in its entirety.
Applying the techniques we develop for these extensions, we constructivize classic results that relate the deterministic, randomized, and quantum query complexities of boolean functions (Nisan, STOC 1989; Beals et al., FOCS 1998). This has direct implications for the Turing machine model of computation: sublinear-time algorithms for total decision problems can be efficiently derandomized and dequantized with a subexponential-time preprocessing step.
△ Less
Submitted 6 December, 2019;
originally announced December 2019.
-
Top-down induction of decision trees: rigorous guarantees and inherent limitations
Authors:
Guy Blanc,
Jane Lange,
Li-Yang Tan
Abstract:
Consider the following heuristic for building a decision tree for a function $f : \{0,1\}^n \to \{\pm 1\}$. Place the most influential variable $x_i$ of $f$ at the root, and recurse on the subfunctions $f_{x_i=0}$ and $f_{x_i=1}$ on the left and right subtrees respectively; terminate once the tree is an $\varepsilon$-approximation of $f$. We analyze the quality of this heuristic, obtaining near-ma…
▽ More
Consider the following heuristic for building a decision tree for a function $f : \{0,1\}^n \to \{\pm 1\}$. Place the most influential variable $x_i$ of $f$ at the root, and recurse on the subfunctions $f_{x_i=0}$ and $f_{x_i=1}$ on the left and right subtrees respectively; terminate once the tree is an $\varepsilon$-approximation of $f$. We analyze the quality of this heuristic, obtaining near-matching upper and lower bounds:
$\circ$ Upper bound: For every $f$ with decision tree size $s$ and every $\varepsilon \in (0,\frac1{2})$, this heuristic builds a decision tree of size at most $s^{O(\log(s/\varepsilon)\log(1/\varepsilon))}$.
$\circ$ Lower bound: For every $\varepsilon \in (0,\frac1{2})$ and $s \le 2^{\tilde{O}(\sqrt{n})}$, there is an $f$ with decision tree size $s$ such that this heuristic builds a decision tree of size $s^{\tildeΩ(\log s)}$.
We also obtain upper and lower bounds for monotone functions: $s^{O(\sqrt{\log s}/\varepsilon)}$ and $s^{\tildeΩ(\sqrt[4]{\log s } )}$ respectively. The lower bound disproves conjectures of Fiat and Pechyony (2004) and Lee (2009).
Our upper bounds yield new algorithms for properly learning decision trees under the uniform distribution. We show that these algorithms---which are motivated by widely employed and empirically successful top-down decision tree learning heuristics such as ID3, C4.5, and CART---achieve provable guarantees that compare favorably with those of the current fastest algorithm (Ehrenfeucht and Haussler, 1989). Our lower bounds shed new light on the limitations of these heuristics.
Finally, we revisit the classic work of Ehrenfeucht and Haussler. We extend it to give the first uniform-distribution proper learning algorithm that achieves polynomial sample and memory complexity, while matching its state-of-the-art quasipolynomial runtime.
△ Less
Submitted 17 November, 2019;
originally announced November 2019.
-
Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process
Authors:
Guy Blanc,
Neha Gupta,
Gregory Valiant,
Paul Valiant
Abstract:
We consider networks, trained via stochastic gradient descent to minimize $\ell_2$ loss, with the training labels perturbed by independent noise at each iteration. We characterize the behavior of the training dynamics near any parameter vector that achieves zero training error, in terms of an implicit regularization term corresponding to the sum over the data points, of the squared $\ell_2$ norm o…
▽ More
We consider networks, trained via stochastic gradient descent to minimize $\ell_2$ loss, with the training labels perturbed by independent noise at each iteration. We characterize the behavior of the training dynamics near any parameter vector that achieves zero training error, in terms of an implicit regularization term corresponding to the sum over the data points, of the squared $\ell_2$ norm of the gradient of the model with respect to the parameter vector, evaluated at each data point. This holds for networks of any connectivity, width, depth, and choice of activation function. We interpret this implicit regularization term for three simple settings: matrix sensing, two layer ReLU networks trained on one-dimensional data, and two layer networks with sigmoid activations trained on a single datapoint. For these settings, we show why this new and general implicit regularization effect drives the networks towards "simple" models.
△ Less
Submitted 22 July, 2020; v1 submitted 19 April, 2019;
originally announced April 2019.
-
Adaptive Sampled Softmax with Kernel Based Sampling
Authors:
Guy Blanc,
Steffen Rendle
Abstract:
Softmax is the most commonly used output function for multiclass problems and is widely used in areas such as vision, natural language processing, and recommendation. A softmax model has linear costs in the number of classes which makes it too expensive for many real-world problems. A common approach to speed up training involves sampling only some of the classes at each training step. It is known…
▽ More
Softmax is the most commonly used output function for multiclass problems and is widely used in areas such as vision, natural language processing, and recommendation. A softmax model has linear costs in the number of classes which makes it too expensive for many real-world problems. A common approach to speed up training involves sampling only some of the classes at each training step. It is known that this method is biased and that the bias increases the more the sampling distribution deviates from the output distribution. Nevertheless, almost any recent work uses simple sampling distributions that require a large sample size to mitigate the bias. In this work, we propose a new class of kernel based sampling methods and develop an efficient sampling algorithm. Kernel based sampling adapts to the model as it is trained, thus resulting in low bias. Kernel based sampling can be easily applied to many models because it relies only on the model's last hidden layer. We empirically study the trade-off of bias, sampling distribution and sample size and show that kernel based sampling results in low bias with few samples.
△ Less
Submitted 1 August, 2018; v1 submitted 1 December, 2017;
originally announced December 2017.
-
Combining Technical and Financial Impacts for Countermeasure Selection
Authors:
Gustavo Gonzalez-Granadillo,
Christophe Ponchel,
Gregory Blanc,
Hervé Debar
Abstract:
Research in information security has generally focused on providing a comprehensive interpretation of threats, vulnerabilities, and attacks, in particular to evaluate their danger and prioritize responses accordingly. Most of the current approaches propose advanced techniques to detect intrusions and complex attacks but few of these approaches propose well defined methodologies to react against a…
▽ More
Research in information security has generally focused on providing a comprehensive interpretation of threats, vulnerabilities, and attacks, in particular to evaluate their danger and prioritize responses accordingly. Most of the current approaches propose advanced techniques to detect intrusions and complex attacks but few of these approaches propose well defined methodologies to react against a given attack. In this paper, we propose a novel and systematic method to select security countermeasures from a pool of candidates, by ranking them based on the technical and financial impact associated to each alternative. The method includes industrial evaluation and simulations of the impact associated to a given security measure which allows to compute the return on response investment for different candidates. A simple case study is proposed at the end of the paper to show the applicability of the model.
△ Less
Submitted 16 October, 2014;
originally announced November 2014.