Search | arXiv e-print repository

Improved Regret Bounds for Bandits with Expert Advice

Authors: Nicolò Cesa-Bianchi, Khaled Eldowa, Emmanuel Esposito, Julia Olkhovskaya

Abstract: In this research note, we revisit the bandits with expert advice problem. Under a restricted feedback model, we prove a lower bound of order $\sqrt{K T \ln(N/K)}$ for the worst-case regret, where $K$ is the number of actions, $N>K$ the number of experts, and $T$ the time horizon. This matches a previously known upper bound of the same order and improves upon the best available lower bound of… ▽ More In this research note, we revisit the bandits with expert advice problem. Under a restricted feedback model, we prove a lower bound of order $\sqrt{K T \ln(N/K)}$ for the worst-case regret, where $K$ is the number of actions, $N>K$ the number of experts, and $T$ the time horizon. This matches a previously known upper bound of the same order and improves upon the best available lower bound of $\sqrt{K T (\ln N) / (\ln K)}$. For the standard feedback model, we prove a new instance-based upper bound that depends on the agreement between the experts and provides a logarithmic improvement compared to prior results. △ Less

Submitted 24 June, 2024; originally announced June 2024.

arXiv:2406.10529 [pdf, ps, other]

A Theory of Interpretable Approximations

Authors: Marco Bressan, Nicolò Cesa-Bianchi, Emmanuel Esposito, Yishay Mansour, Shay Moran, Maximilian Thiessen

Abstract: Can a deep neural network be approximated by a small decision tree based on simple features? This question and its variants are behind the growing demand for machine learning models that are *interpretable* by humans. In this work we study such questions by introducing *interpretable approximations*, a notion that captures the idea of approximating a target concept $c$ by a small aggregation of co… ▽ More Can a deep neural network be approximated by a small decision tree based on simple features? This question and its variants are behind the growing demand for machine learning models that are *interpretable* by humans. In this work we study such questions by introducing *interpretable approximations*, a notion that captures the idea of approximating a target concept $c$ by a small aggregation of concepts from some base class $\mathcal{H}$. In particular, we consider the approximation of a binary concept $c$ by decision trees based on a simple class $\mathcal{H}$ (e.g., of bounded VC dimension), and use the tree depth as a measure of complexity. Our primary contribution is the following remarkable trichotomy. For any given pair of $\mathcal{H}$ and $c$, exactly one of these cases holds: (i) $c$ cannot be approximated by $\mathcal{H}$ with arbitrary accuracy; (ii) $c$ can be approximated by $\mathcal{H}$ with arbitrary accuracy, but there exists no universal rate that bounds the complexity of the approximations as a function of the accuracy; or (iii) there exists a constant $κ$ that depends only on $\mathcal{H}$ and $c$ such that, for *any* data distribution and *any* desired accuracy level, $c$ can be approximated by $\mathcal{H}$ with a complexity not exceeding $κ$. This taxonomy stands in stark contrast to the landscape of supervised classification, which offers a complex array of distribution-free and universally learnable scenarios. We show that, in the case of interpretable approximations, even a slightly nontrivial a-priori guarantee on the complexity of approximations implies approximations with constant (distribution-free and accuracy-free) complexity. We extend our trichotomy to classes $\mathcal{H}$ of unbounded VC dimension and give characterizations of interpretability based on the algebra generated by $\mathcal{H}$. △ Less

Submitted 15 June, 2024; originally announced June 2024.

Comments: To appear at COLT 2024

arXiv:2406.01192 [pdf, other]

Sparsity-Agnostic Linear Bandits with Adaptive Adversaries

Authors: Tianyuan **, Kyoungseok Jang, Nicolò Cesa-Bianchi

Abstract: We study stochastic linear bandits where, in each round, the learner receives a set of actions (i.e., feature vectors), from which it chooses an element and obtains a stochastic reward. The expected reward is a fixed but unknown linear function of the chosen action. We study sparse regret bounds, that depend on the number $S$ of non-zero coefficients in the linear reward function. Previous works f… ▽ More We study stochastic linear bandits where, in each round, the learner receives a set of actions (i.e., feature vectors), from which it chooses an element and obtains a stochastic reward. The expected reward is a fixed but unknown linear function of the chosen action. We study sparse regret bounds, that depend on the number $S$ of non-zero coefficients in the linear reward function. Previous works focused on the case where $S$ is known, or the action sets satisfy additional assumptions. In this work, we obtain the first sparse regret bounds that hold when $S$ is unknown and the action sets are adversarially generated. Our techniques combine online to confidence set conversions with a novel randomized model selection approach over a hierarchy of nested confidence sets. When $S$ is known, our analysis recovers state-of-the-art bounds for adversarial action sets. We also show that a variant of our approach, using Exp3 to dynamically select the confidence sets, can be used to improve the empirical performance of stochastic linear bandits while enjoying a regret bound with optimal dependence on the time horizon. △ Less

Submitted 3 June, 2024; originally announced June 2024.

Comments: 25 pages

arXiv:2402.10282 [pdf, other]

Information Capacity Regret Bounds for Bandits with Mediator Feedback

Authors: Khaled Eldowa, Nicolò Cesa-Bianchi, Alberto Maria Metelli, Marcello Restelli

Abstract: This work addresses the mediator feedback problem, a bandit game where the decision set consists of a number of policies, each associated with a probability distribution over a common space of outcomes. Upon choosing a policy, the learner observes an outcome sampled from its distribution and incurs the loss assigned to this outcome in the present round. We introduce the policy set capacity as an i… ▽ More This work addresses the mediator feedback problem, a bandit game where the decision set consists of a number of policies, each associated with a probability distribution over a common space of outcomes. Upon choosing a policy, the learner observes an outcome sampled from its distribution and incurs the loss assigned to this outcome in the present round. We introduce the policy set capacity as an information-theoretic measure for the complexity of the policy set. Adopting the classical EXP4 algorithm, we provide new regret bounds depending on the policy set capacity in both the adversarial and the stochastic settings. For a selection of policy set families, we prove nearly-matching lower bounds, scaling similarly with the capacity. We also consider the case when the policies' distributions can vary between rounds, thus addressing the related bandits with expert advice problem, which we improve upon its prior results. Additionally, we prove a lower bound showing that exploiting the similarity between the policies is not possible in general under linear bandit feedback. Finally, for a full-information variant, we provide a regret bound scaling with the information radius of the policy set. △ Less

Submitted 15 February, 2024; originally announced February 2024.

arXiv:2312.15433 [pdf, ps, other]

Best-of-Both-Worlds Algorithms for Linear Contextual Bandits

Authors: Yuko Kuroki, Alberto Rumi, Taira Tsuchiya, Fabio Vitale, Nicolò Cesa-Bianchi

Abstract: We study best-of-both-worlds algorithms for $K$-armed linear contextual bandits. Our algorithms deliver near-optimal regret bounds in both the adversarial and stochastic regimes, without prior knowledge about the environment. In the stochastic regime, we achieve the polylogarithmic rate $\frac{(dK)^2\mathrm{poly}\log(dKT)}{Δ_{\min}}$, where $Δ_{\min}$ is the minimum suboptimality gap over the $d$-… ▽ More We study best-of-both-worlds algorithms for $K$-armed linear contextual bandits. Our algorithms deliver near-optimal regret bounds in both the adversarial and stochastic regimes, without prior knowledge about the environment. In the stochastic regime, we achieve the polylogarithmic rate $\frac{(dK)^2\mathrm{poly}\log(dKT)}{Δ_{\min}}$, where $Δ_{\min}$ is the minimum suboptimality gap over the $d$-dimensional context space. In the adversarial regime, we obtain either the first-order $\widetilde{O}(dK\sqrt{L^*})$ bound, or the second-order $\widetilde{O}(dK\sqrt{Λ^*})$ bound, where $L^*$ is the cumulative loss of the best action and $Λ^*$ is a notion of the cumulative second moment for the losses incurred by the algorithm. Moreover, we develop an algorithm based on FTRL with Shannon entropy regularizer that does not require the knowledge of the inverse of the covariance matrix, and achieves a polylogarithmic regret in the stochastic regime while obtaining $\widetilde{O}\big(dK\sqrt{T}\big)$ regret bounds in the adversarial regime. △ Less

Submitted 19 February, 2024; v1 submitted 24 December, 2023; originally announced December 2023.

Comments: Accepted at AISTATS2024

arXiv:2310.09597 [pdf, other]

Adaptive maximization of social welfare

Authors: Nicolo Cesa-Bianchi, Roberto Colomboni, Maximilian Kasy

Abstract: We consider the problem of repeatedly choosing policies to maximize social welfare. Welfare is a weighted sum of private utility and public revenue. Earlier outcomes inform later policies. Utility is not observed, but indirectly inferred. Response functions are learned through experimentation. We derive a lower bound on regret, and a matching adversarial upper bound for a variant of the Exp3 alg… ▽ More We consider the problem of repeatedly choosing policies to maximize social welfare. Welfare is a weighted sum of private utility and public revenue. Earlier outcomes inform later policies. Utility is not observed, but indirectly inferred. Response functions are learned through experimentation. We derive a lower bound on regret, and a matching adversarial upper bound for a variant of the Exp3 algorithm. Cumulative regret grows at a rate of $T^{2/3}$. This implies that (i) welfare maximization is harder than the multi-armed bandit problem (with a rate of $T^{1/2}$ for finite policy sets), and (ii) our algorithm achieves the optimal rate. For the stochastic setting, if social welfare is concave, we can achieve a rate of $T^{1/2}$ (for continuous policy sets), using a dyadic search algorithm. We analyze an extension to nonlinear income taxation, and sketch an extension to commodity taxation. We compare our setting to monopoly pricing (which is easier), and price setting for bilateral trade (which is harder). △ Less

Submitted 14 October, 2023; originally announced October 2023.

arXiv:2307.00836 [pdf, other]

Trading-Off Payments and Accuracy in Online Classification with Paid Stochastic Experts

Authors: Dirk van der Hoeven, Ciara Pike-Burke, Hao Qiu, Nicolo Cesa-Bianchi

Abstract: We investigate online classification with paid stochastic experts. Here, before making their prediction, each expert must be paid. The amount that we pay each expert directly influences the accuracy of their prediction through some unknown Lipschitz "productivity" function. In each round, the learner must decide how much to pay each expert and then make a prediction. They incur a cost equal to a w… ▽ More We investigate online classification with paid stochastic experts. Here, before making their prediction, each expert must be paid. The amount that we pay each expert directly influences the accuracy of their prediction through some unknown Lipschitz "productivity" function. In each round, the learner must decide how much to pay each expert and then make a prediction. They incur a cost equal to a weighted sum of the prediction error and upfront payments for all experts. We introduce an online learning algorithm whose total cost after $T$ rounds exceeds that of a predictor which knows the productivity of all experts in advance by at most $\mathcal{O}(K^2(\log T)\sqrt{T})$ where $K$ is the number of experts. In order to achieve this result, we combine Lipschitz bandits and online classification with surrogate losses. These tools allow us to improve upon the bound of order $T^{2/3}$ one would obtain in the standard Lipschitz bandit setting. Our algorithm is empirically evaluated on synthetic data △ Less

Submitted 3 July, 2023; originally announced July 2023.

Comments: ICML 2023

arXiv:2303.08102 [pdf, ps, other]

Information-Theoretic Regret Bounds for Bandits with Fixed Expert Advice

Authors: Khaled Eldowa, Nicolò Cesa-Bianchi, Alberto Maria Metelli, Marcello Restelli

Abstract: We investigate the problem of bandits with expert advice when the experts are fixed and known distributions over the actions. Improving on previous analyses, we show that the regret in this setting is controlled by information-theoretic quantities that measure the similarity between experts. In some natural special cases, this allows us to obtain the first regret bound for EXP4 that can get arbitr… ▽ More We investigate the problem of bandits with expert advice when the experts are fixed and known distributions over the actions. Improving on previous analyses, we show that the regret in this setting is controlled by information-theoretic quantities that measure the similarity between experts. In some natural special cases, this allows us to obtain the first regret bound for EXP4 that can get arbitrarily close to zero if the experts are similar enough. While for a different algorithm, we provide another bound that describes the similarity between the experts in terms of the KL-divergence, and we show that this bound can be smaller than the one of EXP4 in some cases. Additionally, we provide lower bounds for certain classes of experts showing that the algorithms we analyzed are nearly optimal in some cases. △ Less

Submitted 15 March, 2023; v1 submitted 14 March, 2023; originally announced March 2023.

arXiv:2206.02656 [pdf, ps, other]

A Regret-Variance Trade-Off in Online Learning

Authors: Dirk van der Hoeven, Nikita Zhivotovskiy, Nicolò Cesa-Bianchi

Abstract: We consider prediction with expert advice for strongly convex and bounded losses, and investigate trade-offs between regret and "variance" (i.e., squared difference of learner's predictions and best expert predictions). With $K$ experts, the Exponentially Weighted Average (EWA) algorithm is known to achieve $O(\log K)$ regret. We prove that a variant of EWA either achieves a negative regret (i.e.,… ▽ More We consider prediction with expert advice for strongly convex and bounded losses, and investigate trade-offs between regret and "variance" (i.e., squared difference of learner's predictions and best expert predictions). With $K$ experts, the Exponentially Weighted Average (EWA) algorithm is known to achieve $O(\log K)$ regret. We prove that a variant of EWA either achieves a negative regret (i.e., the algorithm outperforms the best expert), or guarantees a $O(\log K)$ bound on both variance and regret. Building on this result, we show several examples of how variance of predictions can be exploited in learning. In the online to batch analysis, we show that a large empirical variance allows to stop the online to batch conversion early and outperform the risk of the best predictor in the class. We also recover the optimal rate of model selection aggregation when we do not consider early stop**. In online prediction with corrupted losses, we show that the effect of corruption on the regret can be compensated by a large variance. In online selective sampling, we design an algorithm that samples less when the variance is large, while guaranteeing the optimal regret bound in expectation. In online learning with abstention, we use a similar term as the variance to derive the first high-probability $O(\log K)$ regret bound in this setting. Finally, we extend our results to the setting of online linear regression. △ Less

Submitted 6 June, 2022; originally announced June 2022.

arXiv:2106.04982 [pdf, other]

Cooperative Online Learning with Feedback Graphs

Authors: Nicolò Cesa-Bianchi, Tommaso R. Cesari, Riccardo Della Vecchia

Abstract: We study the interplay between feedback and communication in a cooperative online learning setting where a network of agents solves a task in which the learners' feedback is determined by an arbitrary graph. We characterize regret in terms of the independence number of the strong product between the feedback graph and the communication network. Our analysis recovers as special cases many previousl… ▽ More We study the interplay between feedback and communication in a cooperative online learning setting where a network of agents solves a task in which the learners' feedback is determined by an arbitrary graph. We characterize regret in terms of the independence number of the strong product between the feedback graph and the communication network. Our analysis recovers as special cases many previously known bounds for distributed online learning with either expert or bandit feedback. A more detailed version of our results also captures the dependence of the regret on the delay caused by the time the information takes to traverse each graph. Experiments run on synthetic data show that the empirical behavior of our algorithm is consistent with the theoretical results. △ Less

Submitted 24 September, 2022; v1 submitted 9 June, 2021; originally announced June 2021.

arXiv:2106.04913 [pdf, ps, other]

On Margin-Based Cluster Recovery with Oracle Queries

Authors: Marco Bressan, Nicolò Cesa-Bianchi, Silvio Lattanzi, Andrea Paudice

Abstract: We study an active cluster recovery problem where, given a set of $n$ points and an oracle answering queries like "are these two points in the same cluster?", the task is to recover exactly all clusters using as few queries as possible. We begin by introducing a simple but general notion of margin between clusters that captures, as special cases, the margins used in previous work, the classic SVM… ▽ More We study an active cluster recovery problem where, given a set of $n$ points and an oracle answering queries like "are these two points in the same cluster?", the task is to recover exactly all clusters using as few queries as possible. We begin by introducing a simple but general notion of margin between clusters that captures, as special cases, the margins used in previous work, the classic SVM margin, and standard notions of stability for center-based clusterings. Then, under our margin assumptions we design algorithms that, in a variety of settings, recover all clusters exactly using only $O(\log n)$ queries. For the Euclidean case, $\mathbb{R}^m$, we give an algorithm that recovers arbitrary convex clusters, in polynomial time, and with a number of queries that is lower than the best existing algorithm by $Θ(m^m)$ factors. For general pseudometric spaces, where clusters might not be convex or might not have any notion of shape, we give an algorithm that achieves the $O(\log n)$ query bound, and is provably near-optimal as a function of the packing number of the space. Finally, for clusterings realized by binary concept classes, we give a combinatorial characterization of recoverability with $O(\log n)$ queries, and we show that, for many concept classes in Euclidean spaces, this characterization is equivalent to our margin condition. Our results show a deep connection between cluster margins and active cluster recoverability. △ Less

Submitted 9 June, 2021; originally announced June 2021.

arXiv:2102.09864 [pdf, other]

An Algorithm for Stochastic and Adversarial Bandits with Switching Costs

Authors: Chloé Rouyer, Yevgeny Seldin, Nicolò Cesa-Bianchi

Abstract: We propose an algorithm for stochastic and adversarial multiarmed bandits with switching costs, where the algorithm pays a price $λ$ every time it switches the arm being played. Our algorithm is based on adaptation of the Tsallis-INF algorithm of Zimmert and Seldin (2021) and requires no prior knowledge of the regime or time horizon. In the oblivious adversarial setting it achieves the minimax opt… ▽ More We propose an algorithm for stochastic and adversarial multiarmed bandits with switching costs, where the algorithm pays a price $λ$ every time it switches the arm being played. Our algorithm is based on adaptation of the Tsallis-INF algorithm of Zimmert and Seldin (2021) and requires no prior knowledge of the regime or time horizon. In the oblivious adversarial setting it achieves the minimax optimal regret bound of $O\big((λK)^{1/3}T^{2/3} + \sqrt{KT}\big)$, where $T$ is the time horizon and $K$ is the number of arms. In the stochastically constrained adversarial regime, which includes the stochastic regime as a special case, it achieves a regret bound of $O\left(\big((λK)^{2/3} T^{1/3} + \ln T\big)\sum_{i \neq i^*} Δ_i^{-1}\right)$, where $Δ_i$ are the suboptimality gaps and $i^*$ is a unique optimal arm. In the special case of $λ= 0$ (no switching costs), both bounds are minimax optimal within constants. We also explore variants of the problem, where switching cost is allowed to change over time. We provide experimental evaluation showing competitiveness of our algorithm with the relevant baselines in the stochastic, stochastically constrained adversarial, and adversarial regimes with fixed switching cost. △ Less

Submitted 19 February, 2021; originally announced February 2021.

arXiv:2102.08754 [pdf, ps, other]

doi 10.1145/3465456.3467645

A Regret Analysis of Bilateral Trade

Authors: Nicolò Cesa-Bianchi, Tommaso Cesari, Roberto Colomboni, Federico Fusco, Stefano Leonardi

Abstract: Bilateral trade, a fundamental topic in economics, models the problem of intermediating between two strategic agents, a seller and a buyer, willing to trade a good for which they hold private valuations. Despite the simplicity of this problem, a classical result by Myerson and Satterthwaite (1983) affirms the impossibility of designing a mechanism which is simultaneously efficient, incentive compa… ▽ More Bilateral trade, a fundamental topic in economics, models the problem of intermediating between two strategic agents, a seller and a buyer, willing to trade a good for which they hold private valuations. Despite the simplicity of this problem, a classical result by Myerson and Satterthwaite (1983) affirms the impossibility of designing a mechanism which is simultaneously efficient, incentive compatible, individually rational, and budget balanced. This impossibility result fostered an intense investigation of meaningful trade-offs between these desired properties. Much work has focused on approximately efficient fixed-price mechanisms, i.e., Blumrosen and Dobzinski (2014; 2016), Colini-Baldeschi et al. (2016), which have been shown to fully characterize strong budget balanced and ex-post individually rational direct revelation mechanisms. All these results, however, either assume some knowledge on the priors of the seller/buyer valuations, or a black box access to some samples of the distributions, as in D{ü}tting et al. (2021). In this paper, we cast for the first time the bilateral trade problem in a regret minimization framework over rounds of seller/buyer interactions, with no prior knowledge on the private seller/buyer valuations. Our main contribution is a complete characterization of the regret regimes for fixed-price mechanisms with different models of feedback and private valuations, using as benchmark the best fixed price in hindsight. More precisely, we prove the following bounds on the regret: $\bullet$ $\widetildeΘ(\sqrt{T})$ for full-feedback (i.e., direct revelation mechanisms); $\bullet$ $\widetildeΘ(T^{2/3})$ for realistic feedback (i.e., posted-price mechanisms) and independent seller/buyer valuations with bounded densities; $\bullet$ $Θ(T)$ for realistic feedback and seller/buyer valuations with bounded densities; $\bullet$ $Θ(T)$ for realistic feedback and independent seller/buyer valuations; $\bullet$ $Θ(T)$ for the adversarial setting. △ Less

Submitted 16 February, 2021; originally announced February 2021.

Journal ref: EC '21: Proceedings of the 22nd ACM Conference on Economics and Computation (2021))

arXiv:2102.00504 [pdf, other]

Exact Recovery of Clusters in Finite Metric Spaces Using Oracle Queries

Authors: Marco Bressan, Nicolò Cesa-Bianchi, Silvio Lattanzi, Andrea Paudice

Abstract: We investigate the problem of exact cluster recovery using oracle queries. Previous results show that clusters in Euclidean spaces that are convex and separated with a margin can be reconstructed exactly using only $O(\log n)$ same-cluster queries, where $n$ is the number of input points. In this work, we study this problem in the more challenging non-convex setting. We introduce a structural char… ▽ More We investigate the problem of exact cluster recovery using oracle queries. Previous results show that clusters in Euclidean spaces that are convex and separated with a margin can be reconstructed exactly using only $O(\log n)$ same-cluster queries, where $n$ is the number of input points. In this work, we study this problem in the more challenging non-convex setting. We introduce a structural characterization of clusters, called $(β,γ)$-convexity, that can be applied to any finite set of points equipped with a metric (or even a semimetric, as the triangle inequality is not needed). Using $(β,γ)$-convexity, we can translate natural density properties of clusters (which include, for instance, clusters that are strongly non-convex in $\mathbb{R}^d$) into a graph-theoretic notion of convexity. By exploiting this convexity notion, we design a deterministic algorithm that recovers $(β,γ)$-convex clusters using $O(k^2 \log n + k^2 (6/βγ)^{dens(X)})$ same-cluster queries, where $k$ is the number of clusters and $dens(X)$ is the density dimension of the semimetric. We show that an exponential dependence on the density dimension is necessary, and we also show that, if we are allowed to make $O(k^2 + k\log n)$ additional queries to a "cluster separation" oracle, then we can recover clusters that have different and arbitrary scales, even when the scale of each cluster is unknown. △ Less

Submitted 13 July, 2021; v1 submitted 31 January, 2021; originally announced February 2021.

Comments: Accepted for presentation at the Conference on Learning Theory (COLT) 2021

arXiv:2006.04675 [pdf, other]

Exact Recovery of Mangled Clusters with Same-Cluster Queries

Authors: Marco Bressan, Nicolò Cesa-Bianchi, Silvio Lattanzi, Andrea Paudice

Abstract: We study the cluster recovery problem in the semi-supervised active clustering framework. Given a finite set of input points, and an oracle revealing whether any two points lie in the same cluster, our goal is to recover all clusters exactly using as few queries as possible. To this end, we relax the spherical $k$-means cluster assumption of Ashtiani et al.\ to allow for arbitrary ellipsoidal clus… ▽ More We study the cluster recovery problem in the semi-supervised active clustering framework. Given a finite set of input points, and an oracle revealing whether any two points lie in the same cluster, our goal is to recover all clusters exactly using as few queries as possible. To this end, we relax the spherical $k$-means cluster assumption of Ashtiani et al.\ to allow for arbitrary ellipsoidal clusters with margin. This removes the assumption that the clustering is center-based (i.e., defined through an optimization problem), and includes all those cases where spherical clusters are individually transformed by any combination of rotations, axis scalings, and point deletions. We show that, even in this much more general setting, it is still possible to recover the latent clustering exactly using a number of queries that scales only logarithmically with the number of input points. More precisely, we design an algorithm that, given $n$ points to be partitioned into $k$ clusters, uses $O(k^3 \ln k \ln n)$ oracle queries and $\tilde{O}(kn + k^3)$ time to recover the clustering with zero misclassification error. The $O(\cdot)$ notation hides an exponential dependence on the dimensionality of the clusters, which we show to be necessary thus characterizing the query complexity of the problem. Our algorithm is simple, easy to implement, and can also learn the clusters using low-stretch separators, a class of ellipsoids with additional theoretical guarantees. Experiments on large synthetic datasets confirm that we can reconstruct clusterings exactly and efficiently. △ Less

Submitted 30 October, 2020; v1 submitted 8 June, 2020; originally announced June 2020.

Comments: To appear at NeurIPS 2020 (oral)

arXiv:2002.01882 [pdf, other]

Locally-Adaptive Nonparametric Online Learning

Authors: Ilja Kuzborskij, Nicolò Cesa-Bianchi

Abstract: One of the main strengths of online algorithms is their ability to adapt to arbitrary data sequences. This is especially important in nonparametric settings, where performance is measured against rich classes of comparator functions that are able to fit complex environments. Although such hard comparators and complex environments may exhibit local regularities, efficient algorithms, which can prov… ▽ More One of the main strengths of online algorithms is their ability to adapt to arbitrary data sequences. This is especially important in nonparametric settings, where performance is measured against rich classes of comparator functions that are able to fit complex environments. Although such hard comparators and complex environments may exhibit local regularities, efficient algorithms, which can provably take advantage of these local patterns, are hardly known. We fill this gap by introducing efficient online algorithms (based on a single versatile master algorithm) each adapting to one of the following regularities: (i) local Lipschitzness of the competitor function, (ii) local metric dimension of the instance sequence, (iii) local performance of the predictor across different regions of the instance space. Extending previous approaches, we design algorithms that dynamically grow hierarchical $ε$-nets on the instance space whose prunings correspond to different "locality profiles" for the problem at hand. Using a technique based on tree experts, we simultaneously and efficiently compete against all such prunings, and prove regret bounds each scaling with a quantity associated with a different type of local regularity. When competing against "simple" locality profiles, our technique delivers regret bounds that are significantly better than those proven using the previous approach. On the other hand, the time dependence of our bounds is not worse than that obtained by ignoring any local regularities. △ Less

Submitted 1 November, 2020; v1 submitted 5 February, 2020; originally announced February 2020.

arXiv:1910.02757 [pdf, other]

Stochastic Bandits with Delay-Dependent Payoffs

Authors: Leonardo Cella, Nicolò Cesa-Bianchi

Abstract: Motivated by recommendation problems in music streaming platforms, we propose a nonstationary stochastic bandit model in which the expected reward of an arm depends on the number of rounds that have passed since the arm was last pulled. After proving that finding an optimal policy is NP-hard even when all model parameters are known, we introduce a class of ranking policies provably approximating,… ▽ More Motivated by recommendation problems in music streaming platforms, we propose a nonstationary stochastic bandit model in which the expected reward of an arm depends on the number of rounds that have passed since the arm was last pulled. After proving that finding an optimal policy is NP-hard even when all model parameters are known, we introduce a class of ranking policies provably approximating, to within a constant factor, the expected reward of the optimal policy. We show an algorithm whose regret with respect to the best ranking policy is bounded by $\widetilde{\mathcal{O}}\big(\!\sqrt{kT}\big)$, where $k$ is the number of arms and $T$ is time. Our algorithm uses only $\mathcal{O}\big(k\ln\ln T\big)$ switches, which helps when switching between policies is costly. As constructing the class of learning policies requires ordering the arms according to their expectations, we also bound the number of pulls required to do so. Finally, we run experiments to compare our algorithm against UCB on different problem instances. △ Less

Submitted 19 February, 2020; v1 submitted 7 October, 2019; originally announced October 2019.

arXiv:1906.00670 [pdf, other]

Nonstochastic Multiarmed Bandits with Unrestricted Delays

Authors: Tobias Sommer Thune, Nicolò Cesa-Bianchi, Yevgeny Seldin

Abstract: We investigate multiarmed bandits with delayed feedback, where the delays need neither be identical nor bounded. We first prove that "delayed" Exp3 achieves the $O(\sqrt{(KT + D)\ln K} )$ regret bound conjectured by Cesa-Bianchi et al. [2019] in the case of variable, but bounded delays. Here, $K$ is the number of actions and $D$ is the total delay over $T$ rounds. We then introduce a new algorithm… ▽ More We investigate multiarmed bandits with delayed feedback, where the delays need neither be identical nor bounded. We first prove that "delayed" Exp3 achieves the $O(\sqrt{(KT + D)\ln K} )$ regret bound conjectured by Cesa-Bianchi et al. [2019] in the case of variable, but bounded delays. Here, $K$ is the number of actions and $D$ is the total delay over $T$ rounds. We then introduce a new algorithm that lifts the requirement of bounded delays by using a wrapper that skips rounds with excessively large delays. The new algorithm maintains the same regret bound, but similar to its predecessor requires prior knowledge of $D$ and $T$. For this algorithm we then construct a novel doubling scheme that forgoes the prior knowledge requirement under the assumption that the delays are available at action time (rather than at loss observation time). This assumption is satisfied in a broad range of applications, including interaction with servers and service providers. The resulting oracle regret bound is of order $\min_β(|S_β|+β\ln K + (KT + D_β)/β)$, where $|S_β|$ is the number of observations with delay exceeding $β$, and $D_β$ is the total delay of observations with delay below $β$. The bound relaxes to $O (\sqrt{(KT + D)\ln K} )$, but we also provide examples where $D_β\ll D$ and the oracle bound has a polynomially better dependence on the problem parameters. △ Less

Submitted 19 November, 2019; v1 submitted 3 June, 2019; originally announced June 2019.

Comments: 9 pages, Neurips camera ready

arXiv:1905.11902 [pdf, other]

Correlation Clustering with Adaptive Similarity Queries

Authors: Marco Bressan, Nicolò Cesa-Bianchi, Andrea Paudice, Fabio Vitale

Abstract: In correlation clustering, we are given $n$ objects together with a binary similarity score between each pair of them. The goal is to partition the objects into clusters so to minimise the disagreements with the scores. In this work we investigate correlation clustering as an active learning problem: each similarity score can be learned by making a query, and the goal is to minimise both the disag… ▽ More In correlation clustering, we are given $n$ objects together with a binary similarity score between each pair of them. The goal is to partition the objects into clusters so to minimise the disagreements with the scores. In this work we investigate correlation clustering as an active learning problem: each similarity score can be learned by making a query, and the goal is to minimise both the disagreements and the total number of queries. On the one hand, we describe simple active learning algorithms, which provably achieve an almost optimal trade-off while giving cluster recovery guarantees, and we test them on different datasets. On the other hand, we prove information-theoretical bounds on the number of queries necessary to guarantee a prescribed disagreement bound. These results give a rich characterization of the trade-off between queries and clustering error. △ Less

Submitted 14 January, 2020; v1 submitted 28 May, 2019; originally announced May 2019.

arXiv:1905.11797 [pdf, ps, other]

ROI Maximization in Stochastic Online Decision-Making

Authors: Nicolò Cesa-Bianchi, Tommaso Cesari, Yishay Mansour, Vianney Perchet

Abstract: We introduce a novel theoretical framework for Return On Investment (ROI) maximization in repeated decision-making. Our setting is motivated by the use case of companies that regularly receive proposals for technological innovations and want to quickly decide whether they are worth implementing. We design an algorithm for learning ROI-maximizing decision-making policies over a sequence of innovati… ▽ More We introduce a novel theoretical framework for Return On Investment (ROI) maximization in repeated decision-making. Our setting is motivated by the use case of companies that regularly receive proposals for technological innovations and want to quickly decide whether they are worth implementing. We design an algorithm for learning ROI-maximizing decision-making policies over a sequence of innovation proposals. Our algorithm provably converges to an optimal policy in class $Π$ at a rate of order $\min\big\{1/(NΔ^2),N^{-1/3}\}$, where $N$ is the number of innovations and $Δ$ is the suboptimality gap in $Π$. A significant hurdle of our formulation, which sets it aside from other online learning problems such as bandits, is that running a policy does not provide an unbiased estimate of its performance. △ Less

Submitted 22 December, 2021; v1 submitted 28 May, 2019; originally announced May 2019.

arXiv:1902.01846 [pdf, other]

Distribution-Dependent Analysis of Gibbs-ERM Principle

Authors: Ilja Kuzborskij, Nicolò Cesa-Bianchi, Csaba Szepesvári

Abstract: Gibbs-ERM learning is a natural idealized model of learning with stochastic optimization algorithms (such as Stochastic Gradient Langevin Dynamics and ---to some extent--- Stochastic Gradient Descent), while it also arises in other contexts, including PAC-Bayesian theory, and sampling mechanisms. In this work we study the excess risk suffered by a Gibbs-ERM learner that uses non-convex, regularize… ▽ More Gibbs-ERM learning is a natural idealized model of learning with stochastic optimization algorithms (such as Stochastic Gradient Langevin Dynamics and ---to some extent--- Stochastic Gradient Descent), while it also arises in other contexts, including PAC-Bayesian theory, and sampling mechanisms. In this work we study the excess risk suffered by a Gibbs-ERM learner that uses non-convex, regularized empirical risk with the goal to understand the interplay between the data-generating distribution and learning in large hypothesis spaces. Our main results are distribution-dependent upper bounds on several notions of excess risk. We show that, in all cases, the distribution-dependent excess risk is essentially controlled by the effective dimension $\mathrm{tr}\left(\boldsymbol{H}^{\star} (\boldsymbol{H}^{\star} + λ\boldsymbol{I})^{-1}\right)$ of the problem, where $\boldsymbol{H}^{\star}$ is the Hessian matrix of the risk at a local minimum. This is a well-established notion of effective dimension appearing in several previous works, including the analyses of SGD and ridge regression, but ours is the first work that brings this dimension to the analysis of learning using Gibbs densities. The distribution-dependent view we advocate here improves upon earlier results of Raginsky et al. (2017), and can yield much tighter bounds depending on the interplay between the data-generating distribution and the loss function. The first part of our analysis focuses on the localized excess risk in the vicinity of a fixed local minimizer. This result is then extended to bounds on the global excess risk, by characterizing probabilities of local minima (and their complement) under Gibbs densities, a results which might be of independent interest. △ Less

Submitted 5 February, 2019; originally announced February 2019.

arXiv:1901.08082 [pdf, ps, other]

Cooperative Online Learning: Kee** your Neighbors Updated

Authors: Nicolò Cesa-Bianchi, Tommaso R. Cesari, Claire Monteleoni

Abstract: We study an asynchronous online learning setting with a network of agents. At each time step, some of the agents are activated, requested to make a prediction, and pay the corresponding loss. The loss function is then revealed to these agents and also to their neighbors in the network. Our results characterize how much knowing the network structure affects the regret as a function of the model of… ▽ More We study an asynchronous online learning setting with a network of agents. At each time step, some of the agents are activated, requested to make a prediction, and pay the corresponding loss. The loss function is then revealed to these agents and also to their neighbors in the network. Our results characterize how much knowing the network structure affects the regret as a function of the model of agent activations. When activations are stochastic, the optimal regret (up to constant factors) is shown to be of order $\sqrt{αT}$, where $T$ is the horizon and $α$ is the independence number of the network. We prove that the upper bound is achieved even when agents have no information about the network structure. When activations are adversarial the situation changes dramatically: if agents ignore the network structure, a $Ω(T)$ lower bound on the regret can be proven, showing that learning is impossible. However, when agents can choose to ignore some of their neighbors based on the knowledge of the network structure, we prove a $O(\sqrt{\overlineχ T})$ sublinear regret bound, where $\overlineχ \ge α$ is the clique-covering number of the network. △ Less

Submitted 15 January, 2020; v1 submitted 23 January, 2019; originally announced January 2019.

arXiv:1809.11033 [pdf, other]

Efficient Linear Bandits through Matrix Sketching

Authors: Ilja Kuzborskij, Leonardo Cella, Nicolò Cesa-Bianchi

Abstract: We prove that two popular linear contextual bandit algorithms, OFUL and Thompson Sampling, can be made efficient using Frequent Directions, a deterministic online sketching technique. More precisely, we show that a sketch of size $m$ allows a $\mathcal{O}(md)$ update time for both algorithms, as opposed to $Ω(d^2)$ required by their non-sketched versions in general (where $d$ is the dimension of c… ▽ More We prove that two popular linear contextual bandit algorithms, OFUL and Thompson Sampling, can be made efficient using Frequent Directions, a deterministic online sketching technique. More precisely, we show that a sketch of size $m$ allows a $\mathcal{O}(md)$ update time for both algorithms, as opposed to $Ω(d^2)$ required by their non-sketched versions in general (where $d$ is the dimension of context vectors). This computational speedup is accompanied by regret bounds of order $(1+\varepsilon_m)^{3/2}d\sqrt{T}$ for OFUL and of order $\big((1+\varepsilon_m)d\big)^{3/2}\sqrt{T}$ for Thompson Sampling, where $\varepsilon_m$ is bounded by the sum of the tail eigenvalues not covered by the sketch. In particular, when the selected contexts span a subspace of dimension at most $m$, our algorithms have a regret bound matching that of their slower, non-sketched counterparts. Experiments on real-world datasets corroborate our theoretical results. △ Less

Submitted 21 March, 2022; v1 submitted 28 September, 2018; originally announced September 2018.

arXiv:1807.03288 [pdf, ps, other]

Dynamic Pricing with Finitely Many Unknown Valuations

Authors: Nicolò Cesa-Bianchi, Tommaso Cesari, Vianney Perchet

Abstract: Motivated by posted price auctions where buyers are grouped in an unknown number of latent types characterized by their private values for the good on sale, we investigate revenue maximization in stochastic dynamic pricing when the distribution of buyers' private values is supported on an unknown set of points in [0,1] of unknown cardinality $K$. This setting can be viewed as an instance of a stoc… ▽ More Motivated by posted price auctions where buyers are grouped in an unknown number of latent types characterized by their private values for the good on sale, we investigate revenue maximization in stochastic dynamic pricing when the distribution of buyers' private values is supported on an unknown set of points in [0,1] of unknown cardinality $K$. This setting can be viewed as an instance of a stochastic $K$-armed bandit problem where the location of the arms (the $K$ unknown valuations) must be learned as well. In the distribution-free case, we prove that our setting is just as hard as $K$-armed stochastic bandits: no algorithm can achieve a regret significantly better than $\sqrt{KT}$, (where T is the time horizon); we present an efficient algorithm matching this lower bound up to logarithmic factors. In the distribution-dependent case, we show that for all $K>2$ our setting is strictly harder than $K$-armed stochastic bandits by proving that it is impossible to obtain regret bounds that grow logarithmically in time or slower. On the other hand, when a lower bound $γ>0$ on the smallest drop in the demand curve is known, we prove an upper bound on the regret of order $(1/Δ+(\log \log T)/γ^2)(K\log T)$. This is a significant improvement on previously known regret bounds for discontinuous demand curves, that are at best of order $(K^{12}/γ^8)\sqrt{T}$. When $K=2$ in the distribution-dependent case, the hardness of our setting reduces to that of a stochastic $2$-armed bandit: we prove that an upper bound of order $(\log T)/Δ$ (up to $\log\log$ factors) on the regret can be achieved with no information on the demand curve. Finally, we show a $O(\sqrt{T})$ upper bound on the regret for the setting in which the buyers' decisions are nonstochastic, and the regret is measured with respect to the best between two fixed valuations one of which is known to the seller. △ Less

Submitted 5 March, 2019; v1 submitted 9 July, 2018; originally announced July 2018.

arXiv:1805.07331 [pdf, other]

Positive and Unlabeled Learning through Negative Selection and Imbalance-aware Classification

Authors: Marco Frasca, Nicolò Cesa-Bianchi

Abstract: Motivated by applications in protein function prediction, we consider a challenging supervised classification setting in which positive labels are scarce and there are no explicit negative labels. The learning algorithm must thus select which unlabeled examples to use as negative training points, possibly ending up with an unbalanced learning problem. We address these issues by proposing an algori… ▽ More Motivated by applications in protein function prediction, we consider a challenging supervised classification setting in which positive labels are scarce and there are no explicit negative labels. The learning algorithm must thus select which unlabeled examples to use as negative training points, possibly ending up with an unbalanced learning problem. We address these issues by proposing an algorithm that combines active learning (for selecting negative examples) with imbalance-aware learning (for mitigating the label imbalance). In our experiments we observe that these two techniques operate synergistically, outperforming state-of-the-art methods on standard protein function prediction benchmarks. △ Less

Submitted 25 January, 2019; v1 submitted 18 May, 2018; originally announced May 2018.

arXiv:1705.10257 [pdf, ps, other]

Boltzmann Exploration Done Right

Authors: Nicolò Cesa-Bianchi, Claudio Gentile, Gábor Lugosi, Gergely Neu

Abstract: Boltzmann exploration is a classic strategy for sequential decision-making under uncertainty, and is one of the most standard tools in Reinforcement Learning (RL). Despite its widespread use, there is virtually no theoretical understanding about the limitations or the actual benefits of this exploration scheme. Does it drive exploration in a meaningful way? Is it prone to misidentifying the optima… ▽ More Boltzmann exploration is a classic strategy for sequential decision-making under uncertainty, and is one of the most standard tools in Reinforcement Learning (RL). Despite its widespread use, there is virtually no theoretical understanding about the limitations or the actual benefits of this exploration scheme. Does it drive exploration in a meaningful way? Is it prone to misidentifying the optimal actions or spending too much time exploring the suboptimal ones? What is the right tuning for the learning rate? In this paper, we address several of these questions in the classic setup of stochastic multi-armed bandits. One of our main results is showing that the Boltzmann exploration strategy with any monotone learning-rate sequence will induce suboptimal behavior. As a remedy, we offer a simple non-monotone schedule that guarantees near-optimal performance, albeit only when given prior access to key problem parameters that are typically not available in practical situations (like the time horizon $T$ and the suboptimality gap $Δ$). More importantly, we propose a novel variant that uses different learning rates for different arms, and achieves a distribution-dependent regret bound of order $\frac{K\log^2 T}Δ$ and a distribution-independent bound of order $\sqrt{KT}\log K$ without requiring such prior knowledge. To demonstrate the flexibility of our technique, we also propose a variant that guarantees the same performance bounds even if the rewards are heavy-tailed. △ Less

Submitted 7 November, 2017; v1 submitted 29 May, 2017; originally announced May 2017.

arXiv:1705.05091 [pdf, ps, other]

Bandit Regret Scaling with the Effective Loss Range

Authors: Nicolò Cesa-Bianchi, Ohad Shamir

Abstract: We study how the regret guarantees of nonstochastic multi-armed bandits can be improved, if the effective range of the losses in each round is small (e.g. the maximal difference between two losses in a given round). Despite a recent impossibility result, we show how this can be made possible under certain mild additional assumptions, such as availability of rough estimates of the losses, or advanc… ▽ More We study how the regret guarantees of nonstochastic multi-armed bandits can be improved, if the effective range of the losses in each round is small (e.g. the maximal difference between two losses in a given round). Despite a recent impossibility result, we show how this can be made possible under certain mild additional assumptions, such as availability of rough estimates of the losses, or advance knowledge of the loss of a single, possibly unspecified arm. Along the way, we develop a novel technique which might be of independent interest, to convert any multi-armed bandit algorithm with regret depending on the loss range, to an algorithm with regret depending only on the effective range, while avoiding predictably bad arms altogether. △ Less

Submitted 2 January, 2020; v1 submitted 15 May, 2017; originally announced May 2017.

Comments: The results in section 4 are incorrect as stated -- we have added an erratum at the beginning of the document. The results in the other sections are still valid. We thank Étienne de Montbrun for locating the error

arXiv:1702.08211 [pdf, ps, other]

Algorithmic Chaining and the Role of Partial Feedback in Online Nonparametric Learning

Authors: Nicolò Cesa-Bianchi, Pierre Gaillard, Claudio Gentile, Sébastien Gerchinovitz

Abstract: We investigate contextual online learning with nonparametric (Lipschitz) comparison classes under different assumptions on losses and feedback information. For full information feedback and Lipschitz losses, we design the first explicit algorithm achieving the minimax regret rate (up to log factors). In a partial feedback model motivated by second-price auctions, we obtain algorithms for Lipschitz… ▽ More We investigate contextual online learning with nonparametric (Lipschitz) comparison classes under different assumptions on losses and feedback information. For full information feedback and Lipschitz losses, we design the first explicit algorithm achieving the minimax regret rate (up to log factors). In a partial feedback model motivated by second-price auctions, we obtain algorithms for Lipschitz and semi-Lipschitz losses with regret bounds improving on the known bounds for standard bandit feedback. Our analysis combines novel results for contextual second-price auctions with a novel algorithmic approach based on chaining. When the context space is Euclidean, our chaining approach is efficient and delivers an even better regret bound. △ Less

Submitted 30 June, 2017; v1 submitted 27 February, 2017; originally announced February 2017.

Comments: This document is the full version of an extended abstract accepted for presentation at COLT 2017

arXiv:1604.02855 [pdf, other]

Active Learning for Online Recognition of Human Activities from Streaming Videos

Authors: Rocco De Rosa, Ilaria Gori, Fabio Cuzzolin, Barbara Caputo, Nicolò Cesa-Bianchi

Abstract: Recognising human activities from streaming videos poses unique challenges to learning algorithms: predictive models need to be scalable, incrementally trainable, and must remain bounded in size even when the data stream is arbitrarily long. Furthermore, as parameter tuning is problematic in a streaming setting, suitable approaches should be parameterless, and make no assumptions on what class lab… ▽ More Recognising human activities from streaming videos poses unique challenges to learning algorithms: predictive models need to be scalable, incrementally trainable, and must remain bounded in size even when the data stream is arbitrarily long. Furthermore, as parameter tuning is problematic in a streaming setting, suitable approaches should be parameterless, and make no assumptions on what class labels may occur in the stream. We present here an approach to the recognition of human actions from streaming data which meets all these requirements by: (1) incrementally learning a model which adaptively covers the feature space with simple local classifiers; (2) employing an active learning strategy to reduce annotation requests; (3) achieving promising accuracy within a fixed model size. Extensive experiments on standard benchmarks show that our approach is competitive with state-of-the-art non-incremental methods, and outperforms the existing active incremental baselines. △ Less

Submitted 11 April, 2016; originally announced April 2016.

arXiv:1508.04912 [pdf, other]

The ABACOC Algorithm: a Novel Approach for Nonparametric Classification of Data Streams

Authors: Rocco De Rosa, Francesco Orabona, Nicolò Cesa-Bianchi

Abstract: Stream mining poses unique challenges to machine learning: predictive models are required to be scalable, incrementally trainable, must remain bounded in size (even when the data stream is arbitrarily long), and be nonparametric in order to achieve high accuracy even in complex and dynamic environments. Moreover, the learning system must be parameterless ---traditional tuning methods are problemat… ▽ More Stream mining poses unique challenges to machine learning: predictive models are required to be scalable, incrementally trainable, must remain bounded in size (even when the data stream is arbitrarily long), and be nonparametric in order to achieve high accuracy even in complex and dynamic environments. Moreover, the learning system must be parameterless ---traditional tuning methods are problematic in streaming settings--- and avoid requiring prior knowledge of the number of distinct class labels occurring in the stream. In this paper, we introduce a new algorithmic approach for nonparametric learning in data streams. Our approach addresses all above mentioned challenges by learning a model that covers the input space using simple local classifiers. The distribution of these classifiers dynamically adapts to the local (unknown) complexity of the classification problem, thus achieving a good balance between model complexity and predictive accuracy. We design four variants of our approach of increasing adaptivity. By means of an extensive empirical evaluation against standard nonparametric baselines, we show state-of-the-art results in terms of accuracy versus model size. For the variant that imposes a strict bound on the model size, we show better performance against all other methods measured at the same model size value. Our empirical analysis is complemented by a theoretical performance guarantee which does not rely on any stochastic assumption on the source generating the stream. △ Less

Submitted 20 August, 2015; originally announced August 2015.

arXiv:1411.1158 [pdf, ps, other]

On the Complexity of Learning with Kernels

Authors: Nicolò Cesa-Bianchi, Yishay Mansour, Ohad Shamir

Abstract: A well-recognized limitation of kernel learning is the requirement to handle a kernel matrix, whose size is quadratic in the number of training examples. Many methods have been proposed to reduce this computational cost, mostly by using a subset of the kernel matrix entries, or some form of low-rank matrix approximation, or a random projection method. In this paper, we study lower bounds on the er… ▽ More A well-recognized limitation of kernel learning is the requirement to handle a kernel matrix, whose size is quadratic in the number of training examples. Many methods have been proposed to reduce this computational cost, mostly by using a subset of the kernel matrix entries, or some form of low-rank matrix approximation, or a random projection method. In this paper, we study lower bounds on the error attainable by such methods as a function of the number of entries observed in the kernel matrix or the rank of an approximate kernel matrix. We show that there are kernel learning problems where no such method will lead to non-trivial computational savings. Our results also quantify how the problem difficulty depends on parameters such as the nature of the loss function, the regularization parameter, the norm of the desired predictor, and the kernel matrix rank. Our results also suggest cases where more efficient kernel learning might be possible. △ Less

Submitted 5 November, 2014; originally announced November 2014.

arXiv:1409.8428 [pdf, other]

Nonstochastic Multi-Armed Bandits with Graph-Structured Feedback

Authors: Noga Alon, Nicolò Cesa-Bianchi, Claudio Gentile, Shie Mannor, Yishay Mansour, Ohad Shamir

Abstract: We present and study a partial-information model of online learning, where a decision maker repeatedly chooses from a finite set of actions, and observes some subset of the associated losses. This naturally models several situations where the losses of different actions are related, and knowing the loss of one action provides information on the loss of other actions. Moreover, it generalizes and i… ▽ More We present and study a partial-information model of online learning, where a decision maker repeatedly chooses from a finite set of actions, and observes some subset of the associated losses. This naturally models several situations where the losses of different actions are related, and knowing the loss of one action provides information on the loss of other actions. Moreover, it generalizes and interpolates between the well studied full-information setting (where all losses are revealed) and the bandit setting (where only the loss of the action chosen by the player is revealed). We provide several algorithms addressing different variants of our setting, and provide tight regret bounds depending on combinatorial properties of the information feedback structure. △ Less

Submitted 30 September, 2014; originally announced September 2014.

Comments: Preliminary versions of parts of this paper appeared in [1,20], and also as arXiv papers arXiv:1106.2436 and arXiv:1307.4564

arXiv:1307.4564 [pdf, ps, other]

From Bandits to Experts: A Tale of Domination and Independence

Authors: Noga Alon, Nicolò Cesa-Bianchi, Claudio Gentile, Yishay Mansour

Abstract: We consider the partial observability model for multi-armed bandits, introduced by Mannor and Shamir. Our main result is a characterization of regret in the directed observability model in terms of the dominating and independence numbers of the observability graph. We also show that in the undirected case, the learner can achieve optimal regret without even accessing the observability graph before… ▽ More We consider the partial observability model for multi-armed bandits, introduced by Mannor and Shamir. Our main result is a characterization of regret in the directed observability model in terms of the dominating and independence numbers of the observability graph. We also show that in the undirected case, the learner can achieve optimal regret without even accessing the observability graph before selecting an action. Both results are shown using variants of the Exp3 algorithm operating on the observability graph in a time-efficient manner. △ Less

Submitted 17 July, 2013; originally announced July 2013.

arXiv:1306.0811 [pdf, other]

A Gang of Bandits

Authors: Nicolò Cesa-Bianchi, Claudio Gentile, Giovanni Zappella

Abstract: Multi-armed bandit problems are receiving a great deal of attention because they adequately formalize the exploration-exploitation trade-offs arising in several industrially relevant applications, such as online advertisement and, more generally, recommendation systems. In many cases, however, these applications have a strong social component, whose integration in the bandit algorithm could lead t… ▽ More Multi-armed bandit problems are receiving a great deal of attention because they adequately formalize the exploration-exploitation trade-offs arising in several industrially relevant applications, such as online advertisement and, more generally, recommendation systems. In many cases, however, these applications have a strong social component, whose integration in the bandit algorithm could lead to a dramatic performance increase. For instance, we may want to serve content to a group of users by taking advantage of an underlying network of social relationships among them. In this paper, we introduce novel algorithmic approaches to the solution of such networked bandit problems. More specifically, we design and analyze a global strategy which allocates a bandit algorithm to each network node (user) and allows it to "share" signals (contexts and payoffs) with the neghboring nodes. We then derive two more scalable variants of this strategy based on different ways of clustering the graph nodes. We experimentally compare the algorithm and its variants to state-of-the-art methods for contextual bandits that do not use the relational information. Our experiments, carried out on synthetic and real-world datasets, show a marked increase in prediction performance obtained by exploiting the network structure. △ Less

Submitted 4 November, 2013; v1 submitted 4 June, 2013; originally announced June 2013.

Comments: NIPS 2013

arXiv:1302.4387 [pdf, ps, other]

Online Learning with Switching Costs and Other Adaptive Adversaries

Authors: Nicolo Cesa-Bianchi, Ofer Dekel, Ohad Shamir

Abstract: We study the power of different types of adaptive (nonoblivious) adversaries in the setting of prediction with expert advice, under both full-information and bandit feedback. We measure the player's performance using a new notion of regret, also known as policy regret, which better captures the adversary's adaptiveness to the player's behavior. In a setting where losses are allowed to drift, we ch… ▽ More We study the power of different types of adaptive (nonoblivious) adversaries in the setting of prediction with expert advice, under both full-information and bandit feedback. We measure the player's performance using a new notion of regret, also known as policy regret, which better captures the adversary's adaptiveness to the player's behavior. In a setting where losses are allowed to drift, we characterize ---in a nearly complete manner--- the power of adaptive adversaries with bounded memories and switching costs. In particular, we show that with switching costs, the attainable rate with bandit feedback is $\widetildeΘ(T^{2/3})$. Interestingly, this rate is significantly worse than the $Θ(\sqrt{T})$ rate attainable with switching costs in the full-information case. Via a novel reduction from experts to bandits, we also show that a bounded memory adversary can force $\widetildeΘ(T^{2/3})$ regret even in the full information case, proving that switching costs are easier to control than bounded memory adversaries. Our lower bounds rely on a new stochastic adversary strategy that generates loss processes with strong dependencies. △ Less

Submitted 1 June, 2013; v1 submitted 18 February, 2013; originally announced February 2013.

arXiv:1301.5112 [pdf, ps, other]

Active Learning on Trees and Graphs

Authors: Nicolo Cesa-Bianchi, Claudio Gentile, Fabio Vitale, Giovanni Zappella

Abstract: We investigate the problem of active learning on a given tree whose nodes are assigned binary labels in an adversarial way. Inspired by recent results by Guillory and Bilmes, we characterize (up to constant factors) the optimal placement of queries so to minimize the mistakes made on the non-queried nodes. Our query selection algorithm is extremely efficient, and the optimal number of mistakes on… ▽ More We investigate the problem of active learning on a given tree whose nodes are assigned binary labels in an adversarial way. Inspired by recent results by Guillory and Bilmes, we characterize (up to constant factors) the optimal placement of queries so to minimize the mistakes made on the non-queried nodes. Our query selection algorithm is extremely efficient, and the optimal number of mistakes on the non-queried nodes is achieved by a simple and efficient mincut classifier. Through a simple modification of the query selection algorithm we also show optimality (up to constant factors) with respect to the trade-off between number of queries and number of mistakes on non-queried nodes. By using spanning trees, our algorithms can be efficiently applied to general graphs, although the problem of finding optimal and efficient active learning algorithms for general graphs remains open. Towards this end, we provide a lower bound on the number of mistakes made on arbitrary graphs by any active learning algorithm using a number of queries which is up to a constant fraction of the graph size. △ Less

Submitted 22 January, 2013; originally announced January 2013.

arXiv:1301.4769 [pdf, other]

A Correlation Clustering Approach to Link Classification in Signed Networks -- Full Version --

Authors: Nicolo Cesa-Bianchi, Claudio Gentile, Fabio Vitale, Giovanni Zappella

Abstract: Motivated by social balance theory, we develop a theory of link classification in signed networks using the correlation clustering index as measure of label regularity. We derive learning bounds in terms of correlation clustering within three fundamental transductive learning settings: online, batch and active. Our main algorithmic contribution is in the active setting, where we introduce a new fa… ▽ More Motivated by social balance theory, we develop a theory of link classification in signed networks using the correlation clustering index as measure of label regularity. We derive learning bounds in terms of correlation clustering within three fundamental transductive learning settings: online, batch and active. Our main algorithmic contribution is in the active setting, where we introduce a new family of efficient link classifiers based on covering the input graph with small circuits. These are the first active algorithms for link classification with mistake bounds that hold for arbitrary signed networks. △ Less

Submitted 28 February, 2013; v1 submitted 21 January, 2013; originally announced January 2013.

arXiv:1301.4767 [pdf, other]

A Linear Time Active Learning Algorithm for Link Classification -- Full Version --

Authors: Nicolo Cesa-Bianchi, Claudio Gentile, Fabio Vitale, Giovanni Zappella

Abstract: We present very efficient active learning algorithms for link classification in signed networks. Our algorithms are motivated by a stochastic model in which edge labels are obtained through perturbations of a initial sign assignment consistent with a two-clustering of the nodes. We provide a theoretical analysis within this model, showing that we can achieve an optimal (to whithin a constant facto… ▽ More We present very efficient active learning algorithms for link classification in signed networks. Our algorithms are motivated by a stochastic model in which edge labels are obtained through perturbations of a initial sign assignment consistent with a two-clustering of the nodes. We provide a theoretical analysis within this model, showing that we can achieve an optimal (to whithin a constant factor) number of mistakes on any graph G = (V,E) such that |E| = Ω(|V|^{3/2}) by querying O(|V|^{3/2}) edge labels. More generally, we show an algorithm that achieves optimality to within a factor of O(k) by querying at most order of |V| + (|V|/k)^{3/2} edge labels. The running time of this algorithm is at most of order |E| + |V|\log|V|. △ Less

Submitted 28 February, 2013; v1 submitted 21 January, 2013; originally announced January 2013.

arXiv:1212.5637 [pdf, other]

Random Spanning Trees and the Prediction of Weighted Graphs

Authors: Nicolo' Cesa-Bianchi, Claudio Gentile, Fabio Vitale, Giovanni Zappella

Abstract: We investigate the problem of sequentially predicting the binary labels on the nodes of an arbitrary weighted graph. We show that, under a suitable parametrization of the problem, the optimal number of prediction mistakes can be characterized (up to logarithmic factors) by the cutsize of a random spanning tree of the graph. The cutsize is induced by the unknown adversarial labeling of the graph no… ▽ More We investigate the problem of sequentially predicting the binary labels on the nodes of an arbitrary weighted graph. We show that, under a suitable parametrization of the problem, the optimal number of prediction mistakes can be characterized (up to logarithmic factors) by the cutsize of a random spanning tree of the graph. The cutsize is induced by the unknown adversarial labeling of the graph nodes. In deriving our characterization, we obtain a simple randomized algorithm achieving in expectation the optimal mistake bound on any polynomially connected weighted graph. Our algorithm draws a random spanning tree of the original graph and then predicts the nodes of this tree in constant expected amortized time and linear space. Experiments on real-world datasets show that our method compares well to both global (Perceptron) and local (label propagation) methods, while being generally faster in practice. △ Less

Submitted 21 December, 2012; originally announced December 2012.

Comments: Appeared in ICML 2010

arXiv:1209.1727 [pdf, ps, other]

Bandits with heavy tail

Authors: Sébastien Bubeck, Nicolò Cesa-Bianchi, Gábor Lugosi

Abstract: The stochastic multi-armed bandit problem is well understood when the reward distributions are sub-Gaussian. In this paper we examine the bandit problem under the weaker assumption that the distributions have moments of order 1+ε, for some $ε\in (0,1]$. Surprisingly, moments of order 2 (i.e., finite variance) are sufficient to obtain regret bounds of the same order as under sub-Gaussian reward dis… ▽ More The stochastic multi-armed bandit problem is well understood when the reward distributions are sub-Gaussian. In this paper we examine the bandit problem under the weaker assumption that the distributions have moments of order 1+ε, for some $ε\in (0,1]$. Surprisingly, moments of order 2 (i.e., finite variance) are sufficient to obtain regret bounds of the same order as under sub-Gaussian reward distributions. In order to achieve such regret, we define sampling strategies based on refined estimators of the mean such as the truncated empirical mean, Catoni's M-estimator, and the median-of-means estimator. We also derive matching lower bounds that also show that the best achievable regret deteriorates when ε<1. △ Less

Submitted 8 September, 2012; originally announced September 2012.

arXiv:1204.5721 [pdf, ps, other]

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

Authors: Sébastien Bubeck, Nicolò Cesa-Bianchi

Abstract: Multi-armed bandit problems are the most basic examples of sequential decision problems with an exploration-exploitation trade-off. This is the balance between staying with the option that gave highest payoffs in the past and exploring new options that might give higher payoffs in the future. Although the study of bandit problems dates back to the Thirties, exploration-exploitation trade-offs aris… ▽ More Multi-armed bandit problems are the most basic examples of sequential decision problems with an exploration-exploitation trade-off. This is the balance between staying with the option that gave highest payoffs in the past and exploring new options that might give higher payoffs in the future. Although the study of bandit problems dates back to the Thirties, exploration-exploitation trade-offs arise in several modern applications, such as ad placement, website optimization, and packet routing. Mathematically, a multi-armed bandit is defined by the payoff process associated with each option. In this survey, we focus on two extreme cases in which the analysis of regret is particularly simple and elegant: i.i.d. payoffs and adversarial payoffs. Besides the basic setting of finitely many actions, we also analyze some of the most important variants and extensions, such as the contextual bandit model. △ Less

Submitted 3 November, 2012; v1 submitted 25 April, 2012; originally announced April 2012.

Comments: To appear in Foundations and Trends in Machine Learning

arXiv:1202.3323 [pdf, ps, other]

Mirror Descent Meets Fixed Share (and feels no regret)

Authors: Nicolò Cesa-Bianchi, Pierre Gaillard, Gabor Lugosi, Gilles Stoltz

Abstract: Mirror descent with an entropic regularizer is known to achieve shifting regret bounds that are logarithmic in the dimension. This is done using either a carefully designed projection or by a weight sharing technique. Via a novel unified analysis, we show that these two approaches deliver essentially equivalent bounds on a notion of regret generalizing shifting, adaptive, discounted, and other rel… ▽ More Mirror descent with an entropic regularizer is known to achieve shifting regret bounds that are logarithmic in the dimension. This is done using either a carefully designed projection or by a weight sharing technique. Via a novel unified analysis, we show that these two approaches deliver essentially equivalent bounds on a notion of regret generalizing shifting, adaptive, discounted, and other related regrets. Our analysis also captures and extends the generalized weight sharing technique of Bousquet and Warmuth, and can be refined in several ways, including improvements for small losses and adaptive tuning of parameters. △ Less

Submitted 27 September, 2012; v1 submitted 15 February, 2012; originally announced February 2012.

Journal ref: NIPS 2012, Lake Tahoe : United States (2012)

arXiv:1202.3079 [pdf, ps, other]

Towards minimax policies for online linear optimization with bandit feedback

Authors: Sébastien Bubeck, Nicolò Cesa-Bianchi, Sham M. Kakade

Abstract: We address the online linear optimization problem with bandit feedback. Our contribution is twofold. First, we provide an algorithm (based on exponential weights) with a regret of order $\sqrt{d n \log N}$ for any finite action set with $N$ actions, under the assumption that the instantaneous loss is bounded by 1. This shaves off an extraneous $\sqrt{d}$ factor compared to previous works, and give… ▽ More We address the online linear optimization problem with bandit feedback. Our contribution is twofold. First, we provide an algorithm (based on exponential weights) with a regret of order $\sqrt{d n \log N}$ for any finite action set with $N$ actions, under the assumption that the instantaneous loss is bounded by 1. This shaves off an extraneous $\sqrt{d}$ factor compared to previous works, and gives a regret bound of order $d \sqrt{n \log n}$ for any compact set of actions. Without further assumptions on the action set, this last bound is minimax optimal up to a logarithmic factor. Interestingly, our result also shows that the minimax regret for bandit linear optimization with expert advice in $d$ dimension is the same as for the basic $d$-armed bandit with expert advice. Our second contribution is to show how to use the Mirror Descent algorithm to obtain computationally efficient strategies with minimax optimal regret bounds in specific examples. More precisely we study two canonical action sets: the hypercube and the Euclidean ball. In the former case, we obtain the first computationally efficient algorithm with a $d \sqrt{n}$ regret, thus improving by a factor $\sqrt{d \log n}$ over the best known result for a computationally efficient algorithm. In the latter case, our approach gives the first algorithm with a $\sqrt{d n \log n}$ regret, again shaving off an extraneous $\sqrt{d}$ compared to previous works. △ Less

Submitted 14 February, 2012; originally announced February 2012.

arXiv:1110.6886 [pdf, other]

PAC-Bayesian Inequalities for Martingales

Authors: Yevgeny Seldin, François Laviolette, Nicolò Cesa-Bianchi, John Shawe-Taylor, Peter Auer

Abstract: We present a set of high-probability inequalities that control the concentration of weighted averages of multiple (possibly uncountably many) simultaneously evolving and interdependent martingales. Our results extend the PAC-Bayesian analysis in learning theory from the i.i.d. setting to martingales opening the way for its application to importance weighted sampling, reinforcement learning, and ot… ▽ More We present a set of high-probability inequalities that control the concentration of weighted averages of multiple (possibly uncountably many) simultaneously evolving and interdependent martingales. Our results extend the PAC-Bayesian analysis in learning theory from the i.i.d. setting to martingales opening the way for its application to importance weighted sampling, reinforcement learning, and other interactive learning domains, as well as many other domains in probability theory and statistics, where martingales are encountered. We also present a comparison inequality that bounds the expectation of a convex function of a martingale difference sequence shifted to the [0,1] interval by the expectation of the same function of independent Bernoulli variables. This inequality is applied to derive a tighter analog of Hoeffding-Azuma's inequality. △ Less

Submitted 30 July, 2012; v1 submitted 31 October, 2011; originally announced October 2011.

arXiv:1110.4322

An Optimal Algorithm for Linear Bandits

Authors: Nicolò Cesa-Bianchi, Sham Kakade

Abstract: We provide the first algorithm for online bandit linear optimization whose regret after T rounds is of order sqrt{Td ln N} on any finite class X of N actions in d dimensions, and of order d*sqrt{T} (up to log factors) when X is infinite. These bounds are not improvable in general. The basic idea utilizes tools from convex geometry to construct what is essentially an optimal exploration basis. We a… ▽ More We provide the first algorithm for online bandit linear optimization whose regret after T rounds is of order sqrt{Td ln N} on any finite class X of N actions in d dimensions, and of order d*sqrt{T} (up to log factors) when X is infinite. These bounds are not improvable in general. The basic idea utilizes tools from convex geometry to construct what is essentially an optimal exploration basis. We also present an application to a model of linear bandits with expert advice. Interestingly, these results show that bandit linear optimization with expert advice in d dimensions is no more difficult (in terms of the achievable regret) than the online d-armed bandit problem with expert advice (where EXP4 is optimal). △ Less

Submitted 14 February, 2012; v1 submitted 19 October, 2011; originally announced October 2011.

Comments: This paper is superseded by S. Bubeck, N. Cesa-Bianchi, and S.M. Kakade, "Towards minimax policies for online linear optimization with bandit feedback"

arXiv:1106.2429 [pdf, ps, other]

Efficient Transductive Online Learning via Randomized Rounding

Authors: Nicolò Cesa-Bianchi, Ohad Shamir

Abstract: Most traditional online learning algorithms are based on variants of mirror descent or follow-the-leader. In this paper, we present an online algorithm based on a completely different approach, tailored for transductive settings, which combines "random playout" and randomized rounding of loss subgradients. As an application of our approach, we present the first computationally efficient online alg… ▽ More Most traditional online learning algorithms are based on variants of mirror descent or follow-the-leader. In this paper, we present an online algorithm based on a completely different approach, tailored for transductive settings, which combines "random playout" and randomized rounding of loss subgradients. As an application of our approach, we present the first computationally efficient online algorithm for collaborative filtering with trace-norm constrained matrices. As a second application, we solve an open question linking batch learning and transductive online learning △ Less

Submitted 11 September, 2013; v1 submitted 13 June, 2011; originally announced June 2011.

Comments: To appear in a Festschrift in honor of V.N. Vapnik. Preliminary version presented in NIPS 2011

arXiv:1105.4585 [pdf, ps, other]

PAC-Bayesian Analysis of the Exploration-Exploitation Trade-off

Authors: Yevgeny Seldin, Nicolò Cesa-Bianchi, François Laviolette, Peter Auer, John Shawe-Taylor, Jan Peters

Abstract: We develop a coherent framework for integrative simultaneous analysis of the exploration-exploitation and model order selection trade-offs. We improve over our preceding results on the same subject (Seldin et al., 2011) by combining PAC-Bayesian analysis with Bernstein-type inequality for martingales. Such a combination is also of independent interest for studies of multiple simultaneously evolvin… ▽ More We develop a coherent framework for integrative simultaneous analysis of the exploration-exploitation and model order selection trade-offs. We improve over our preceding results on the same subject (Seldin et al., 2011) by combining PAC-Bayesian analysis with Bernstein-type inequality for martingales. Such a combination is also of independent interest for studies of multiple simultaneously evolving martingales. △ Less

Submitted 23 May, 2011; originally announced May 2011.

Comments: On-line Trading of Exploration and Exploitation 2 - ICML-2011 workshop. http://explo.cs.ucl.ac.uk/workshop/

Showing 1–47 of 47 results for author: Cesa-Bianchi, N