Search | arXiv e-print repository

Improved Algorithms for Contextual Dynamic Pricing

Authors: Matilde Tullii, Solenne Gaucher, Nadav Merlis, Vianney Perchet

Abstract: In contextual dynamic pricing, a seller sequentially prices goods based on contextual information. Buyers will purchase products only if the prices are below their valuations. The goal of the seller is to design a pricing strategy that collects as much revenue as possible. We focus on two different valuation models. The first assumes that valuations linearly depend on the context and are further d… ▽ More In contextual dynamic pricing, a seller sequentially prices goods based on contextual information. Buyers will purchase products only if the prices are below their valuations. The goal of the seller is to design a pricing strategy that collects as much revenue as possible. We focus on two different valuation models. The first assumes that valuations linearly depend on the context and are further distorted by noise. Under minor regularity assumptions, our algorithm achieves an optimal regret bound of $\tilde{\mathcal{O}}(T^{2/3})$, improving the existing results. The second model removes the linearity assumption, requiring only that the expected buyer valuation is $β$-Hölder in the context. For this model, our algorithm obtains a regret $\tilde{\mathcal{O}}(T^{d+2β/d+3β})$, where $d$ is the dimension of the context space. △ Less

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2403.11637 [pdf, other]

The Value of Reward Lookahead in Reinforcement Learning

Authors: Nadav Merlis, Dorian Baudry, Vianney Perchet

Abstract: In reinforcement learning (RL), agents sequentially interact with changing environments while aiming to maximize the obtained rewards. Usually, rewards are observed only after acting, and so the goal is to maximize the expected cumulative reward. Yet, in many practical settings, reward information is observed in advance -- prices are observed before performing transactions; nearby traffic informat… ▽ More In reinforcement learning (RL), agents sequentially interact with changing environments while aiming to maximize the obtained rewards. Usually, rewards are observed only after acting, and so the goal is to maximize the expected cumulative reward. Yet, in many practical settings, reward information is observed in advance -- prices are observed before performing transactions; nearby traffic information is partially known; and goals are oftentimes given to agents prior to the interaction. In this work, we aim to quantifiably analyze the value of such future reward information through the lens of competitive analysis. In particular, we measure the ratio between the value of standard RL agents and that of agents with partial future-reward lookahead. We characterize the worst-case reward distribution and derive exact ratios for the worst-case reward expectations. Surprisingly, the resulting ratios relate to known quantities in offline RL and reward-free exploration. We further provide tight bounds for the ratio given the worst-case dynamics. Our results cover the full spectrum between observing the immediate rewards before acting to observing all the rewards before the interaction starts. △ Less

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2402.13079 [pdf, ps, other]

Mode Estimation with Partial Feedback

Authors: Charles Arnal, Vivien Cabannes, Vianney Perchet

Abstract: The combination of lightly supervised pre-training and online fine-tuning has played a key role in recent AI developments. These new learning pipelines call for new theoretical frameworks. In this paper, we formalize core aspects of weakly supervised and active learning with a simple problem: the estimation of the mode of a distribution using partial feedback. We show how entropy coding allows for… ▽ More The combination of lightly supervised pre-training and online fine-tuning has played a key role in recent AI developments. These new learning pipelines call for new theoretical frameworks. In this paper, we formalize core aspects of weakly supervised and active learning with a simple problem: the estimation of the mode of a distribution using partial feedback. We show how entropy coding allows for optimal information acquisition from partial feedback, develop coarse sufficient statistics for mode identification, and adapt bandit algorithms to our new setting. Finally, we combine those contributions into a statistically and computationally efficient solution to our problem. △ Less

Submitted 20 February, 2024; originally announced February 2024.

MSC Class: 62L05; 62B86; 62D10; 62B10

arXiv:2309.00656 [pdf, other]

Local and adaptive mirror descents in extensive-form games

Authors: Côme Fiegel, Pierre Ménard, Tadashi Kozuno, Rémi Munos, Vianney Perchet, Michal Valko

Abstract: We study how to learn $ε$-optimal strategies in zero-sum imperfect information games (IIG) with trajectory feedback. In this setting, players update their policies sequentially based on their observations over a fixed number of episodes, denoted by $T$. Existing procedures suffer from high variance due to the use of importance sampling over sequences of actions (Steinberger et al., 2020; McAleer e… ▽ More We study how to learn $ε$-optimal strategies in zero-sum imperfect information games (IIG) with trajectory feedback. In this setting, players update their policies sequentially based on their observations over a fixed number of episodes, denoted by $T$. Existing procedures suffer from high variance due to the use of importance sampling over sequences of actions (Steinberger et al., 2020; McAleer et al., 2022). To reduce this variance, we consider a fixed sampling approach, where players still update their policies over time, but with observations obtained through a given fixed sampling policy. Our approach is based on an adaptive Online Mirror Descent (OMD) algorithm that applies OMD locally to each information set, using individually decreasing learning rates and a regularized loss. We show that this approach guarantees a convergence rate of $\tilde{\mathcal{O}}(T^{-1/2})$ with high probability and has a near-optimal dependence on the game parameters when applied with the best theoretical choices of learning rates and sampling policies. To achieve these results, we generalize the notion of OMD stabilization, allowing for time-varying regularization with convex increments. △ Less

Submitted 1 September, 2023; originally announced September 2023.

arXiv:2306.02071 [pdf, other]

DU-Shapley: A Shapley Value Proxy for Efficient Dataset Valuation

Authors: Felipe Garrido-Lucero, Benjamin Heymann, Maxime Vono, Patrick Loiseau, Vianney Perchet

Abstract: We consider the dataset valuation problem, that is, the problem of quantifying the incremental gain, to some relevant pre-defined utility of a machine learning task, of aggregating an individual dataset to others. The Shapley value is a natural tool to perform dataset valuation due to its formal axiomatic justification, which can be combined with Monte Carlo integration to overcome the computation… ▽ More We consider the dataset valuation problem, that is, the problem of quantifying the incremental gain, to some relevant pre-defined utility of a machine learning task, of aggregating an individual dataset to others. The Shapley value is a natural tool to perform dataset valuation due to its formal axiomatic justification, which can be combined with Monte Carlo integration to overcome the computational tractability challenges. Such generic approximation methods, however, remain expensive in some cases. In this paper, we exploit the knowledge about the structure of the dataset valuation problem to devise more efficient Shapley value estimators. We propose a novel approximation, referred to as discrete uniform Shapley, which is expressed as an expectation under a discrete uniform distribution with support of reasonable size. We justify the relevancy of the proposed framework via asymptotic and non-asymptotic theoretical guarantees and illustrate its benefits via an extensive set of numerical experiments. △ Less

Submitted 17 June, 2024; v1 submitted 3 June, 2023; originally announced June 2023.

Comments: 22 pages

arXiv:2305.19691 [pdf, other]

Constant or logarithmic regret in asynchronous multiplayer bandits

Authors: Hugo Richard, Etienne Boursier, Vianney Perchet

Abstract: Multiplayer bandits have recently been extensively studied because of their application to cognitive radio networks. While the literature mostly considers synchronous players, radio networks (e.g. for IoT) tend to have asynchronous devices. This motivates the harder, asynchronous multiplayer bandits problem, which was first tackled with an explore-then-commit (ETC) algorithm (see Dakdouk, 2022),… ▽ More Multiplayer bandits have recently been extensively studied because of their application to cognitive radio networks. While the literature mostly considers synchronous players, radio networks (e.g. for IoT) tend to have asynchronous devices. This motivates the harder, asynchronous multiplayer bandits problem, which was first tackled with an explore-then-commit (ETC) algorithm (see Dakdouk, 2022), with a regret upper-bound in $\mathcal{O}(T^{\frac{2}{3}})$. Before even considering decentralization, understanding the centralized case was still a challenge as it was unknown whether getting a regret smaller than $Ω(T^{\frac{2}{3}})$ was possible. We answer positively this question, as a natural extension of UCB exhibits a $\mathcal{O}(\sqrt{T\log(T)})$ minimax regret. More importantly, we introduce Cautious Greedy, a centralized algorithm that yields constant instance-dependent regret if the optimal policy assigns at least one player on each arm (a situation that is proved to occur when arm means are close enough). Otherwise, its regret increases as the sum of $\log(T)$ over some sub-optimality gaps. We provide lower bounds showing that Cautious Greedy is optimal in the data-dependent terms. Therefore, we set up a strong baseline for asynchronous multiplayer bandits and suggest that learning the optimal policy in this problem might be easier than thought, at least with centralization. △ Less

Submitted 31 May, 2023; originally announced May 2023.

arXiv:2212.12567 [pdf, other]

Adapting to game trees in zero-sum imperfect information games

Authors: Côme Fiegel, Pierre Ménard, Tadashi Kozuno, Rémi Munos, Vianney Perchet, Michal Valko

Abstract: Imperfect information games (IIG) are games in which each player only partially observes the current game state. We study how to learn $ε$-optimal strategies in a zero-sum IIG through self-play with trajectory feedback. We give a problem-independent lower bound $\widetilde{\mathcal{O}}(H(A_{\mathcal{X}}+B_{\mathcal{Y}})/ε^2)$ on the required number of realizations to learn these strategies with hi… ▽ More Imperfect information games (IIG) are games in which each player only partially observes the current game state. We study how to learn $ε$-optimal strategies in a zero-sum IIG through self-play with trajectory feedback. We give a problem-independent lower bound $\widetilde{\mathcal{O}}(H(A_{\mathcal{X}}+B_{\mathcal{Y}})/ε^2)$ on the required number of realizations to learn these strategies with high probability, where $H$ is the length of the game, $A_{\mathcal{X}}$ and $B_{\mathcal{Y}}$ are the total number of actions for the two players. We also propose two Follow the Regularized leader (FTRL) algorithms for this setting: Balanced FTRL which matches this lower bound, but requires the knowledge of the information set structure beforehand to define the regularization; and Adaptive FTRL which needs $\widetilde{\mathcal{O}}(H^2(A_{\mathcal{X}}+B_{\mathcal{Y}})/ε^2)$ realizations without this requirement by progressively adapting the regularization to the observations. △ Less

Submitted 15 February, 2023; v1 submitted 23 December, 2022; originally announced December 2022.

arXiv:2211.16275 [pdf, ps, other]

A survey on multi-player bandits

Authors: Etienne Boursier, Vianney Perchet

Abstract: Due mostly to its application to cognitive radio networks, multiplayer bandits gained a lot of interest in the last decade. A considerable progress has been made on its theoretical aspect. However, the current algorithms are far from applicable and many obstacles remain between these theoretical results and a possible implementation of multiplayer bandits algorithms in real cognitive radio network… ▽ More Due mostly to its application to cognitive radio networks, multiplayer bandits gained a lot of interest in the last decade. A considerable progress has been made on its theoretical aspect. However, the current algorithms are far from applicable and many obstacles remain between these theoretical results and a possible implementation of multiplayer bandits algorithms in real cognitive radio networks. This survey contextualizes and organizes the rich multiplayer bandits literature. In light of the existing works, some clear directions for future research appear. We believe that a further study of these different directions might lead to theoretical algorithms adapted to real-world situations. △ Less

Submitted 3 June, 2024; v1 submitted 29 November, 2022; originally announced November 2022.

Comments: final version, accepted at JMLR

arXiv:2210.12882 [pdf, other]

Stochastic Mirror Descent for Large-Scale Sparse Recovery

Authors: Sasila Ilandarideva, Yannis Bekri, Anatoli Juditsky, Vianney Perchet

Abstract: In this paper we discuss an application of Stochastic Approximation to statistical estimation of high-dimensional sparse parameters. The proposed solution reduces to resolving a penalized stochastic optimization problem on each stage of a multistage algorithm; each problem being solved to a prescribed accuracy by the non-Euclidean Composite Stochastic Mirror Descent (CSMD) algorithm. Assuming that… ▽ More In this paper we discuss an application of Stochastic Approximation to statistical estimation of high-dimensional sparse parameters. The proposed solution reduces to resolving a penalized stochastic optimization problem on each stage of a multistage algorithm; each problem being solved to a prescribed accuracy by the non-Euclidean Composite Stochastic Mirror Descent (CSMD) algorithm. Assuming that the problem objective is smooth and quadratically minorated and stochastic perturbations are sub-Gaussian, our analysis prescribes the method parameters which ensure fast convergence of the estimation error (the radius of a confidence ball of a given norm around the approximate solution). This convergence is linear during the first "preliminary" phase of the routine and is sublinear during the second "asymptotic" phase. We consider an application of the proposed approach to sparse Generalized Linear Regression problem. In this setting, we show that the proposed algorithm attains the optimal convergence of the estimation error under weak assumptions on the regressor distribution. We also present a numerical study illustrating the performance of the algorithm on high-dimensional simulation data. △ Less

Submitted 23 October, 2022; originally announced October 2022.

arXiv:2205.15695 [pdf, other]

On Preemption and Learning in Stochastic Scheduling

Authors: Nadav Merlis, Hugo Richard, Flore Sentenac, Corentin Odic, Mathieu Molina, Vianney Perchet

Abstract: We study single-machine scheduling of jobs, each belonging to a job type that determines its duration distribution. We start by analyzing the scenario where the type characteristics are known and then move to two learning scenarios where the types are unknown: non-preemptive problems, where each started job must be completed before moving to another job; and preemptive problems, where job executio… ▽ More We study single-machine scheduling of jobs, each belonging to a job type that determines its duration distribution. We start by analyzing the scenario where the type characteristics are known and then move to two learning scenarios where the types are unknown: non-preemptive problems, where each started job must be completed before moving to another job; and preemptive problems, where job execution can be paused in the favor of moving to a different job. In both cases, we design algorithms that achieve sublinear excess cost, compared to the performance with known types, and prove lower bounds for the non-preemptive case. Notably, we demonstrate, both theoretically and through simulations, how preemptive algorithms can greatly outperform non-preemptive ones when the durations of different job types are far from one another, a phenomenon that does not occur when the type durations are known. △ Less

Submitted 1 June, 2023; v1 submitted 31 May, 2022; originally announced May 2022.

Comments: Accepted to ICML 2023

arXiv:2205.13255 [pdf, other]

Active Labeling: Streaming Stochastic Gradients

Authors: Vivien Cabannes, Francis Bach, Vianney Perchet, Alessandro Rudi

Abstract: The workhorse of machine learning is stochastic gradient descent. To access stochastic gradients, it is common to consider iteratively input/output pairs of a training dataset. Interestingly, it appears that one does not need full supervision to access stochastic gradients, which is the main motivation of this paper. After formalizing the "active labeling" problem, which focuses on active learning… ▽ More The workhorse of machine learning is stochastic gradient descent. To access stochastic gradients, it is common to consider iteratively input/output pairs of a training dataset. Interestingly, it appears that one does not need full supervision to access stochastic gradients, which is the main motivation of this paper. After formalizing the "active labeling" problem, which focuses on active learning with partial supervision, we provide a streaming technique that provably minimizes the ratio of generalization error over the number of samples. We illustrate our technique in depth for robust regression. △ Less

Submitted 7 December, 2022; v1 submitted 26 May, 2022; originally announced May 2022.

Comments: 38 pages (9 main pages), 9 figures

MSC Class: 68T37 ACM Class: G.3

arXiv:2108.00230 [pdf, other]

Pure Exploration and Regret Minimization in Matching Bandits

Authors: Flore Sentenac, Jialin Yi, Clément Calauzènes, Vianney Perchet, Milan Vojnovic

Abstract: Finding an optimal matching in a weighted graph is a standard combinatorial problem. We consider its semi-bandit version where either a pair or a full matching is sampled sequentially. We prove that it is possible to leverage a rank-1 assumption on the adjacency matrix to reduce the sample complexity and the regret of off-the-shelf algorithms up to reaching a linear dependency in the number of ver… ▽ More Finding an optimal matching in a weighted graph is a standard combinatorial problem. We consider its semi-bandit version where either a pair or a full matching is sampled sequentially. We prove that it is possible to leverage a rank-1 assumption on the adjacency matrix to reduce the sample complexity and the regret of off-the-shelf algorithms up to reaching a linear dependency in the number of vertices (up to poly log terms). △ Less

Submitted 31 July, 2021; originally announced August 2021.

Journal ref: Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021

arXiv:2107.00995 [pdf, other]

Online Matching in Sparse Random Graphs: Non-Asymptotic Performances of Greedy Algorithm

Authors: Nathan Noiry, Flore Sentenac, Vianney Perchet

Abstract: Motivated by sequential budgeted allocation problems, we investigate online matching problems where connections between vertices are not i.i.d., but they have fixed degree distributions -- the so-called configuration model. We estimate the competitive ratio of the simplest algorithm, GREEDY, by approximating some relevant stochastic discrete processes by their continuous counterparts, that are sol… ▽ More Motivated by sequential budgeted allocation problems, we investigate online matching problems where connections between vertices are not i.i.d., but they have fixed degree distributions -- the so-called configuration model. We estimate the competitive ratio of the simplest algorithm, GREEDY, by approximating some relevant stochastic discrete processes by their continuous counterparts, that are solutions of an explicit system of partial differential equations. This technique gives precise bounds on the estimation errors, with arbitrarily high probability as the problem size increases. In particular, it allows the formal comparison between different configuration models. We also prove that, quite surprisingly, GREEDY can have better performance guarantees than RANKING, another celebrated algorithm for online matching that usually outperforms the former. △ Less

Submitted 2 July, 2021; originally announced July 2021.

arXiv:2106.05061 [pdf, other]

Quickest change detection with unknown parameters: Constant complexity and near optimality

Authors: Firas Jarboui, Viannet Perchet

Abstract: We consider the quickest change detection problem where both the parameters of pre- and post- change distributions are unknown, which prevents the use of classical simple hypothesis testing. Without additional assumptions, optimal solutions are not tractable as they rely on some minimax and robust variant of the objective. As a consequence, change points might be detected too late for practical ap… ▽ More We consider the quickest change detection problem where both the parameters of pre- and post- change distributions are unknown, which prevents the use of classical simple hypothesis testing. Without additional assumptions, optimal solutions are not tractable as they rely on some minimax and robust variant of the objective. As a consequence, change points might be detected too late for practical applications (in economics, health care or maintenance for instance). Available constant complexity techniques typically solve a relaxed version of the problem, deeply relying on very specific probability distributions and/or some very precise additional knowledge. We consider a totally different approach that leverages the theoretical asymptotic properties of optimal solutions to derive a new scalable approximate algorithm with near optimal performance that runs~in~$\mathcal{O}(1)$, adapted to even more complex Markovian settings. △ Less

Submitted 9 June, 2021; originally announced June 2021.

arXiv:2106.04228 [pdf, ps, other]

Decentralized Learning in Online Queuing Systems

Authors: Flore Sentenac, Etienne Boursier, Vianney Perchet

Abstract: Motivated by packet routing in computer networks, online queuing systems are composed of queues receiving packets at different rates. Repeatedly, they send packets to servers, each of them treating only at most one packet at a time. In the centralized case, the number of accumulated packets remains bounded (i.e., the system is \textit{stable}) as long as the ratio between service rates and arrival… ▽ More Motivated by packet routing in computer networks, online queuing systems are composed of queues receiving packets at different rates. Repeatedly, they send packets to servers, each of them treating only at most one packet at a time. In the centralized case, the number of accumulated packets remains bounded (i.e., the system is \textit{stable}) as long as the ratio between service rates and arrival rates is larger than $1$. In the decentralized case, individual no-regret strategies ensures stability when this ratio is larger than $2$. Yet, myopically minimizing regret disregards the long term effects due to the carryover of packets to further rounds. On the other hand, minimizing long term costs leads to stable Nash equilibria as soon as the ratio exceeds $\frac{e}{e-1}$. Stability with decentralized learning strategies with a ratio below $2$ was a major remaining question. We first argue that for ratios up to $2$, cooperation is required for stability of learning strategies, as selfish minimization of policy regret, a \textit{patient} notion of regret, might indeed still be unstable in this case. We therefore consider cooperative queues and propose the first learning decentralized algorithm guaranteeing stability of the system as long as the ratio of rates is larger than $1$, thus reaching performances comparable to centralized strategies. △ Less

Submitted 4 November, 2021; v1 submitted 8 June, 2021; originally announced June 2021.

Comments: NeurIPS 2021 camera ready

arXiv:2102.08087 [pdf, other]

Making the most of your day: online learning for optimal allocation of time

Authors: Etienne Boursier, Tristan Garrec, Vianney Perchet, Marco Scarsini

Abstract: We study online learning for optimal allocation when the resource to be allocated is time. %Examples of possible applications include job scheduling for a computing server, a driver filling a day with rides, a landlord renting an estate, etc. An agent receives task proposals sequentially according to a Poisson process and can either accept or reject a proposed task. If she accepts the proposal, sh… ▽ More We study online learning for optimal allocation when the resource to be allocated is time. %Examples of possible applications include job scheduling for a computing server, a driver filling a day with rides, a landlord renting an estate, etc. An agent receives task proposals sequentially according to a Poisson process and can either accept or reject a proposed task. If she accepts the proposal, she is busy for the duration of the task and obtains a reward that depends on the task duration. If she rejects it, she remains on hold until a new task proposal arrives. We study the regret incurred by the agent, first when she knows her reward function but does not know the distribution of the task duration, and then when she does not know her reward function, either. This natural setting bears similarities with contextual (one-armed) bandits, but with the crucial difference that the normalized reward associated to a context depends on the whole distribution of contexts. △ Less

Submitted 4 November, 2021; v1 submitted 16 February, 2021; originally announced February 2021.

Comments: NeurIPS 2021 camera ready

arXiv:2012.14264 [pdf, other]

Lifelong Learning in Multi-Armed Bandits

Authors: Matthieu Jedor, Jonathan Louëdec, Vianney Perchet

Abstract: Continuously learning and leveraging the knowledge accumulated from prior tasks in order to improve future performance is a long standing machine learning problem. In this paper, we study the problem in the multi-armed bandit framework with the objective to minimize the total regret incurred over a series of tasks. While most bandit algorithms are designed to have a low worst-case regret, we exami… ▽ More Continuously learning and leveraging the knowledge accumulated from prior tasks in order to improve future performance is a long standing machine learning problem. In this paper, we study the problem in the multi-armed bandit framework with the objective to minimize the total regret incurred over a series of tasks. While most bandit algorithms are designed to have a low worst-case regret, we examine here the average regret over bandit instances drawn from some prior distribution which may change over time. We specifically focus on confidence interval tuning of UCB algorithms. We propose a bandit over bandit approach with greedy algorithms and we perform extensive experimental evaluations in both stationary and non-stationary environments. We further apply our solution to the mortal bandit problem, showing empirical improvement over previous work. △ Less

Submitted 28 December, 2020; originally announced December 2020.

arXiv:2007.09996 [pdf, ps, other]

Social Learning in Non-Stationary Environments

Authors: Etienne Boursier, Vianney Perchet, Marco Scarsini

Abstract: Potential buyers of a product or service, before making their decisions, tend to read reviews written by previous consumers. We consider Bayesian consumers with heterogeneous preferences, who sequentially decide whether to buy an item of unknown quality, based on previous buyers' reviews. The quality is multi-dimensional and may occasionally vary over time; the reviews are also multi-dimensional.… ▽ More Potential buyers of a product or service, before making their decisions, tend to read reviews written by previous consumers. We consider Bayesian consumers with heterogeneous preferences, who sequentially decide whether to buy an item of unknown quality, based on previous buyers' reviews. The quality is multi-dimensional and may occasionally vary over time; the reviews are also multi-dimensional. In the simple uni-dimensional and static setting, beliefs about the quality are known to converge to its true value. Our paper extends this result in several ways. First, a multi-dimensional quality is considered, second, rates of convergence are provided, third, a dynamical Markovian model with varying quality is studied. In this dynamical setting the cost of learning is shown to be small. △ Less

Submitted 23 February, 2022; v1 submitted 20 July, 2020; originally announced July 2020.

arXiv:2006.06613 [pdf, ps, other]

Statistical Efficiency of Thompson Sampling for Combinatorial Semi-Bandits

Authors: Pierre Perrault, Etienne Boursier, Vianney Perchet, Michal Valko

Abstract: We investigate stochastic combinatorial multi-armed bandit with semi-bandit feedback (CMAB). In CMAB, the question of the existence of an efficient policy with an optimal asymptotic regret (up to a factor poly-logarithmic with the action size) is still open for many families of distributions, including mutually independent outcomes, and more generally the multivariate sub-Gaussian family. We propo… ▽ More We investigate stochastic combinatorial multi-armed bandit with semi-bandit feedback (CMAB). In CMAB, the question of the existence of an efficient policy with an optimal asymptotic regret (up to a factor poly-logarithmic with the action size) is still open for many families of distributions, including mutually independent outcomes, and more generally the multivariate sub-Gaussian family. We propose to answer the above question for these two families by analyzing variants of the Combinatorial Thompson Sampling policy (CTS). For mutually independent outcomes in $[0,1]$, we propose a tight analysis of CTS using Beta priors. We then look at the more general setting of multivariate sub-Gaussian outcomes and propose a tight analysis of CTS using Gaussian priors. This last result gives us an alternative to the Efficient Sampling for Combinatorial Bandit policy (ESCB), which, although optimal, is not computationally efficient. △ Less

Submitted 3 January, 2021; v1 submitted 11 June, 2020; originally announced June 2020.

Comments: accepted to NeurIPS 2020

arXiv:2005.01656 [pdf, ps, other]

Categorized Bandits

Authors: Matthieu Jedor, Jonathan Louedec, Vianney Perchet

Abstract: We introduce a new stochastic multi-armed bandit setting where arms are grouped inside ``ordered'' categories. The motivating example comes from e-commerce, where a customer typically has a greater appetence for items of a specific well-identified but unknown category than any other one. We introduce three concepts of ordering between categories, inspired by stochastic dominance between random var… ▽ More We introduce a new stochastic multi-armed bandit setting where arms are grouped inside ``ordered'' categories. The motivating example comes from e-commerce, where a customer typically has a greater appetence for items of a specific well-identified but unknown category than any other one. We introduce three concepts of ordering between categories, inspired by stochastic dominance between random variables, which are gradually weaker so that more and more bandit scenarios satisfy at least one of them. We first prove instance-dependent lower bounds on the cumulative regret for each of these models, indicating how the complexity of the bandit problems increases with the generality of the ordering concept considered. We also provide algorithms that fully leverage the structure of the model with their associated theoretical guarantees. Finally, we have conducted an analysis on real data to highlight that those ordered categories actually exist in practice. △ Less

Submitted 4 May, 2020; originally announced May 2020.

arXiv:2002.01197 [pdf, ps, other]

Selfish Robustness and Equilibria in Multi-Player Bandits

Authors: Etienne Boursier, Vianney Perchet

Abstract: Motivated by cognitive radios, stochastic multi-player multi-armed bandits gained a lot of interest recently. In this class of problems, several players simultaneously pull arms and encounter a collision - with 0 reward - if some of them pull the same arm at the same time. While the cooperative case where players maximize the collective reward (obediently following some fixed protocol) has been mo… ▽ More Motivated by cognitive radios, stochastic multi-player multi-armed bandits gained a lot of interest recently. In this class of problems, several players simultaneously pull arms and encounter a collision - with 0 reward - if some of them pull the same arm at the same time. While the cooperative case where players maximize the collective reward (obediently following some fixed protocol) has been mostly considered, robustness to malicious players is a crucial and challenging concern. Existing approaches consider only the case of adversarial jammers whose objective is to blindly minimize the collective reward. We shall consider instead the more natural class of selfish players whose incentives are to maximize their individual rewards, potentially at the expense of the social welfare. We provide the first algorithm robust to selfish players (a.k.a. Nash equilibrium) with a logarithmic regret, when the arm performance is observed. When collisions are also observed, Grim Trigger type of strategies enable some implicit communication-based algorithms and we construct robust algorithms in two different settings: the homogeneous (with a regret comparable to the centralized optimal one) and heterogeneous cases (for an adapted and relevant notion of regret). We also provide impossibility results when only the reward is observed or when arm means vary arbitrarily among players. △ Less

Submitted 19 June, 2020; v1 submitted 4 February, 2020; originally announced February 2020.

arXiv:1907.04723 [pdf, other]

doi 10.1007/978-3-030-19875-6_9

Markov Decision Process for MOOC users behavioral inference

Authors: Firas Jarboui, Célya Gruson-daniel, Pierre Chanial, Alain Durmus, Vincent Rocchisani, Sophie-helene Goulet Ebongue, Anneliese Depoux, Wilfried Kirschenmann, Vianney Perchet

Abstract: Studies on massive open online courses (MOOCs) users discuss the existence of typical profiles and their impact on the learning process of the students. However defining the typical behaviors as well as classifying the users accordingly is a difficult task. In this paper we suggest two methods to model MOOC users behaviour given their log data. We mold their behavior into a Markov Decision Process… ▽ More Studies on massive open online courses (MOOCs) users discuss the existence of typical profiles and their impact on the learning process of the students. However defining the typical behaviors as well as classifying the users accordingly is a difficult task. In this paper we suggest two methods to model MOOC users behaviour given their log data. We mold their behavior into a Markov Decision Process framework. We associate the user's intentions with the MDP reward and argue that this allows us to classify them. △ Less

Submitted 10 March, 2021; v1 submitted 10 July, 2019; originally announced July 2019.

arXiv:1906.08509 [pdf, other]

Online A-Optimal Design and Active Linear Regression

Authors: Xavier Fontaine, Pierre Perrault, Michal Valko, Vianney Perchet

Abstract: We consider in this paper the problem of optimal experiment design where a decision maker can choose which points to sample to obtain an estimate $\hatβ$ of the hidden parameter $β^{\star}$ of an underlying linear model. The key challenge of this work lies in the heteroscedasticity assumption that we make, meaning that each covariate has a different and unknown variance. The goal of the decision m… ▽ More We consider in this paper the problem of optimal experiment design where a decision maker can choose which points to sample to obtain an estimate $\hatβ$ of the hidden parameter $β^{\star}$ of an underlying linear model. The key challenge of this work lies in the heteroscedasticity assumption that we make, meaning that each covariate has a different and unknown variance. The goal of the decision maker is then to figure out on the fly the optimal way to allocate the total budget of $T$ samples between covariates, as sampling several times a specific one will reduce the variance of the estimated model around it (but at the cost of a possible higher variance elsewhere). By trying to minimize the $\ell^2$-loss $\mathbb{E} [\lVert\hatβ-β^{\star}\rVert^2]$ the decision maker is actually minimizing the trace of the covariance matrix of the problem, which corresponds then to online A-optimal design. Combining techniques from bandit and convex optimization we propose a new active sampling algorithm and we compare it with existing ones. We provide theoretical guarantees of this algorithm in different settings, including a $\mathcal{O}(T^{-2})$ regret bound in the case where the covariates form a basis of the feature space, generalizing and improving existing results. Numerical experiments validate our theoretical findings. △ Less

Submitted 30 December, 2020; v1 submitted 20 June, 2019; originally announced June 2019.

Comments: 29 pages, 5 figures

arXiv:1905.11797 [pdf, ps, other]

ROI Maximization in Stochastic Online Decision-Making

Authors: Nicolò Cesa-Bianchi, Tommaso Cesari, Yishay Mansour, Vianney Perchet

Abstract: We introduce a novel theoretical framework for Return On Investment (ROI) maximization in repeated decision-making. Our setting is motivated by the use case of companies that regularly receive proposals for technological innovations and want to quickly decide whether they are worth implementing. We design an algorithm for learning ROI-maximizing decision-making policies over a sequence of innovati… ▽ More We introduce a novel theoretical framework for Return On Investment (ROI) maximization in repeated decision-making. Our setting is motivated by the use case of companies that regularly receive proposals for technological innovations and want to quickly decide whether they are worth implementing. We design an algorithm for learning ROI-maximizing decision-making policies over a sequence of innovation proposals. Our algorithm provably converges to an optimal policy in class $Π$ at a rate of order $\min\big\{1/(NΔ^2),N^{-1/3}\}$, where $N$ is the number of innovations and $Δ$ is the suboptimality gap in $Π$. A significant hurdle of our formulation, which sets it aside from other online learning problems such as bandits, is that running a policy does not provide an unbiased estimate of its performance. △ Less

Submitted 22 December, 2021; v1 submitted 28 May, 2019; originally announced May 2019.

arXiv:1905.11148 [pdf, other]

Utility/Privacy Trade-off through the lens of Optimal Transport

Authors: Etienne Boursier, Vianney Perchet

Abstract: Strategic information is valuable either by remaining private (for instance if it is sensitive) or, on the other hand, by being used publicly to increase some utility. These two objectives are antagonistic and leaking this information might be more rewarding than concealing it. Unlike classical solutions that focus on the first point, we consider instead agents that optimize a natural trade-off be… ▽ More Strategic information is valuable either by remaining private (for instance if it is sensitive) or, on the other hand, by being used publicly to increase some utility. These two objectives are antagonistic and leaking this information might be more rewarding than concealing it. Unlike classical solutions that focus on the first point, we consider instead agents that optimize a natural trade-off between both objectives. We formalize this as an optimization problem where the objective map** is regularized by the amount of information revealed to the adversary (measured as a divergence between the prior and posterior on the private knowledge). Quite surprisingly, when combined with the entropic regularization, the Sinkhorn loss naturally emerges in the optimization objective, making it efficiently solvable. We apply these techniques to preserve some privacy in online repeated auctions. △ Less

Submitted 2 March, 2020; v1 submitted 27 May, 2019; originally announced May 2019.

Comments: AISTATS 2020

arXiv:1902.04376 [pdf, ps, other]

An adaptive stochastic optimization algorithm for resource allocation

Authors: Xavier Fontaine, Shie Mannor, Vianney Perchet

Abstract: We consider the classical problem of sequential resource allocation where a decision maker must repeatedly divide a budget between several resources, each with diminishing returns. This can be recast as a specific stochastic optimization problem where the objective is to maximize the cumulative reward, or equivalently to minimize the regret. We construct an algorithm that is {\em adaptive} to the… ▽ More We consider the classical problem of sequential resource allocation where a decision maker must repeatedly divide a budget between several resources, each with diminishing returns. This can be recast as a specific stochastic optimization problem where the objective is to maximize the cumulative reward, or equivalently to minimize the regret. We construct an algorithm that is {\em adaptive} to the complexity of the problem, expressed in term of the regularity of the returns of the resources, measured by the exponent in the Łojasiewicz inequality (or by their universal concavity parameter). Our parameter-independent algorithm recovers the optimal rates for strongly-concave functions and the classical fast rates of multi-armed bandit (for linear reward functions). Moreover, the algorithm improves existing results on stochastic optimization in this regret minimization setting for intermediate cases. △ Less

Submitted 16 January, 2020; v1 submitted 12 February, 2019; originally announced February 2019.

Comments: ALT2020, 45 pages, 9 figures

Journal ref: Proceedings of Machine Learning Research (PMLR), volume 117, 2020

arXiv:1902.03794 [pdf, other]

Exploiting Structure of Uncertainty for Efficient Matroid Semi-Bandits

Authors: Pierre Perrault, Vianney Perchet, Michal Valko

Abstract: We improve the efficiency of algorithms for stochastic \emph{combinatorial semi-bandits}. In most interesting problems, state-of-the-art algorithms take advantage of structural properties of rewards, such as \emph{independence}. However, while being optimal in terms of asymptotic regret, these algorithms are inefficient. In our paper, we first reduce their implementation to a specific \emph{submod… ▽ More We improve the efficiency of algorithms for stochastic \emph{combinatorial semi-bandits}. In most interesting problems, state-of-the-art algorithms take advantage of structural properties of rewards, such as \emph{independence}. However, while being optimal in terms of asymptotic regret, these algorithms are inefficient. In our paper, we first reduce their implementation to a specific \emph{submodular maximization}. Then, in case of \emph{matroid} constraints, we design adapted approximation routines, thereby providing the first efficient algorithms that rely on reward structure to improve regret bound. In particular, we improve the state-of-the-art efficient gap-free regret bound by a factor $\sqrt{m}/\log m$, where $m$ is the maximum action size. Finally, we show how our improvement translates to more general \emph{budgeted combinatorial semi-bandits}. △ Less

Submitted 20 June, 2019; v1 submitted 11 February, 2019; originally announced February 2019.

Comments: Accepted to ICML 2019, Long Beach

arXiv:1902.01239 [pdf, other]

A Practical Algorithm for Multiplayer Bandits when Arm Means Vary Among Players

Authors: Etienne Boursier, Emilie Kaufmann, Abbas Mehrabian, Vianney Perchet

Abstract: We study a multiplayer stochastic multi-armed bandit problem in which players cannot communicate, and if two or more players pull the same arm, a collision occurs and the involved players receive zero reward. We consider the challenging heterogeneous setting, in which different arms may have different means for different players, and propose a new and efficient algorithm that combines the idea of… ▽ More We study a multiplayer stochastic multi-armed bandit problem in which players cannot communicate, and if two or more players pull the same arm, a collision occurs and the involved players receive zero reward. We consider the challenging heterogeneous setting, in which different arms may have different means for different players, and propose a new and efficient algorithm that combines the idea of leveraging forced collisions for implicit communication and that of performing matching eliminations. We present a finite-time analysis of our algorithm, giving the first sublinear minimax regret bound for this problem, and prove that if the optimal assignment of players to arms is unique, our algorithm attains the optimal $O(\ln(T))$ regret, solving an open question raised at NeurIPS 2018. △ Less

Submitted 20 March, 2020; v1 submitted 4 February, 2019; originally announced February 2019.

Comments: AISTATS2020

arXiv:1810.05065 [pdf, ps, other]

Regularized Contextual Bandits

Authors: Xavier Fontaine, Quentin Berthet, Vianney Perchet

Abstract: We consider the stochastic contextual bandit problem with additional regularization. The motivation comes from problems where the policy of the agent must be close to some baseline policy which is known to perform well on the task. To tackle this problem we use a nonparametric model and propose an algorithm splitting the context space into bins, and solving simultaneously - and independently - reg… ▽ More We consider the stochastic contextual bandit problem with additional regularization. The motivation comes from problems where the policy of the agent must be close to some baseline policy which is known to perform well on the task. To tackle this problem we use a nonparametric model and propose an algorithm splitting the context space into bins, and solving simultaneously - and independently - regularized multi-armed bandit instances on each bin. We derive slow and fast rates of convergence, depending on the unknown complexity of the problem. We also consider a new relevant margin condition to get problem-independent convergence rates, ending up in intermediate convergence rates interpolating between the aforementioned slow and fast rates. △ Less

Submitted 5 June, 2019; v1 submitted 11 October, 2018; originally announced October 2018.

Comments: AISTATS 2019, 23 pages, 2 figures

Journal ref: Proceedings of Machine Learning Research, PMLR 89:2144-2153, 2019

arXiv:1810.04088 [pdf, other]

Bridging the gap between regret minimization and best arm identification, with application to A/B tests

Authors: Rémy Degenne, Thomas Nedelec, Clément Calauzènes, Vianney Perchet

Abstract: State of the art online learning procedures focus either on selecting the best alternative ("best arm identification") or on minimizing the cost (the "regret"). We merge these two objectives by providing the theoretical analysis of cost minimizing algorithms that are also delta-PAC (with a proven guaranteed bound on the decision time), hence fulfilling at the same time regret minimization and best… ▽ More State of the art online learning procedures focus either on selecting the best alternative ("best arm identification") or on minimizing the cost (the "regret"). We merge these two objectives by providing the theoretical analysis of cost minimizing algorithms that are also delta-PAC (with a proven guaranteed bound on the decision time), hence fulfilling at the same time regret minimization and best arm identification. This analysis sheds light on the common observation that ill-callibrated UCB-algorithms minimize regret while still identifying quickly the best arm. We also extend these results to the non-iid case faced by many practitioners. This provides a technique to make cost versus decision time compromise when doing adaptive tests with applications ranging from website A/B testing to clinical trials. △ Less

Submitted 26 February, 2019; v1 submitted 9 October, 2018; originally announced October 2018.

Journal ref: AISTATS 2019 proceedings

arXiv:1809.08151 [pdf, other]

SIC-MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits

Authors: Etienne Boursier, Vianney Perchet

Abstract: Motivated by cognitive radio networks, we consider the stochastic multiplayer multi-armed bandit problem, where several players pull arms simultaneously and collisions occur if one of them is pulled by several players at the same stage. We present a decentralized algorithm that achieves the same performance as a centralized one, contradicting the existing lower bounds for that problem. This is pos… ▽ More Motivated by cognitive radio networks, we consider the stochastic multiplayer multi-armed bandit problem, where several players pull arms simultaneously and collisions occur if one of them is pulled by several players at the same stage. We present a decentralized algorithm that achieves the same performance as a centralized one, contradicting the existing lower bounds for that problem. This is possible by "hacking" the standard model by constructing a communication protocol between players that deliberately enforces collisions, allowing them to share their information at a negligible cost. This motivates the introduction of a more appropriate dynamic setting without sensing, where similar communication protocols are no longer possible. However, we show that the logarithmic growth of the regret is still achievable for this model with a new algorithm. △ Less

Submitted 19 November, 2019; v1 submitted 21 September, 2018; originally announced September 2018.

Journal ref: NeurIPS 2019

arXiv:1807.03558 [pdf, other]

Bandits with Side Observations: Bounded vs. Logarithmic Regret

Authors: Rémy Degenne, Evrard Garcelon, Vianney Perchet

Abstract: We consider the classical stochastic multi-armed bandit but where, from time to time and roughly with frequency $ε$, an extra observation is gathered by the agent for free. We prove that, no matter how small $ε$ is the agent can ensure a regret uniformly bounded in time. More precisely, we construct an algorithm with a regret smaller than $\sum_i \frac{\log(1/ε)}{Δ_i}$, up to multiplicative cons… ▽ More We consider the classical stochastic multi-armed bandit but where, from time to time and roughly with frequency $ε$, an extra observation is gathered by the agent for free. We prove that, no matter how small $ε$ is the agent can ensure a regret uniformly bounded in time. More precisely, we construct an algorithm with a regret smaller than $\sum_i \frac{\log(1/ε)}{Δ_i}$, up to multiplicative constant and loglog terms. We also prove a matching lower-bound, stating that no reasonable algorithm can outperform this quantity. △ Less

Submitted 10 July, 2018; originally announced July 2018.

Comments: Conference on Uncertainty in Artificial Intelligence (UAI) 2018, 21 pages

arXiv:1807.03288 [pdf, ps, other]

Dynamic Pricing with Finitely Many Unknown Valuations

Authors: Nicolò Cesa-Bianchi, Tommaso Cesari, Vianney Perchet

Abstract: Motivated by posted price auctions where buyers are grouped in an unknown number of latent types characterized by their private values for the good on sale, we investigate revenue maximization in stochastic dynamic pricing when the distribution of buyers' private values is supported on an unknown set of points in [0,1] of unknown cardinality $K$. This setting can be viewed as an instance of a stoc… ▽ More Motivated by posted price auctions where buyers are grouped in an unknown number of latent types characterized by their private values for the good on sale, we investigate revenue maximization in stochastic dynamic pricing when the distribution of buyers' private values is supported on an unknown set of points in [0,1] of unknown cardinality $K$. This setting can be viewed as an instance of a stochastic $K$-armed bandit problem where the location of the arms (the $K$ unknown valuations) must be learned as well. In the distribution-free case, we prove that our setting is just as hard as $K$-armed stochastic bandits: no algorithm can achieve a regret significantly better than $\sqrt{KT}$, (where T is the time horizon); we present an efficient algorithm matching this lower bound up to logarithmic factors. In the distribution-dependent case, we show that for all $K>2$ our setting is strictly harder than $K$-armed stochastic bandits by proving that it is impossible to obtain regret bounds that grow logarithmically in time or slower. On the other hand, when a lower bound $γ>0$ on the smallest drop in the demand curve is known, we prove an upper bound on the regret of order $(1/Δ+(\log \log T)/γ^2)(K\log T)$. This is a significant improvement on previously known regret bounds for discontinuous demand curves, that are at best of order $(K^{12}/γ^8)\sqrt{T}$. When $K=2$ in the distribution-dependent case, the hardness of our setting reduces to that of a stochastic $2$-armed bandit: we prove that an upper bound of order $(\log T)/Δ$ (up to $\log\log$ factors) on the regret can be achieved with no information on the demand curve. Finally, we show a $O(\sqrt{T})$ upper bound on the regret for the setting in which the buyers' decisions are nonstochastic, and the regret is measured with respect to the best between two fixed valuations one of which is known to the seller. △ Less

Submitted 5 March, 2019; v1 submitted 9 July, 2018; originally announced July 2018.

arXiv:1806.02282 [pdf, ps, other]

Finding the bandit in a graph: Sequential search-and-stop

Authors: Pierre Perrault, Vianney Perchet, Michal Valko

Abstract: We consider the problem where an agent wants to find a hidden object that is randomly located in some vertex of a directed acyclic graph (DAG) according to a fixed but possibly unknown distribution. The agent can only examine vertices whose in-neighbors have already been examined. In this paper, we address a learning setting where we allow the agent to stop before having found the object and resta… ▽ More We consider the problem where an agent wants to find a hidden object that is randomly located in some vertex of a directed acyclic graph (DAG) according to a fixed but possibly unknown distribution. The agent can only examine vertices whose in-neighbors have already been examined. In this paper, we address a learning setting where we allow the agent to stop before having found the object and restart searching on a new independent instance of the same problem. Our goal is to maximize the total number of hidden objects found given a time budget. The agent can thus skip an instance after realizing that it would spend too much time on it. Our contributions are both to the search theory and multi-armed bandits. If the distribution is known, we provide a quasi-optimal and efficient stationary strategy. If the distribution is unknown, we additionally show how to sequentially approximate it and, at the same time, act near-optimally in order to collect as many hidden objects as possible. △ Less

Submitted 22 April, 2019; v1 submitted 6 June, 2018; originally announced June 2018.

Comments: in International Conference on Artificial Intelligence and Statistics (AISTATS 2019), April 2019, Naha, Okinawa, Japan

arXiv:1704.00773 [pdf, other]

A comparative study of counterfactual estimators

Authors: Thomas Nedelec, Nicolas Le Roux, Vianney Perchet

Abstract: We provide a comparative study of several widely used off-policy estimators (Empirical Average, Basic Importance Sampling and Normalized Importance Sampling), detailing the different regimes where they are individually suboptimal. We then exhibit properties optimal estimators should possess. In the case where examples have been gathered using multiple policies, we show that fused estimators domina… ▽ More We provide a comparative study of several widely used off-policy estimators (Empirical Average, Basic Importance Sampling and Normalized Importance Sampling), detailing the different regimes where they are individually suboptimal. We then exhibit properties optimal estimators should possess. In the case where examples have been gathered using multiple policies, we show that fused estimators dominate basic ones but can still be improved. △ Less

Submitted 29 January, 2019; v1 submitted 3 April, 2017; originally announced April 2017.

arXiv:1702.06917 [pdf, ps, other]

Fast Rates for Bandit Optimization with Upper-Confidence Frank-Wolfe

Authors: Quentin Berthet, Vianney Perchet

Abstract: We consider the problem of bandit optimization, inspired by stochastic optimization and online learning problems with bandit feedback. In this problem, the objective is to minimize a global loss function of all the actions, not necessarily a cumulative loss. This framework allows us to study a very general class of problems, with applications in statistics, machine learning, and other fields. To s… ▽ More We consider the problem of bandit optimization, inspired by stochastic optimization and online learning problems with bandit feedback. In this problem, the objective is to minimize a global loss function of all the actions, not necessarily a cumulative loss. This framework allows us to study a very general class of problems, with applications in statistics, machine learning, and other fields. To solve this problem, we analyze the Upper-Confidence Frank-Wolfe algorithm, inspired by techniques for bandits and convex optimization. We give theoretical guarantees for the performance of this algorithm over various classes of functions, and discuss the optimality of these results. △ Less

Submitted 6 September, 2017; v1 submitted 22 February, 2017; originally announced February 2017.

arXiv:1609.08870 [pdf, ps, other]

Approachability of convex sets in generalized quitting games

Authors: János Flesch, Rida Laraki, Vianney Perchet

Abstract: We consider Blackwell approachability, a very powerful and geometric tool in game theory, used for example to design strategies of the uninformed player in repeated games with incomplete information. We extend this theory to "generalized quitting games" , a class of repeated stochastic games in which each player may have quitting actions, such as the Big-Match. We provide three simple geometric an… ▽ More We consider Blackwell approachability, a very powerful and geometric tool in game theory, used for example to design strategies of the uninformed player in repeated games with incomplete information. We extend this theory to "generalized quitting games" , a class of repeated stochastic games in which each player may have quitting actions, such as the Big-Match. We provide three simple geometric and strongly related conditions for the weak approachability of a convex target set. The first is sufficient: it guarantees that, for any fixed horizon, a player has a strategy ensuring that the expected time-average payoff vector converges to the target set as horizon goes to infinity. The third is necessary: if it is not satisfied, the opponent can weakly exclude the target set. In the special case where only the approaching player can quit the game (Big-Match of type I), the three conditions are equivalent and coincide with Blackwell's condition. Consequently, we obtain a full characterization and prove that the game is weakly determined-every convex set is either weakly approachable or weakly excludable. In games where only the opponent can quit (Big-Match of type II), none of our conditions is both sufficient and necessary for weak approachability. We provide a continuous time sufficient condition using techniques coming from differential games, and show its usefulness in practice, in the spirit of Vieille's seminal work for weak approachability.Finally, we study uniform approachability where the strategy should not depend on the horizon and demonstrate that, in contrast with classical Blackwell approacha-bility for convex sets, weak approachability does not imply uniform approachability. △ Less

Submitted 28 September, 2016; originally announced September 2016.

arXiv:1511.08405 [pdf, ps, other]

Gains and Losses are Fundamentally Different in Regret Minimization: The Sparse Case

Authors: Joon Kwon, Vianney Perchet

Abstract: We demonstrate that, in the classical non-stochastic regret minimization problem with $d$ decisions, gains and losses to be respectively maximized or minimized are fundamentally different. Indeed, by considering the additional sparsity assumption (at each stage, at most $s$ decisions incur a nonzero outcome), we derive optimal regret bounds of different orders. Specifically, with gains, we obtain… ▽ More We demonstrate that, in the classical non-stochastic regret minimization problem with $d$ decisions, gains and losses to be respectively maximized or minimized are fundamentally different. Indeed, by considering the additional sparsity assumption (at each stage, at most $s$ decisions incur a nonzero outcome), we derive optimal regret bounds of different orders. Specifically, with gains, we obtain an optimal regret guarantee after $T$ stages of order $\sqrt{T\log s}$, so the classical dependency in the dimension is replaced by the sparsity size. With losses, we provide matching upper and lower bounds of order $\sqrt{Ts\log(d)/d}$, which is decreasing in $d$. Eventually, we also study the bandit setting, and obtain an upper bound of order $\sqrt{Ts\log (d/s)}$ when outcomes are losses. This bound is proven to be optimal up to the logarithmic factor $\sqrt{\log(d/s)}$. △ Less

Submitted 26 November, 2015; originally announced November 2015.

arXiv:1511.05720 [pdf, other]

Online learning in repeated auctions

Authors: Jonathan Weed, Vianney Perchet, Philippe Rigollet

Abstract: Motivated by online advertising auctions, we consider repeated Vickrey auctions where goods of unknown value are sold sequentially and bidders only learn (potentially noisy) information about a good's value once it is purchased. We adopt an online learning approach with bandit feedback to model this problem and derive bidding strategies for two models: stochastic and adversarial. In the stochastic… ▽ More Motivated by online advertising auctions, we consider repeated Vickrey auctions where goods of unknown value are sold sequentially and bidders only learn (potentially noisy) information about a good's value once it is purchased. We adopt an online learning approach with bandit feedback to model this problem and derive bidding strategies for two models: stochastic and adversarial. In the stochastic model, the observed values of the goods are random variables centered around the true value of the good. In this case, logarithmic regret is achievable when competing against well behaved adversaries. In the adversarial model, the goods need not be identical and we simply compare our performance against that of the best fixed bid in hindsight. We show that sublinear regret is also achievable in this case and prove matching minimax lower bounds. To our knowledge, this is the first complete set of strategies for bidders participating in auctions of this type. △ Less

Submitted 18 November, 2015; originally announced November 2015.

MSC Class: Primary 62L05; secondary 62C20

arXiv:1402.2043 [pdf, other]

Approachability in unknown games: Online learning meets multi-objective optimization

Authors: Shie Mannor, Vianney Perchet, Gilles Stoltz

Abstract: In the standard setting of approachability there are two players and a target set. The players play repeatedly a known vector-valued game where the first player wants to have the average vector-valued payoff converge to the target set which the other player tries to exclude it from this set. We revisit this setting in the spirit of online learning and do not assume that the first player knows the… ▽ More In the standard setting of approachability there are two players and a target set. The players play repeatedly a known vector-valued game where the first player wants to have the average vector-valued payoff converge to the target set which the other player tries to exclude it from this set. We revisit this setting in the spirit of online learning and do not assume that the first player knows the game structure: she receives an arbitrary vector-valued reward vector at every round. She wishes to approach the smallest ("best") possible set given the observed average payoffs in hindsight. This extension of the standard setting has implications even when the original target set is not approachable and when it is not obvious which expansion of it should be approached instead. We show that it is impossible, in general, to approach the best target set in hindsight and propose achievable though ambitious alternative goals. We further propose a concrete strategy to approach these goals. Our method does not require projection onto a target set and amounts to switching between scalar regret minimization algorithms that are performed in episodes. Applications to global cost minimization and to approachability under sample path constraints are considered. △ Less

Submitted 17 June, 2016; v1 submitted 10 February, 2014; originally announced February 2014.

arXiv:1311.4825 [pdf, other]

Gaussian Process Optimization with Mutual Information

Authors: Emile Contal, Vianney Perchet, Nicolas Vayatis

Abstract: In this paper, we analyze a generic algorithm scheme for sequential global optimization using Gaussian processes. The upper bounds we derive on the cumulative regret for this generic algorithm improve by an exponential factor the previously known bounds for algorithms like GP-UCB. We also introduce the novel Gaussian Process Mutual Information algorithm (GP-MI), which significantly improves furthe… ▽ More In this paper, we analyze a generic algorithm scheme for sequential global optimization using Gaussian processes. The upper bounds we derive on the cumulative regret for this generic algorithm improve by an exponential factor the previously known bounds for algorithms like GP-UCB. We also introduce the novel Gaussian Process Mutual Information algorithm (GP-MI), which significantly improves further these upper bounds for the cumulative regret. We confirm the efficiency of this algorithm on synthetic and real tasks against the natural competitor, GP-UCB, and also the Expected Improvement heuristic. △ Less

Submitted 8 June, 2015; v1 submitted 19 November, 2013; originally announced November 2013.

Comments: Proceedings of The 31st International Conference on Machine Learning (ICML 2014)

arXiv:1305.5399 [pdf, other]

A Primal Condition for Approachability with Partial Monitoring

Authors: Shie Mannor, Vianney Perchet, Gilles Stoltz

Abstract: In approachability with full monitoring there are two types of conditions that are known to be equivalent for convex sets: a primal and a dual condition. The primal one is of the form: a set C is approachable if and only all containing half-spaces are approachable in the one-shot game; while the dual one is of the form: a convex set C is approachable if and only if it intersects all payoff sets of… ▽ More In approachability with full monitoring there are two types of conditions that are known to be equivalent for convex sets: a primal and a dual condition. The primal one is of the form: a set C is approachable if and only all containing half-spaces are approachable in the one-shot game; while the dual one is of the form: a convex set C is approachable if and only if it intersects all payoff sets of a certain form. We consider approachability in games with partial monitoring. In previous works (Perchet 2011; Mannor et al. 2011) we provided a dual characterization of approachable convex sets; we also exhibited efficient strategies in the case where C is a polytope. In this paper we provide primal conditions on a convex set to be approachable with partial monitoring. They depend on a modified reward function and lead to approachability strategies, based on modified payoff functions, that proceed by projections similarly to Blackwell's (1956) strategy; this is in contrast with previously studied strategies in this context that relied mostly on the signaling structure and aimed at estimating well the distributions of the signals received. Our results generalize classical results by Kohlberg 1975 (see also Mertens et al. 1994) and apply to games with arbitrary signaling structure as well as to arbitrary convex sets. △ Less

Submitted 23 May, 2013; originally announced May 2013.

arXiv:1302.1611 [pdf, ps, other]

Bounded regret in stochastic multi-armed bandits

Authors: Sébastien Bubeck, Vianney Perchet, Philippe Rigollet

Abstract: We study the stochastic multi-armed bandit problem when one knows the value $μ^{(\star)}$ of an optimal arm, as a well as a positive lower bound on the smallest positive gap $Δ$. We propose a new randomized policy that attains a regret {\em uniformly bounded over time} in this setting. We also prove several lower bounds, which show in particular that bounded regret is not possible if one only know… ▽ More We study the stochastic multi-armed bandit problem when one knows the value $μ^{(\star)}$ of an optimal arm, as a well as a positive lower bound on the smallest positive gap $Δ$. We propose a new randomized policy that attains a regret {\em uniformly bounded over time} in this setting. We also prove several lower bounds, which show in particular that bounded regret is not possible if one only knows $Δ$, and bounded regret of order $1/Δ$ is not possible if one only knows $μ^{(\star)}$ △ Less

Submitted 12 February, 2013; v1 submitted 6 February, 2013; originally announced February 2013.

MSC Class: 62L05

arXiv:1110.6084 [pdf, ps, other]

doi 10.1214/13-AOS1101

The multi-armed bandit problem with covariates

Authors: Vianney Perchet, Philippe Rigollet

Abstract: We consider a multi-armed bandit problem in a setting where each arm produces a noisy reward realization which depends on an observable random covariate. As opposed to the traditional static multi-armed bandit problem, this setting allows for dynamically changing rewards that better describe applications where side information is available. We adopt a nonparametric model where the expected rewards… ▽ More We consider a multi-armed bandit problem in a setting where each arm produces a noisy reward realization which depends on an observable random covariate. As opposed to the traditional static multi-armed bandit problem, this setting allows for dynamically changing rewards that better describe applications where side information is available. We adopt a nonparametric model where the expected rewards are smooth functions of the covariate and where the hardness of the problem is captured by a margin parameter. To maximize the expected cumulative reward, we introduce a policy called Adaptively Binned Successive Elimination (abse) that adaptively decomposes the global problem into suitably "localized" static bandit problems. This policy constructs an adaptive partition using a variant of the Successive Elimination (se) policy. Our results include sharper regret bounds for the se policy in a static bandit problem and minimax optimal regret bounds for the abse policy in the dynamic problem. △ Less

Submitted 24 May, 2013; v1 submitted 27 October, 2011; originally announced October 2011.

Comments: Published in at http://dx.doi.org/10.1214/13-AOS1101 the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS1101

Journal ref: Annals of Statistics 2013, Vol. 41, No. 2, 693-721

arXiv:1006.1746 [pdf, ps, other]

Calibration and Internal no-Regret with Partial Monitoring

Authors: Vianney Perchet

Abstract: Calibrated strategies can be obtained by performing strategies that have no internal regret in some auxiliary game. Such strategies can be constructed explicitly with the use of Blackwell's approachability theorem, in an other auxiliary game. We establish the converse: a strategy that approaches a convex $B$-set can be derived from the construction of a calibrated strategy. We develop these tools… ▽ More Calibrated strategies can be obtained by performing strategies that have no internal regret in some auxiliary game. Such strategies can be constructed explicitly with the use of Blackwell's approachability theorem, in an other auxiliary game. We establish the converse: a strategy that approaches a convex $B$-set can be derived from the construction of a calibrated strategy. We develop these tools in the framework of a game with partial monitoring, where players do not observe the actions of their opponents but receive random signals, to define a notion of internal regret and construct strategies that have no such regret. △ Less

Submitted 9 June, 2010; originally announced June 2010.

Showing 1–45 of 45 results for author: Perchet, V