-
Calibrated Forecasting and Persuasion
Authors:
Atulya Jain,
Vianney Perchet
Abstract:
How should an expert send forecasts to maximize her utility subject to passing a calibration test? We consider a dynamic game where an expert sends probabilistic forecasts to a decision maker. The decision maker uses a calibration test based on past outcomes to verify the expert's forecasts. We characterize the optimal forecasting strategy by reducing the dynamic game to a static persuasion problem. A distribution of forecasts is implementable by a calibrated strategy if and only if it is a mean-preserving contraction of the distribution of conditionals (honest forecasts). We characterize the value of information by comparing what an informed and uninformed expert can attain. Moreover, we consider a decision maker who uses regret minimization, instead of the calibration test, to take actions. We show that the expert can achieve the same payoff against a regret minimizer as under the calibration test, and in some instances, she can achieve strictly more.
Submitted 21 June, 2024;
originally announced June 2024.
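As a rough illustration of what "passing a calibration test" can mean, the sketch below computes a standard calibration score: for each announced forecast value, it compares that value with the empirical frequency of the event on the rounds where it was announced. The function name `calibration_error` and the choice of score are assumptions made for illustration; the abstract does not specify the exact test used in the paper.

```python
from collections import defaultdict


def calibration_error(forecasts, outcomes):
    """Weighted average gap between each forecast value and the empirical
    frequency of the event on the rounds where that value was announced.
    (Illustrative score, not necessarily the test used in the paper.)"""
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    n = len(forecasts)
    if n == 0:
        return 0.0
    # Group the binary outcomes by the forecast that was announced.
    buckets = defaultdict(list)
    for p, y in zip(forecasts, outcomes):
        buckets[p].append(y)
    return sum(
        len(ys) / n * abs(p - sum(ys) / len(ys)) for p, ys in buckets.items()
    )


outcomes = [1, 0, 1, 0]
honest = [1.0, 0.0, 1.0, 0.0]   # expert reveals the state each round
pooled = [0.5, 0.5, 0.5, 0.5]   # expert always announces the average
print(calibration_error(honest, outcomes))
print(calibration_error(pooled, outcomes))
```

Both forecast sequences are calibrated on this toy data, which matches the abstract's point that coarser (mean-preserving contracted) forecast distributions remain implementable by a calibrated strategy.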
-
Improved Algorithms for Contextual Dynamic Pricing
Authors:
Matilde Tullii,
Solenne Gaucher,
Nadav Merlis,
Vianney Perchet
Abstract:
In contextual dynamic pricing, a seller sequentially prices goods based on contextual information. Buyers will purchase products only if the prices are below their valuations. The goal of the seller is to design a pricing strategy that collects as much revenue as possible. We focus on two different valuation models. The first assumes that valuations linearly depend on the context and are further distorted by noise. Under minor regularity assumptions, our algorithm achieves an optimal regret bound of $\tilde{\mathcal{O}}(T^{2/3})$, improving the existing results. The second model removes the linearity assumption, requiring only that the expected buyer valuation is $β$-Hölder in the context. For this model, our algorithm obtains a regret of $\tilde{\mathcal{O}}(T^{(d+2β)/(d+3β)})$, where $d$ is the dimension of the context space.
Submitted 17 June, 2024;
originally announced June 2024.
-
Lookback Prophet Inequalities
Authors:
Ziyad Benomar,
Dorian Baudry,
Vianney Perchet
Abstract:
Prophet inequalities are fundamental optimal stopping problems, where a decision-maker sequentially observes items with values sampled independently from known distributions, and must decide at each new observation to either stop and gain the current value or reject it irrevocably and move to the next step. This model is often too pessimistic and does not adequately represent real-world online selection processes. Potentially, rejected items can be revisited and a fraction of their value can be recovered. To analyze this problem, we consider general decay functions $D_1,D_2,\ldots$, quantifying the value to be recovered from a rejected item, depending on how far it has been observed in the past. We analyze how lookback improves, or not, the competitive ratio in prophet inequalities in different order models. We show that, under mild monotonicity assumptions on the decay functions, the problem can be reduced to the case where all the decay functions are equal to the same function $x \mapsto γx$, where $γ= \inf_{x>0} \inf_{j \geq 1} D_j(x)/x$. Consequently, we focus on this setting and refine the analyses of the competitive ratios, with upper and lower bounds expressed as increasing functions of $γ$.
Submitted 10 June, 2024;
originally announced June 2024.
-
Feature-Based Online Bilateral Trade
Authors:
Solenne Gaucher,
Martino Bernasconi,
Matteo Castiglioni,
Andrea Celli,
Vianney Perchet
Abstract:
Bilateral trade models the problem of facilitating trades between a seller and a buyer having private valuations for the item being sold. In the online version of the problem, the learner faces a new seller and buyer at each time step, and has to post a price for each of the two parties without any knowledge of their valuations. We consider a scenario where, at each time step, before posting prices, the learner observes a context vector containing information about the features of the item for sale. The valuations of both the seller and the buyer follow an unknown linear function of the context. In this setting, the learner could leverage previous transactions in an attempt to estimate private valuations. We characterize the regret regimes of different settings, taking as a baseline the best context-dependent prices in hindsight. First, in the setting in which the learner has two-bit feedback and strong budget balance constraints, we propose an algorithm with $O(\log T)$ regret. Then, we study the same set-up with noisy valuations, providing a tight $\widetilde O(T^{\frac23})$ regret upper bound. Finally, we show that loosening budget balance constraints allows the learner to operate under more restrictive feedback. Specifically, we show how to address the one-bit, global budget balance setting through a reduction from the two-bit, strong budget balance setup. This establishes a fundamental trade-off between the quality of the feedback and the strictness of the budget constraints.
Submitted 28 May, 2024;
originally announced May 2024.
-
Dynamic online matching with budget refills
Authors:
Maria Cherifa,
Clément Calauzènes,
Vianney Perchet
Abstract:
Inspired by sequential budgeted allocation problems, we study the online matching problem with budget refills. In this context, we consider an online bipartite graph $G=(U,V,E)$, where the nodes in $V$ are discovered sequentially and nodes in $U$ are known beforehand. Each $u\in U$ is endowed with a budget $b_{u,t}\in \mathbb{N}$ that dynamically evolves over time. Unlike the canonical setting, in many applications, the budget can be refilled from time to time, which leads to a much richer dynamic that we consider here. Intuitively, adding extra budgets in $U$ seems to ease the matching task, and our results support this intuition. In fact, in the stochastic framework, where we study the matching built by the Greedy algorithm on an Erdős-Rényi random graph, we show that the matching size generated by Greedy converges with high probability to a solution of an explicit system of ODEs. Moreover, under specific conditions, the competitive ratio (performance measure of the algorithm) can even tend to 1. For the adversarial part, where the graph considered is deterministic and the algorithm used is Balance, the $b$-matching bound holds when the refills are scarce. However, when refills are regular, our results suggest a potential improvement in algorithm performance. In both cases, the Balance algorithm manages to reach the performance of the upper bound on the adversarial graphs considered.
Submitted 16 May, 2024;
originally announced May 2024.
-
Non-clairvoyant Scheduling with Partial Predictions
Authors:
Ziyad Benomar,
Vianney Perchet
Abstract:
The non-clairvoyant scheduling problem has gained new interest within learning-augmented algorithms, where the decision-maker is equipped with predictions without any quality guarantees. In practical settings, access to predictions may be reduced to specific instances, due to cost or data limitations. Our investigation focuses on scenarios where predictions for only $B$ job sizes out of $n$ are available to the algorithm. We first establish near-optimal lower bounds and algorithms in the case of perfect predictions. Subsequently, we present a learning-augmented algorithm satisfying the robustness, consistency, and smoothness criteria, and revealing a novel tradeoff between consistency and smoothness inherent in the scenario with a restricted number of predictions.
Submitted 2 May, 2024;
originally announced May 2024.
-
The Value of Reward Lookahead in Reinforcement Learning
Authors:
Nadav Merlis,
Dorian Baudry,
Vianney Perchet
Abstract:
In reinforcement learning (RL), agents sequentially interact with changing environments while aiming to maximize the obtained rewards. Usually, rewards are observed only after acting, and so the goal is to maximize the expected cumulative reward. Yet, in many practical settings, reward information is observed in advance -- prices are observed before performing transactions; nearby traffic information is partially known; and goals are oftentimes given to agents prior to the interaction. In this work, we aim to quantifiably analyze the value of such future reward information through the lens of competitive analysis. In particular, we measure the ratio between the value of standard RL agents and that of agents with partial future-reward lookahead. We characterize the worst-case reward distribution and derive exact ratios for the worst-case reward expectations. Surprisingly, the resulting ratios relate to known quantities in offline RL and reward-free exploration. We further provide tight bounds for the ratio given the worst-case dynamics. Our results cover the full spectrum between observing the immediate rewards before acting to observing all the rewards before the interaction starts.
Submitted 18 March, 2024;
originally announced March 2024.
-
The Price of Fairness in Bipartite Matching
Authors:
Rémi Castera,
Felipe Garrido-Lucero,
Mathieu Molina,
Simon Mauras,
Patrick Loiseau,
Vianney Perchet
Abstract:
We investigate notions of group fairness in bipartite matching markets involving agents and jobs, where agents are grouped based on sensitive attributes. Employing a geometric approach, we characterize how many agents can be matched in each group, showing that the set of feasible matchings forms a (discrete) polymatroid. We show how we can define weakly-fair matchings geometrically, for which polymatroid properties imply that they are maximal. Next, we focus on strong fairness notions (inspired by group-fairness metrics in machine learning), where each group gets exactly the same fraction of its entitlement, and we explore the Price of Fairness (PoF), i.e., the loss in optimality when imposing such fairness constraints. Importantly, we advocate for the notion of opportunity fairness, where a group entitlement is the maximum number of agents that can be matched without the presence of other competing groups. We show that the opportunity PoF is bounded independently of the number of agents and jobs, but may be linear in the number of groups. Finally, we provide improved bounds with additional structural properties, or with stochastic graphs.
Submitted 1 March, 2024;
originally announced March 2024.
-
Mode Estimation with Partial Feedback
Authors:
Charles Arnal,
Vivien Cabannes,
Vianney Perchet
Abstract:
The combination of lightly supervised pre-training and online fine-tuning has played a key role in recent AI developments. These new learning pipelines call for new theoretical frameworks. In this paper, we formalize core aspects of weakly supervised and active learning with a simple problem: the estimation of the mode of a distribution using partial feedback. We show how entropy coding allows for optimal information acquisition from partial feedback, develop coarse sufficient statistics for mode identification, and adapt bandit algorithms to our new setting. Finally, we combine those contributions into a statistically and computationally efficient solution to our problem.
Submitted 20 February, 2024;
originally announced February 2024.
-
Local and adaptive mirror descents in extensive-form games
Authors:
Côme Fiegel,
Pierre Ménard,
Tadashi Kozuno,
Rémi Munos,
Vianney Perchet,
Michal Valko
Abstract:
We study how to learn $ε$-optimal strategies in zero-sum imperfect information games (IIG) with trajectory feedback. In this setting, players update their policies sequentially based on their observations over a fixed number of episodes, denoted by $T$. Existing procedures suffer from high variance due to the use of importance sampling over sequences of actions (Steinberger et al., 2020; McAleer et al., 2022). To reduce this variance, we consider a fixed sampling approach, where players still update their policies over time, but with observations obtained through a given fixed sampling policy. Our approach is based on an adaptive Online Mirror Descent (OMD) algorithm that applies OMD locally to each information set, using individually decreasing learning rates and a regularized loss. We show that this approach guarantees a convergence rate of $\tilde{\mathcal{O}}(T^{-1/2})$ with high probability and has a near-optimal dependence on the game parameters when applied with the best theoretical choices of learning rates and sampling policies. To achieve these results, we generalize the notion of OMD stabilization, allowing for time-varying regularization with convex increments.
Submitted 1 September, 2023;
originally announced September 2023.
-
Trading-off price for data quality to achieve fair online allocation
Authors:
Mathieu Molina,
Nicolas Gast,
Patrick Loiseau,
Vianney Perchet
Abstract:
We consider the problem of online allocation subject to a long-term fairness penalty. Contrary to existing works, however, we do not assume that the decision-maker observes the protected attributes -- which is often unrealistic in practice. Instead they can purchase data that help estimate them from sources of different quality; and hence reduce the fairness penalty at some cost. We model this problem as a multi-armed bandit problem where each arm corresponds to the choice of a data source, coupled with the online allocation problem. We propose an algorithm that jointly solves both problems and show that it has a regret bounded by $\mathcal{O}(\sqrt{T})$. A key difficulty is that the rewards received by selecting a source are correlated by the fairness penalty, which leads to a need for randomization (despite a stochastic setting). Our algorithm takes into account contextual information available before the source selection, and can adapt to many different fairness notions. We also show that in some instances, the estimates used can be learned on the fly.
Submitted 4 December, 2023; v1 submitted 23 June, 2023;
originally announced June 2023.
-
Online Matching in Geometric Random Graphs
Authors:
Flore Sentenac,
Nathan Noiry,
Matthieu Lerasle,
Laurent Ménard,
Vianney Perchet
Abstract:
We investigate online maximum cardinality matching, a central problem in ad allocation. In this problem, users are revealed sequentially, and each new user can be paired with any previously unmatched campaign that it is compatible with. Despite the limited theoretical guarantees, the greedy algorithm, which matches incoming users with any available campaign, exhibits outstanding performance in practice. Some theoretical support for this practical success was established in specific classes of graphs, where the connections between different vertices lack strong correlations - an assumption not always valid. To bridge this gap, we focus on the following model: both users and campaigns are represented as points uniformly distributed in the interval $[0,1]$, and a user is eligible to be paired with a campaign if they are similar enough, i.e. the distance between their respective points is less than $c/N$, with $c>0$ a model parameter. As a benchmark, we determine the size of the optimal offline matching in these bipartite random geometric graphs. In the online setting, we investigate the number of matches made by the online algorithm closest, which greedily pairs incoming points with their nearest available neighbors. We demonstrate that the algorithm's performance can be compared to its fluid limit, which is characterized as the solution to a specific partial differential equation (PDE). From this PDE solution, we can compute the competitive ratio of closest, and our computations reveal that it remains significantly better than its worst-case guarantee. This model turns out to be related to the online minimum cost matching problem, and we can extend the results to refine certain findings in that area of research. Specifically, we determine the exact asymptotic cost of closest in the $ε$-excess regime, providing a more accurate estimate than the previously known loose upper bound.
Submitted 5 October, 2023; v1 submitted 13 June, 2023;
originally announced June 2023.
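The closest rule described in the abstract above can be simulated in a few lines. The sketch below is an illustration under stated assumptions, not the authors' code: the function name `closest_matching`, the tie-breaking toward the left neighbour, and the parameter values in the demo are all hypothetical. Users and campaigns are uniform points in $[0,1]$, and a user may be matched only to a campaign at distance less than $c/N$.

```python
import bisect
import random


def closest_matching(campaigns, users, radius):
    """Process users in arrival order; match each one to its nearest
    still-available campaign, provided the distance is below `radius`."""
    available = sorted(campaigns)
    matches = []
    for u in users:
        i = bisect.bisect_left(available, u)
        best = None  # (distance, index into `available`)
        # The nearest available campaign is one of the two sorted neighbours.
        for j in (i - 1, i):
            if 0 <= j < len(available):
                d = abs(available[j] - u)
                if d < radius and (best is None or d < best[0]):
                    best = (d, j)
        if best is not None:
            # Remove the campaign so it cannot be matched again.
            matches.append((u, available.pop(best[1])))
    return matches


rng = random.Random(0)
N, c = 1000, 1.0  # hypothetical model parameters for the demo
campaigns = [rng.random() for _ in range(N)]
users = [rng.random() for _ in range(N)]
matches = closest_matching(campaigns, users, c / N)
print(f"fraction of users matched: {len(matches) / N:.3f}")
```

Dividing the printed fraction by the size of an optimal offline matching on the same instance would give an empirical competitive ratio; computing that offline optimum is left out of this sketch.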
-
DU-Shapley: A Shapley Value Proxy for Efficient Dataset Valuation
Authors:
Felipe Garrido-Lucero,
Benjamin Heymann,
Maxime Vono,
Patrick Loiseau,
Vianney Perchet
Abstract:
We consider the dataset valuation problem, that is, the problem of quantifying the incremental gain, to some relevant pre-defined utility of a machine learning task, of aggregating an individual dataset to others. The Shapley value is a natural tool to perform dataset valuation due to its formal axiomatic justification, which can be combined with Monte Carlo integration to overcome the computational tractability challenges. Such generic approximation methods, however, remain expensive in some cases. In this paper, we exploit the knowledge about the structure of the dataset valuation problem to devise more efficient Shapley value estimators. We propose a novel approximation, referred to as discrete uniform Shapley, which is expressed as an expectation under a discrete uniform distribution with support of reasonable size. We justify the relevancy of the proposed framework via asymptotic and non-asymptotic theoretical guarantees and illustrate its benefits via an extensive set of numerical experiments.
Submitted 17 June, 2024; v1 submitted 3 June, 2023;
originally announced June 2023.
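For context, the generic Monte Carlo baseline that the abstract above contrasts with can be sketched as permutation sampling: draw random orderings of the datasets and average each dataset's marginal contribution. This is the standard generic estimator, not the DU-Shapley approximation itself; the function name `monte_carlo_shapley`, the toy dataset sizes, and the square-root utility are assumptions made purely for illustration.

```python
import math
import random


def monte_carlo_shapley(players, utility, num_permutations, seed=0):
    """Estimate Shapley values by averaging marginal contributions over
    uniformly random orderings of the players (generic baseline)."""
    if num_permutations <= 0:
        raise ValueError("num_permutations must be positive")
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    order = list(players)
    for _ in range(num_permutations):
        rng.shuffle(order)
        coalition = set()
        prev = utility(frozenset(coalition))
        for p in order:
            coalition.add(p)
            cur = utility(frozenset(coalition))
            totals[p] += cur - prev  # marginal gain of adding p
            prev = cur
    return {p: t / num_permutations for p, t in totals.items()}


# Hypothetical toy problem: utility grows like sqrt(total number of samples).
sizes = {"a": 100, "b": 400, "c": 900}


def toy_utility(coalition):
    return math.sqrt(sum(sizes[p] for p in coalition))


values = monte_carlo_shapley(list(sizes), toy_utility, num_permutations=200)
print(values)
```

Each sampled permutation costs one utility evaluation per dataset, which is the expense that structure-aware estimators such as the one proposed in the paper aim to reduce.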
-
Constant or logarithmic regret in asynchronous multiplayer bandits
Authors:
Hugo Richard,
Etienne Boursier,
Vianney Perchet
Abstract:
Multiplayer bandits have recently been extensively studied because of their application to cognitive radio networks.
While the literature mostly considers synchronous players, radio networks (e.g. for IoT) tend to have asynchronous devices. This motivates the harder, asynchronous multiplayer bandits problem, which was first tackled with an explore-then-commit (ETC) algorithm (see Dakdouk, 2022), with a regret upper-bound in $\mathcal{O}(T^{\frac{2}{3}})$. Before even considering decentralization, understanding the centralized case was still a challenge as it was unknown whether getting a regret smaller than $Ω(T^{\frac{2}{3}})$ was possible.
We answer this question positively, as a natural extension of UCB exhibits a $\mathcal{O}(\sqrt{T\log(T)})$ minimax regret.
More importantly, we introduce Cautious Greedy, a centralized algorithm that yields constant instance-dependent regret if the optimal policy assigns at least one player on each arm (a situation that is proved to occur when arm means are close enough). Otherwise, its regret increases as the sum of $\log(T)$ over some sub-optimality gaps. We provide lower bounds showing that Cautious Greedy is optimal in the data-dependent terms.
Therefore, we set up a strong baseline for asynchronous multiplayer bandits and suggest that learning the optimal policy in this problem might be easier than thought, at least with centralization.
Submitted 31 May, 2023;
originally announced May 2023.
-
Addressing bias in online selection with limited budget of comparisons
Authors:
Ziyad Benomar,
Evgenii Chzhen,
Nicolas Schreuder,
Vianney Perchet
Abstract:
Consider a hiring process with candidates coming from different universities. It is easy to order candidates with the same background, yet it can be challenging to compare them otherwise. The latter case requires additional costly assessments, leading to a potentially high total cost for the hiring organization. Given an assigned budget, what would be an optimal strategy to select the most qualified candidate? We model the above problem as a multicolor secretary problem, allowing comparisons between candidates from distinct groups at a fixed cost. Our study explores how the allocated budget enhances the success probability of online selection algorithms.
Submitted 20 February, 2024; v1 submitted 16 March, 2023;
originally announced March 2023.
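As background for the multicolor variant studied above, the classical single-group secretary rule can be sketched as follows. This is only the textbook baseline (skip roughly the first $n/e$ candidates, then accept the first one that beats everything seen so far); it does not model groups, comparison costs, or budgets, and the function name `secretary_rule` and the convention of accepting the last candidate when no one qualifies are assumptions for illustration.

```python
import math
import random


def secretary_rule(values):
    """Return the index selected by the classical 1/e stopping rule."""
    n = len(values)
    if n == 0:
        return None
    cutoff = int(n / math.e)  # length of the observation-only phase
    best_seen = max(values[:cutoff], default=float("-inf"))
    for i in range(cutoff, n):
        if values[i] > best_seen:
            return i
    return n - 1  # nobody beat the benchmark: take the last candidate


# Empirical success rate: how often the rule picks the overall best.
rng = random.Random(0)
n, trials, wins = 50, 2000, 0
for _ in range(trials):
    candidates = [rng.random() for _ in range(n)]
    picked = secretary_rule(candidates)
    wins += candidates[picked] == max(candidates)
print(f"empirical success rate: {wins / trials:.3f}")
```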
-
Adapting to game trees in zero-sum imperfect information games
Authors:
Côme Fiegel,
Pierre Ménard,
Tadashi Kozuno,
Rémi Munos,
Vianney Perchet,
Michal Valko
Abstract:
Imperfect information games (IIG) are games in which each player only partially observes the current game state. We study how to learn $ε$-optimal strategies in a zero-sum IIG through self-play with trajectory feedback. We give a problem-independent lower bound $\widetilde{\mathcal{O}}(H(A_{\mathcal{X}}+B_{\mathcal{Y}})/ε^2)$ on the required number of realizations to learn these strategies with high probability, where $H$ is the length of the game, $A_{\mathcal{X}}$ and $B_{\mathcal{Y}}$ are the total number of actions for the two players. We also propose two Follow the Regularized Leader (FTRL) algorithms for this setting: Balanced FTRL, which matches this lower bound but requires knowledge of the information set structure beforehand to define the regularization; and Adaptive FTRL, which needs $\widetilde{\mathcal{O}}(H^2(A_{\mathcal{X}}+B_{\mathcal{Y}})/ε^2)$ realizations without this requirement by progressively adapting the regularization to the observations.
Submitted 15 February, 2023; v1 submitted 23 December, 2022;
originally announced December 2022.
-
A survey on multi-player bandits
Authors:
Etienne Boursier,
Vianney Perchet
Abstract:
Due mostly to its application to cognitive radio networks, the multiplayer bandits problem has gained a lot of interest in the last decade. Considerable progress has been made on its theoretical aspect. However, the current algorithms are far from applicable and many obstacles remain between these theoretical results and a possible implementation of multiplayer bandits algorithms in real cognitive radio networks. This survey contextualizes and organizes the rich multiplayer bandits literature. In light of the existing works, some clear directions for future research appear. We believe that a further study of these different directions might lead to theoretical algorithms adapted to real-world situations.
Submitted 3 June, 2024; v1 submitted 29 November, 2022;
originally announced November 2022.
-
Stochastic Mirror Descent for Large-Scale Sparse Recovery
Authors:
Sasila Ilandarideva,
Yannis Bekri,
Anatoli Juditsky,
Vianney Perchet
Abstract:
In this paper we discuss an application of Stochastic Approximation to statistical estimation of high-dimensional sparse parameters. The proposed solution reduces to resolving a penalized stochastic optimization problem at each stage of a multistage algorithm; each problem being solved to a prescribed accuracy by the non-Euclidean Composite Stochastic Mirror Descent (CSMD) algorithm. Assuming that the problem objective is smooth and quadratically minorated and stochastic perturbations are sub-Gaussian, our analysis prescribes the method parameters which ensure fast convergence of the estimation error (the radius of a confidence ball of a given norm around the approximate solution). This convergence is linear during the first "preliminary" phase of the routine and is sublinear during the second "asymptotic" phase. We consider an application of the proposed approach to the sparse Generalized Linear Regression problem. In this setting, we show that the proposed algorithm attains the optimal convergence of the estimation error under weak assumptions on the regressor distribution. We also present a numerical study illustrating the performance of the algorithm on high-dimensional simulation data.
Submitted 23 October, 2022;
originally announced October 2022.
-
On Preemption and Learning in Stochastic Scheduling
Authors:
Nadav Merlis,
Hugo Richard,
Flore Sentenac,
Corentin Odic,
Mathieu Molina,
Vianney Perchet
Abstract:
We study single-machine scheduling of jobs, each belonging to a job type that determines its duration distribution. We start by analyzing the scenario where the type characteristics are known and then move to two learning scenarios where the types are unknown: non-preemptive problems, where each started job must be completed before moving to another job; and preemptive problems, where job execution can be paused in favor of moving to a different job. In both cases, we design algorithms that achieve sublinear excess cost, compared to the performance with known types, and prove lower bounds for the non-preemptive case. Notably, we demonstrate, both theoretically and through simulations, how preemptive algorithms can greatly outperform non-preemptive ones when the durations of different job types are far from one another, a phenomenon that does not occur when the type durations are known.
Submitted 1 June, 2023; v1 submitted 31 May, 2022;
originally announced May 2022.
-
Active Labeling: Streaming Stochastic Gradients
Authors:
Vivien Cabannes,
Francis Bach,
Vianney Perchet,
Alessandro Rudi
Abstract:
The workhorse of machine learning is stochastic gradient descent. To access stochastic gradients, it is common to iterate over the input/output pairs of a training dataset. Interestingly, it appears that one does not need full supervision to access stochastic gradients, which is the main motivation of this paper. After formalizing the "active labeling" problem, which focuses on active learning with partial supervision, we provide a streaming technique that provably minimizes the ratio of generalization error over the number of samples. We illustrate our technique in depth for robust regression.
Submitted 7 December, 2022; v1 submitted 26 May, 2022;
originally announced May 2022.
-
An algorithmic solution to the Blotto game using multi-marginal couplings
Authors:
Vianney Perchet,
Philippe Rigollet,
Thibaut Le Gouic
Abstract:
We describe an efficient algorithm to compute solutions for the general two-player Blotto game on n battlefields with heterogeneous values. While explicit constructions for such solutions have been limited to specific, largely symmetric or homogeneous, setups, this algorithmic resolution covers the most general situation to date: value-asymmetric game with asymmetric budget. The proposed algorithm rests on recent theoretical advances regarding Sinkhorn iterations for matrix and tensor scaling. An important case which had been out of reach of previous attempts is that of heterogeneous but symmetric battlefield values with asymmetric budget. In this case, the Blotto game is constant-sum so optimal solutions exist, and our algorithm samples from an $\varepsilon$-optimal solution in time $\tilde{\mathcal{O}}(n^2 + \varepsilon^{-4})$, independently of budgets and battlefield values. In the case of asymmetric values where optimal solutions need not exist but Nash equilibria do, our algorithm samples from an $\varepsilon$-Nash equilibrium with similar complexity but where implicit constants depend on various parameters of the game such as battlefield values.
Submitted 31 May, 2022; v1 submitted 15 February, 2022;
originally announced February 2022.
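The Sinkhorn iterations mentioned in the abstract above can be illustrated, in their simplest two-marginal matrix form, by the sketch below. It is only a minimal illustration of matrix scaling, not the multi-marginal tensor algorithm of the paper; the function name `sinkhorn_scale`, the example matrix and the target marginals are all assumptions made for the example.

```python
def sinkhorn_scale(K, r, c, n_iter=500):
    """Scale a positive matrix K so that its row sums are r and column sums are c.

    Assumes all entries of K are positive and sum(r) == sum(c).
    """
    n, m = len(K), len(K[0])
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # Row update: make the row sums of diag(u) K diag(v) equal to r.
        u = [r[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Column update: make the column sums equal to c.
        v = [c[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical 2x2 example with prescribed marginals.
K = [[1.0, 2.0], [3.0, 1.0]]
r, c = [0.5, 0.5], [0.3, 0.7]
P = sinkhorn_scale(K, r, c)
row_sums = [sum(row) for row in P]
col_sums = [sum(P[i][j] for i in range(2)) for j in range(2)]
```

Each pass alternately rescales rows and columns; for a strictly positive matrix the scaled matrix converges to one whose marginals match both targets.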
-
Privacy Amplification via Shuffling for Linear Contextual Bandits
Authors:
Evrard Garcelon,
Kamalika Chaudhuri,
Vianney Perchet,
Matteo Pirotta
Abstract:
Contextual bandit algorithms are widely used in domains where it is desirable to provide a personalized service by leveraging contextual information, which may contain sensitive information that needs to be protected. Inspired by this scenario, we study the contextual linear bandit problem with differential privacy (DP) constraints. While the literature has focused on either centralized (joint DP) or local (local DP) privacy, we consider the shuffle model of privacy and we show that it is possible to achieve a privacy/utility trade-off between JDP and LDP. By leveraging shuffling from privacy and batching from bandits, we present an algorithm with regret bound $\widetilde{\mathcal{O}}(T^{2/3}/\varepsilon^{1/3})$, while guaranteeing both central (joint) and local privacy. Our result shows that it is possible to obtain a trade-off between JDP and LDP by leveraging the shuffle model while preserving local privacy.
Submitted 11 December, 2021;
originally announced December 2021.
-
Stochastic Online Linear Regression: the Forward Algorithm to Replace Ridge
Authors:
Reda Ouhamma,
Odalric Maillard,
Vianney Perchet
Abstract:
We consider the problem of online linear regression in the stochastic setting. We derive high probability regret bounds for online ridge regression and the forward algorithm. This enables us to compare online regression algorithms more accurately and eliminate assumptions of bounded observations and predictions. Our study advocates for the use of the forward algorithm in lieu of ridge due to its enhanced bounds and robustness to the regularization parameter. Moreover, we explain how to integrate it in algorithms involving linear function approximation to remove a boundedness assumption without deteriorating theoretical bounds. We showcase this modification in linear bandit settings where it yields improved regret bounds. Last, we provide numerical experiments to illustrate our results and endorse our intuitions.
Submitted 2 November, 2021;
originally announced November 2021.
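To make the difference between the two predictors compared in the abstract above concrete, here is a minimal sketch in the scalar case. It assumes the standard definitions: online ridge regression fits regularized least squares on past rounds only, while the forward algorithm (also known as the Vovk-Azoury-Warmuth forecaster) additionally includes the current feature in the regularized Gram term before predicting. The function name, the regularization value `lam` and the simulated data are assumptions made for the example.

```python
import random


def online_predictions(xs, ys, lam=1.0):
    """Return (ridge_preds, forward_preds) for scalar online regression."""
    s_xx = 0.0  # running sum of x_s^2 over past rounds
    s_xy = 0.0  # running sum of x_s * y_s over past rounds
    ridge, forward = [], []
    for x, y in zip(xs, ys):
        # Ridge: regularized least squares on past data only.
        ridge.append(x * s_xy / (lam + s_xx))
        # Forward: the current feature x also enters the denominator.
        forward.append(x * s_xy / (lam + s_xx + x * x))
        # The label y is revealed only now; update the statistics.
        s_xx += x * x
        s_xy += x * y
    return ridge, forward


# Hypothetical data: y = 2x plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(200)]
ys = [2.0 * x + rng.gauss(0, 0.1) for x in xs]
ridge, forward = online_predictions(xs, ys)
```

In this scalar form the forward prediction is always shrunk towards zero relative to the ridge prediction, because its denominator contains the extra term `x * x`.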
-
Online Sign Identification: Minimization of the Number of Errors in Thresholding Bandits
Authors:
Reda Ouhamma,
Rémy Degenne,
Pierre Gaillard,
Vianney Perchet
Abstract:
In the fixed budget thresholding bandit problem, an algorithm sequentially allocates a budgeted number of samples to different distributions. It then predicts whether the mean of each distribution is larger or smaller than a given threshold. We introduce a large family of algorithms (containing most existing relevant ones), inspired by the Frank-Wolfe algorithm, and provide a thorough yet generic analysis of their performance. This allowed us to construct new explicit algorithms, for a broad class of problems, whose losses are within a small constant factor of the non-adaptive oracle ones. Quite interestingly, we observed that adaptive methods empirically greatly outperform non-adaptive oracles, an uncommon behavior in standard online learning settings, such as regret minimization. We explain this surprising phenomenon on an insightful toy problem.
Submitted 18 October, 2021;
originally announced October 2021.
-
Pure Exploration and Regret Minimization in Matching Bandits
Authors:
Flore Sentenac,
Jialin Yi,
Clément Calauzènes,
Vianney Perchet,
Milan Vojnovic
Abstract:
Finding an optimal matching in a weighted graph is a standard combinatorial problem. We consider its semi-bandit version where either a pair or a full matching is sampled sequentially. We prove that it is possible to leverage a rank-1 assumption on the adjacency matrix to reduce the sample complexity and the regret of off-the-shelf algorithms up to reaching a linear dependency in the number of vertices (up to poly log terms).
Submitted 31 July, 2021;
originally announced August 2021.
-
Online Matching in Sparse Random Graphs: Non-Asymptotic Performances of Greedy Algorithm
Authors:
Nathan Noiry,
Flore Sentenac,
Vianney Perchet
Abstract:
Motivated by sequential budgeted allocation problems, we investigate online matching problems where connections between vertices are not i.i.d., but they have fixed degree distributions -- the so-called configuration model. We estimate the competitive ratio of the simplest algorithm, GREEDY, by approximating some relevant stochastic discrete processes by their continuous counterparts, that are solutions of an explicit system of partial differential equations. This technique gives precise bounds on the estimation errors, with arbitrarily high probability as the problem size increases. In particular, it allows the formal comparison between different configuration models. We also prove that, quite surprisingly, GREEDY can have better performance guarantees than RANKING, another celebrated algorithm for online matching that usually outperforms the former.
Submitted 2 July, 2021;
originally announced July 2021.
-
Unsupervised Neural Hidden Markov Models with a Continuous latent state space
Authors:
Firas Jarboui,
Vianney Perchet
Abstract:
We introduce a new procedure to neuralize unsupervised Hidden Markov Models in the continuous case. This provides higher flexibility to solve problems with underlying latent variables. This approach is evaluated on both synthetic and real data. On top of generating likely model parameters with performance comparable to off-the-shelf neural architectures (LSTMs, GRUs, ...), the obtained results are easily interpretable.
Submitted 10 June, 2021;
originally announced June 2021.
-
Offline Inverse Reinforcement Learning
Authors:
Firas Jarboui,
Vianney Perchet
Abstract:
The objective of offline RL is to learn optimal policies when a fixed exploratory demonstrations data-set is available and sampling additional observations is impossible (typically if this operation is either costly or raises ethical questions). In order to solve this problem, off-the-shelf approaches require a properly defined cost function (or its evaluation on the provided data-set), which is seldom available in practice. To circumvent this issue, a reasonable alternative is to query an expert for a few optimal demonstrations in addition to the exploratory data-set. The objective is then to learn an optimal policy w.r.t. the expert's latent cost function. Current solutions either solve a behaviour cloning problem (which does not leverage the exploratory data) or a reinforced imitation learning problem (using a fixed cost function that discriminates available exploratory trajectories from expert ones). Inspired by the success of IRL techniques in achieving state-of-the-art imitation performance in online settings, we exploit GAN-based data augmentation procedures to construct the first offline IRL algorithm. The obtained policies outperformed the aforementioned solutions on multiple OpenAI gym environments.
Submitted 9 June, 2021;
originally announced June 2021.
-
Quickest change detection with unknown parameters: Constant complexity and near optimality
Authors:
Firas Jarboui,
Vianney Perchet
Abstract:
We consider the quickest change detection problem where both the parameters of pre- and post-change distributions are unknown, which prevents the use of classical simple hypothesis testing. Without additional assumptions, optimal solutions are not tractable as they rely on some minimax and robust variant of the objective. As a consequence, change points might be detected too late for practical applications (in economics, health care or maintenance for instance). Available constant complexity techniques typically solve a relaxed version of the problem, deeply relying on very specific probability distributions and/or some very precise additional knowledge. We consider a totally different approach that leverages the theoretical asymptotic properties of optimal solutions to derive a new scalable approximate algorithm with near optimal performance that runs in $\mathcal{O}(1)$, adapted to even more complex Markovian settings.
Submitted 9 June, 2021;
originally announced June 2021.
-
Decentralized Learning in Online Queuing Systems
Authors:
Flore Sentenac,
Etienne Boursier,
Vianney Perchet
Abstract:
Motivated by packet routing in computer networks, online queuing systems are composed of queues receiving packets at different rates. Repeatedly, they send packets to servers, each of them treating at most one packet at a time. In the centralized case, the number of accumulated packets remains bounded (i.e., the system is \textit{stable}) as long as the ratio between service rates and arrival rates is larger than $1$. In the decentralized case, individual no-regret strategies ensure stability when this ratio is larger than $2$. Yet, myopically minimizing regret disregards the long term effects due to the carryover of packets to further rounds. On the other hand, minimizing long term costs leads to stable Nash equilibria as soon as the ratio exceeds $\frac{e}{e-1}$. Stability with decentralized learning strategies with a ratio below $2$ was a major remaining question. We first argue that for ratios up to $2$, cooperation is required for stability of learning strategies, as selfish minimization of policy regret, a \textit{patient} notion of regret, might indeed still be unstable in this case. We therefore consider cooperative queues and propose the first decentralized learning algorithm guaranteeing stability of the system as long as the ratio of rates is larger than $1$, thus reaching performance comparable to centralized strategies.
Submitted 4 November, 2021; v1 submitted 8 June, 2021;
originally announced June 2021.
-
A Generalised Inverse Reinforcement Learning Framework
Authors:
Firas Jarboui,
Vianney Perchet
Abstract:
The global objective of inverse Reinforcement Learning (IRL) is to estimate the unknown cost function of some MDP based on observed trajectories generated by (approximate) optimal policies. The classical approach consists in tuning this cost function so that associated optimal trajectories (that minimise the cumulative discounted cost, i.e. the classical RL loss) are 'similar' to the observed ones. Prior contributions focused on penalising degenerate solutions and improving algorithmic scalability. Quite orthogonally to them, we question the pertinence of characterising optimality with respect to the cumulative discounted cost as it induces an implicit bias against policies with longer mixing times. State of the art value based RL algorithms circumvent this issue by solving for the fixed point of the Bellman optimality operator, a stronger criterion that is not well defined for the inverse problem. To alleviate this bias in IRL, we introduce an alternative training loss that puts more weight on future states, which yields a reformulation of the (maximum entropy) IRL problem. The algorithms we devised exhibit better performance than off-the-shelf ones (with similar tractability) in multiple OpenAI gym environments.
Submitted 25 May, 2021;
originally announced May 2021.
-
Encrypted Linear Contextual Bandit
Authors:
Evrard Garcelon,
Vianney Perchet,
Matteo Pirotta
Abstract:
Contextual bandit is a general framework for online learning in sequential decision-making problems that has found application in a wide range of domains, including recommendation systems, online advertising, and clinical trials. A critical aspect of bandit methods is that they require observing the contexts (i.e., individual or group-level data) and rewards in order to solve the sequential problem. The large deployment in industrial applications has increased interest in methods that preserve the users' privacy. In this paper, we introduce a privacy-preserving bandit framework based on homomorphic encryption, which allows computations using encrypted data. The algorithm \textit{only} observes encrypted information (contexts and rewards) and has no ability to decrypt it. Leveraging the properties of homomorphic encryption, we show that despite the complexity of the setting, it is possible to solve linear contextual bandits over encrypted data with a $\widetilde{O}(d\sqrt{T})$ regret bound in any linear contextual bandit problem, while keeping data encrypted.
Submitted 23 March, 2022; v1 submitted 17 March, 2021;
originally announced March 2021.
-
Making the most of your day: online learning for optimal allocation of time
Authors:
Etienne Boursier,
Tristan Garrec,
Vianney Perchet,
Marco Scarsini
Abstract:
We study online learning for optimal allocation when the resource to be allocated is time. Examples of possible applications include job scheduling for a computing server, a driver filling a day with rides, a landlord renting an estate, etc. An agent receives task proposals sequentially according to a Poisson process and can either accept or reject a proposed task. If she accepts the proposal, she is busy for the duration of the task and obtains a reward that depends on the task duration. If she rejects it, she remains on hold until a new task proposal arrives. We study the regret incurred by the agent, first when she knows her reward function but does not know the distribution of the task duration, and then when she does not know her reward function, either. This natural setting bears similarities with contextual (one-armed) bandits, but with the crucial difference that the normalized reward associated to a context depends on the whole distribution of contexts.
Submitted 4 November, 2021; v1 submitted 16 February, 2021;
originally announced February 2021.
-
Be Greedy in Multi-Armed Bandits
Authors:
Matthieu Jedor,
Jonathan Louëdec,
Vianney Perchet
Abstract:
The Greedy algorithm is the simplest heuristic in sequential decision problems: it carelessly takes the locally optimal choice at each round, disregarding any advantages of exploring and/or information gathering. Theoretically, it is known to sometimes perform poorly, for instance incurring linear regret (with respect to the time horizon) in the standard multi-armed bandit problem. On the other hand, this heuristic performs reasonably well in practice and it even has sublinear, and even near-optimal, regret bounds in some very specific linear contextual and Bayesian bandit models. We build on a recent line of work and investigate bandit settings where the number of arms is relatively large and where simple greedy algorithms enjoy highly competitive performance, both in theory and in practice. We first provide a generic worst-case bound on the regret of the Greedy algorithm. When combined with some arms subsampling, we prove that it satisfies near-optimal worst-case regret bounds in continuous, infinite and many-armed bandit problems. Moreover, for shorter time spans, the theoretical relative suboptimality of Greedy is even reduced. As a consequence, we subversively claim that for many interesting problems and associated horizons, the best compromise between theoretical guarantees, practical performance and computational burden is definitely to follow the greedy heuristic. We support our claim by many numerical experiments that show significant improvements compared to the state-of-the-art, even for moderately long time horizons.
Submitted 4 January, 2021;
originally announced January 2021.
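The greedy heuristic with arm subsampling described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: rewards are taken to be Bernoulli, each retained arm is pulled once to initialize its empirical mean, and the subsample size `n_sub` is a hypothetical parameter chosen by hand rather than the value prescribed by the paper's analysis.

```python
import random


def greedy_bandit(means, horizon, n_sub, seed=0):
    """Run Greedy on a random subset of n_sub arms with Bernoulli rewards.

    Assumes horizon >= n_sub so that every retained arm is pulled once.
    """
    rng = random.Random(seed)
    arms = rng.sample(range(len(means)), n_sub)  # keep a random subset of arms
    counts = {a: 0 for a in arms}
    sums = {a: 0.0 for a in arms}
    total = 0.0
    for t in range(horizon):
        if t < n_sub:
            a = arms[t]  # initialization: pull each retained arm once
        else:
            # Greedy step: play the arm with the highest empirical mean.
            a = max(arms, key=lambda k: sums[k] / counts[k])
        reward = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1
        sums[a] += reward
        total += reward
    return total, counts


# Hypothetical instance: 10 arms with means 0.0, 0.1, ..., 0.9.
total, counts = greedy_bandit([0.1 * k for k in range(10)], horizon=1000, n_sub=4)
```

No exploration bonus is used at any point; the only source of exploration is the single initial pull of each retained arm.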
-
Lifelong Learning in Multi-Armed Bandits
Authors:
Matthieu Jedor,
Jonathan Louëdec,
Vianney Perchet
Abstract:
Continuously learning and leveraging the knowledge accumulated from prior tasks in order to improve future performance is a long standing machine learning problem. In this paper, we study the problem in the multi-armed bandit framework with the objective to minimize the total regret incurred over a series of tasks. While most bandit algorithms are designed to have a low worst-case regret, we examine here the average regret over bandit instances drawn from some prior distribution which may change over time. We specifically focus on confidence interval tuning of UCB algorithms. We propose a bandit over bandit approach with greedy algorithms and we perform extensive experimental evaluations in both stationary and non-stationary environments. We further apply our solution to the mortal bandit problem, showing empirical improvement over previous work.
Submitted 28 December, 2020;
originally announced December 2020.
-
Learning in repeated auctions
Authors:
Thomas Nedelec,
Clément Calauzènes,
Noureddine El Karoui,
Vianney Perchet
Abstract:
Online auctions are one of the most fundamental facets of the modern economy and power an industry generating hundreds of billions of dollars a year in revenue. Auction theory has historically focused on the question of designing the best way to sell a single item to potential buyers, with the concurrent objectives of maximizing revenue generated or welfare created. Theoretical results in this area have typically relied on some prior Bayesian knowledge that agents were assumed to have about each other. This assumption is no longer satisfied in new markets such as online advertising: similar items are sold repeatedly, and agents are unaware of each other or might try to manipulate each other. On the other hand, statistical learning theory now provides tools to supplement those missing pieces of information given enough data, as agents can learn from their environment to improve their strategies. This survey covers recent advances in learning in repeated auctions, starting from the traditional economic study of optimal one-shot auctions with a Bayesian prior. We then focus on the question of learning optimal mechanisms from a dataset of bidders' past values. The sample complexity as well as the computational efficiency of different methods will be studied. We will also investigate online variants where gathering data has a cost to be accounted for, either by seller or buyers ("earning while learning"). Later in the survey, we will further assume that bidders are also adaptive to the mechanism as they interact repeatedly with the same seller. We will show how strategic agents can actually manipulate repeated auctions, to their own advantage. All the questions discussed in this survey are grounded in real-world applications and many of the ideas and algorithms we describe are used every day to power the Internet economy.
Submitted 22 September, 2021; v1 submitted 18 November, 2020;
originally announced November 2020.
-
Robustness of Community Detection to Random Geometric Perturbations
Authors:
Sandrine Peche,
Vianney Perchet
Abstract:
We consider the stochastic block model where connection between vertices is perturbed by some latent (and unobserved) random geometric graph. The objective is to prove that spectral methods are robust to this type of noise, even if they are agnostic to the presence (or not) of the random graph. We provide explicit regimes where the second eigenvector of the adjacency matrix is highly correlated to the true community vector (and therefore when weak/exact recovery is possible). This is possible thanks to a detailed analysis of the spectrum of the latent random graph, which is of independent interest.
Submitted 9 November, 2020;
originally announced November 2020.
-
Local Differential Privacy for Regret Minimization in Reinforcement Learning
Authors:
Evrard Garcelon,
Vianney Perchet,
Ciara Pike-Burke,
Matteo Pirotta
Abstract:
Reinforcement learning algorithms are widely used in domains where it is desirable to provide a personalized service. In these domains it is common that user data contains sensitive information that needs to be protected from third parties. Motivated by this, we study privacy in the context of finite-horizon Markov Decision Processes (MDPs) by requiring information to be obfuscated on the user side. We formulate this notion of privacy for RL by leveraging the local differential privacy (LDP) framework. We establish a lower bound for regret minimization in finite-horizon MDPs with LDP guarantees which shows that guaranteeing privacy has a multiplicative effect on the regret. This result shows that while LDP is an appealing notion of privacy, it makes the learning problem significantly more complex. Finally, we present an optimistic algorithm that simultaneously satisfies $\varepsilon$-LDP requirements, and achieves $\sqrt{K}/\varepsilon$ regret in any finite-horizon MDP after $K$ episodes, matching the lower bound dependency on the number of episodes $K$.
Submitted 27 October, 2021; v1 submitted 15 October, 2020;
originally announced October 2020.
-
Social Learning in Non-Stationary Environments
Authors:
Etienne Boursier,
Vianney Perchet,
Marco Scarsini
Abstract:
Potential buyers of a product or service, before making their decisions, tend to read reviews written by previous consumers. We consider Bayesian consumers with heterogeneous preferences, who sequentially decide whether to buy an item of unknown quality, based on previous buyers' reviews. The quality is multi-dimensional and may occasionally vary over time; the reviews are also multi-dimensional. In the simple uni-dimensional and static setting, beliefs about the quality are known to converge to its true value. Our paper extends this result in several ways: first, a multi-dimensional quality is considered; second, rates of convergence are provided; third, a dynamical Markovian model with varying quality is studied. In this dynamical setting the cost of learning is shown to be small.
Submitted 23 February, 2022; v1 submitted 20 July, 2020;
originally announced July 2020.
-
Statistical Efficiency of Thompson Sampling for Combinatorial Semi-Bandits
Authors:
Pierre Perrault,
Etienne Boursier,
Vianney Perchet,
Michal Valko
Abstract:
We investigate stochastic combinatorial multi-armed bandit with semi-bandit feedback (CMAB). In CMAB, the question of the existence of an efficient policy with an optimal asymptotic regret (up to a factor poly-logarithmic with the action size) is still open for many families of distributions, including mutually independent outcomes, and more generally the multivariate sub-Gaussian family. We propose to answer the above question for these two families by analyzing variants of the Combinatorial Thompson Sampling policy (CTS). For mutually independent outcomes in $[0,1]$, we propose a tight analysis of CTS using Beta priors. We then look at the more general setting of multivariate sub-Gaussian outcomes and propose a tight analysis of CTS using Gaussian priors. This last result gives us an alternative to the Efficient Sampling for Combinatorial Bandit policy (ESCB), which, although optimal, is not computationally efficient.
Submitted 3 January, 2021; v1 submitted 11 June, 2020;
originally announced June 2020.
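To make the idea of Combinatorial Thompson Sampling with Beta priors concrete, here is a minimal sketch for one simple instance. It assumes Bernoulli outcomes, an action set made of all subsets of `m` arms (so the oracle step is just a top-`m` selection), and uniform Beta(1, 1) priors; the function name and these choices are illustrative and are not taken from the paper.

```python
import random


def combinatorial_thompson_sampling(means, m, horizon, seed=0):
    """Play top-m Thompson sampling on Bernoulli arms and return pull counts."""
    rng = random.Random(seed)
    k = len(means)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(horizon):
        # Draw one plausible mean per arm from its Beta posterior.
        samples = [
            rng.betavariate(successes[i] + 1, failures[i] + 1) for i in range(k)
        ]
        # Oracle step: with top-m actions, keep the m largest samples.
        action = sorted(range(k), key=lambda i: samples[i], reverse=True)[:m]
        # Semi-bandit feedback: the outcome of every played arm is observed.
        for i in action:
            outcome = 1 if rng.random() < means[i] else 0
            successes[i] += outcome
            failures[i] += 1 - outcome
            pulls[i] += 1
    return pulls


pulls = combinatorial_thompson_sampling([0.9, 0.8, 0.2, 0.1], m=2, horizon=2000)
```

Each round costs one posterior sample per arm plus one call to the combinatorial oracle, which is the sense in which this family of policies is computationally light.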
-
Categorized Bandits
Authors:
Matthieu Jedor,
Jonathan Louedec,
Vianney Perchet
Abstract:
We introduce a new stochastic multi-armed bandit setting where arms are grouped inside ``ordered'' categories. The motivating example comes from e-commerce, where a customer typically has a greater appetence for items of a specific well-identified but unknown category than any other one. We introduce three concepts of ordering between categories, inspired by stochastic dominance between random variables, which are gradually weaker so that more and more bandit scenarios satisfy at least one of them. We first prove instance-dependent lower bounds on the cumulative regret for each of these models, indicating how the complexity of the bandit problems increases with the generality of the ordering concept considered. We also provide algorithms that fully leverage the structure of the model with their associated theoretical guarantees. Finally, we have conducted an analysis on real data to highlight that those ordered categories actually exist in practice.
Submitted 4 May, 2020;
originally announced May 2020.
-
Selfish Robustness and Equilibria in Multi-Player Bandits
Authors:
Etienne Boursier,
Vianney Perchet
Abstract:
Motivated by cognitive radios, stochastic multi-player multi-armed bandits have gained a lot of interest recently. In this class of problems, several players simultaneously pull arms and encounter a collision - with 0 reward - if some of them pull the same arm at the same time. While the cooperative case where players maximize the collective reward (obediently following some fixed protocol) has been mostly considered, robustness to malicious players is a crucial and challenging concern. Existing approaches consider only the case of adversarial jammers whose objective is to blindly minimize the collective reward. We shall consider instead the more natural class of selfish players whose incentives are to maximize their individual rewards, potentially at the expense of the social welfare. We provide the first algorithm robust to selfish players (a.k.a. Nash equilibrium) with a logarithmic regret, when the arm performance is observed. When collisions are also observed, Grim Trigger type of strategies enable some implicit communication-based algorithms and we construct robust algorithms in two different settings: the homogeneous (with a regret comparable to the centralized optimal one) and heterogeneous cases (for an adapted and relevant notion of regret). We also provide impossibility results when only the reward is observed or when arm means vary arbitrarily among players.
Submitted 19 June, 2020; v1 submitted 4 February, 2020;
originally announced February 2020.
-
Adversarial learning for revenue-maximizing auctions
Authors:
Thomas Nedelec,
Jules Baudet,
Vianney Perchet,
Noureddine El Karoui
Abstract:
We introduce a new numerical framework to learn optimal bidding strategies in repeated auctions when the seller uses past bids to optimize her mechanism. Crucially, we do not assume that the bidders know what optimization mechanism is used by the seller. We recover essentially all state-of-the-art analytical results for the single-item framework derived previously in the setup where the bidder knows the optimization mechanism used by the seller and extend our approach to multi-item settings, in which no optimal shading strategies were previously known. Our approach yields substantial increases in bidder utility in all settings. Our approach also has a strong potential for practical usage since it provides a simple way to optimize bidding strategies on modern marketplaces where buyers face unknown data-driven mechanisms.
Submitted 8 February, 2021; v1 submitted 15 September, 2019;
originally announced September 2019.
-
Markov Decision Process for MOOC users behavioral inference
Authors:
Firas Jarboui,
Célya Gruson-daniel,
Pierre Chanial,
Alain Durmus,
Vincent Rocchisani,
Sophie-helene Goulet Ebongue,
Anneliese Depoux,
Wilfried Kirschenmann,
Vianney Perchet
Abstract:
Studies on users of massive open online courses (MOOCs) discuss the existence of typical profiles and their impact on the students' learning process. However, defining the typical behaviors and classifying the users accordingly is a difficult task. In this paper, we suggest two methods to model the behavior of MOOC users given their log data. We cast their behavior in a Markov Decision Process framework. We associate the user's intentions with the MDP reward and argue that this allows us to classify them.
Submitted 10 March, 2021; v1 submitted 10 July, 2019;
originally announced July 2019.
-
Online A-Optimal Design and Active Linear Regression
Authors:
Xavier Fontaine,
Pierre Perrault,
Michal Valko,
Vianney Perchet
Abstract:
We consider in this paper the problem of optimal experiment design where a decision maker can choose which points to sample to obtain an estimate $\hat{β}$ of the hidden parameter $β^{\star}$ of an underlying linear model. The key challenge of this work lies in the heteroscedasticity assumption that we make, meaning that each covariate has a different and unknown variance. The goal of the decision maker is then to figure out on the fly the optimal way to allocate the total budget of $T$ samples between covariates, as sampling a specific one several times will reduce the variance of the estimated model around it (but at the cost of a possibly higher variance elsewhere). By trying to minimize the $\ell^2$-loss $\mathbb{E} [\lVert\hat{β}-β^{\star}\rVert^2]$ the decision maker is actually minimizing the trace of the covariance matrix of the problem, which corresponds then to online A-optimal design. Combining techniques from bandit and convex optimization we propose a new active sampling algorithm and we compare it with existing ones. We provide theoretical guarantees of this algorithm in different settings, including a $\mathcal{O}(T^{-2})$ regret bound in the case where the covariates form a basis of the feature space, generalizing and improving existing results. Numerical experiments validate our theoretical findings.
Submitted 30 December, 2020; v1 submitted 20 June, 2019;
originally announced June 2019.
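The budget-allocation trade-off described in this abstract can be illustrated on a deliberately simplified offline case. The sketch assumes the covariates are the canonical basis vectors and that the noise standard deviations `sigmas` are known, whereas the paper's setting has unknown variances learned online; all names are hypothetical. Under these assumptions each coordinate is estimated by a sample mean, the trace of the covariance is $\sum_k \sigma_k^2 / n_k$, and it is minimized under $\sum_k n_k = T$ by taking $n_k$ proportional to $\sigma_k$.

```python
def a_optimal_proportions(sigmas):
    """Sampling proportions that minimize sum_k sigma_k^2 / n_k (known sigmas)."""
    total = sum(sigmas)
    return [s / total for s in sigmas]


def trace_of_covariance(sigmas, proportions, budget):
    # With canonical basis covariates, coordinate k is a sample mean over
    # n_k = proportions[k] * budget draws, so its variance is sigma_k^2 / n_k.
    return sum(s ** 2 / (p * budget) for s, p in zip(sigmas, proportions))


sigmas = [1.0, 2.0, 3.0]
budget = 600
optimal = a_optimal_proportions(sigmas)
uniform = [1.0 / len(sigmas)] * len(sigmas)
loss_optimal = trace_of_covariance(sigmas, optimal, budget)  # (1+2+3)^2 / 600
loss_uniform = trace_of_covariance(sigmas, uniform, budget)  # (1+4+9) / 200
```

In this toy instance the variance-aware allocation gives a loss of 0.06 against 0.07 for uniform sampling; an online method has to approach such an allocation while still estimating the variances.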
-
Robust Stackelberg buyers in repeated auctions
Authors:
Clément Calauzènes,
Thomas Nedelec,
Vianney Perchet,
Noureddine El Karoui
Abstract:
We consider the practical and classical setting where the seller is using an exploration stage to learn the value distributions of the bidders before running a revenue-maximizing auction in an exploitation phase. In this two-stage process, we exhibit practical, simple and robust strategies with large utility uplifts for the bidders. We quantify precisely the seller revenue against non-discounted buyers, complementing recent studies that had focused on impatient/heavily discounted buyers. We also prove the robustness of these shading strategies to the seller's sample approximation error, to the bidder's approximation error of the competition, and to possible changes of the mechanism.
Submitted 29 May, 2019;
originally announced May 2019.
-
ROI Maximization in Stochastic Online Decision-Making
Authors:
Nicolò Cesa-Bianchi,
Tommaso Cesari,
Yishay Mansour,
Vianney Perchet
Abstract:
We introduce a novel theoretical framework for Return On Investment (ROI) maximization in repeated decision-making. Our setting is motivated by the use case of companies that regularly receive proposals for technological innovations and want to quickly decide whether they are worth implementing. We design an algorithm for learning ROI-maximizing decision-making policies over a sequence of innovation proposals. Our algorithm provably converges to an optimal policy in class $Π$ at a rate of order $\min\big\{1/(NΔ^2),N^{-1/3}\big\}$, where $N$ is the number of innovations and $Δ$ is the suboptimality gap in $Π$. A significant hurdle of our formulation, which sets it aside from other online learning problems such as bandits, is that running a policy does not provide an unbiased estimate of its performance.
Submitted 22 December, 2021; v1 submitted 28 May, 2019;
originally announced May 2019.
-
Utility/Privacy Trade-off through the lens of Optimal Transport
Authors:
Etienne Boursier,
Vianney Perchet
Abstract:
Strategic information is valuable either by remaining private (for instance if it is sensitive) or, on the other hand, by being used publicly to increase some utility. These two objectives are antagonistic and leaking this information might be more rewarding than concealing it. Unlike classical solutions that focus on the first point, we consider instead agents that optimize a natural trade-off between both objectives. We formalize this as an optimization problem where the objective mapping is regularized by the amount of information revealed to the adversary (measured as a divergence between the prior and posterior on the private knowledge). Quite surprisingly, when combined with the entropic regularization, the Sinkhorn loss naturally emerges in the optimization objective, making it efficiently solvable. We apply these techniques to preserve some privacy in online repeated auctions.
Submitted 2 March, 2020; v1 submitted 27 May, 2019;
originally announced May 2019.
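For readers unfamiliar with why a Sinkhorn-type objective is efficiently solvable, the sketch below shows the standard Sinkhorn iterations for entropy-regularized optimal transport between two discrete distributions. It illustrates the generic numerical routine only, not the privacy objective of the paper; the function name, the toy cost matrix, and the regularization value are assumptions made for the example.

```python
import math


def sinkhorn_plan(a, b, cost, reg, iterations=500):
    """Approximate the entropic optimal transport plan between histograms a and b."""
    n, m = len(a), len(b)
    # Gibbs kernel associated with the cost matrix.
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


a = [0.5, 0.5]
b = [0.25, 0.75]
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn_plan(a, b, cost, reg=0.5)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(2)) for j in range(2)]
```

Each iteration only needs matrix-vector products with the kernel, so the cost per iteration is proportional to the number of entries in the cost matrix.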
-
Learning to bid in revenue-maximizing auctions
Authors:
Thomas Nedelec,
Noureddine El Karoui,
Vianney Perchet
Abstract:
We consider the problem of optimizing bidding strategies in prior-dependent revenue-maximizing auctions, when the seller fixes the reserve prices based on the bid distributions. Our study is done in the setting where one bidder is strategic. Using a variational approach, we study the complexity of the original objective and we introduce a relaxation of the objective functional in order to use gradient descent methods. Our approach is simple, general and can be applied to various value distributions and revenue-maximizing mechanisms. The new strategies we derive yield massive uplifts compared to the traditional truthful bidding strategy.
Submitted 14 May, 2019; v1 submitted 27 February, 2019;
originally announced February 2019.
-
An adaptive stochastic optimization algorithm for resource allocation
Authors:
Xavier Fontaine,
Shie Mannor,
Vianney Perchet
Abstract:
We consider the classical problem of sequential resource allocation where a decision maker must repeatedly divide a budget between several resources, each with diminishing returns. This can be recast as a specific stochastic optimization problem where the objective is to maximize the cumulative reward, or equivalently to minimize the regret. We construct an algorithm that is {\em adaptive} to the complexity of the problem, expressed in terms of the regularity of the returns of the resources, measured by the exponent in the Łojasiewicz inequality (or by their universal concavity parameter). Our parameter-independent algorithm recovers the optimal rates for strongly-concave functions and the classical fast rates of multi-armed bandits (for linear reward functions). Moreover, the algorithm improves existing results on stochastic optimization in this regret minimization setting for intermediate cases.
Submitted 16 January, 2020; v1 submitted 12 February, 2019;
originally announced February 2019.