Search | arXiv e-print repository

An Efficient Solution to s-Rectangular Robust Markov Decision Processes

Authors: Navdeep Kumar, Kfir Levy, Kaixin Wang, Shie Mannor

Abstract: We present an efficient robust value iteration for \texttt{s}-rectangular robust Markov Decision Processes (MDPs) with a time complexity comparable to standard (non-robust) MDPs which is significantly faster than any existing method. We do so by deriving the optimal robust Bellman operator in concrete forms using our $L_p$ water filling lemma. We unveil the exact form of the optimal policies, which turn out to be novel threshold policies with the probability of playing an action proportional to its advantage.

Submitted 31 January, 2023; originally announced January 2023.

Comments: arXiv admin note: substantial text overlap with arXiv:2205.14327

arXiv:2110.06267 [pdf, other]

Twice regularized MDPs and the equivalence between robustness and regularization

Authors: Esther Derman, Matthieu Geist, Shie Mannor

Abstract: Robust Markov decision processes (MDPs) aim to handle changing or partially known system dynamics. To solve them, one typically resorts to robust optimization methods. However, this significantly increases computational complexity and limits scalability in both learning and planning. On the other hand, regularized MDPs show more stability in policy learning without impairing time complexity. Yet, they generally do not encompass uncertainty in the model dynamics. In this work, we aim to learn robust MDPs using regularization. We first show that regularized MDPs are a particular instance of robust MDPs with uncertain reward. We thus establish that policy iteration on reward-robust MDPs can have the same time complexity as on regularized MDPs. We further extend this relationship to MDPs with uncertain transitions: this leads to a regularization term with an additional dependence on the value function. We finally generalize regularized MDPs to twice regularized MDPs (R${}^2$ MDPs), i.e., MDPs with $\textit{both}$ value and policy regularization. The corresponding Bellman operators enable developing policy iteration schemes with convergence and robustness guarantees. It also reduces planning and learning in robust MDPs to regularized MDPs.

Submitted 12 October, 2021; originally announced October 2021.

Comments: Accepted to NeurIPS 2021

arXiv:2102.03802 [pdf, other]

Dimension Free Generalization Bounds for Non Linear Metric Learning

Authors: Mark Kozdoba, Shie Mannor

Abstract: In this work we study generalization guarantees for the metric learning problem, where the metric is induced by a neural network type embedding of the data. Specifically, we provide uniform generalization bounds for two regimes -- the sparse regime, and a non-sparse regime which we term \emph{bounded amplification}. The sparse regime bounds correspond to situations where $\ell_1$-type norms of the parameters are small. Similarly to the situation in classification, solutions satisfying such bounds can be obtained by an appropriate regularization of the problem. On the other hand, unregularized SGD optimization of a metric learning loss typically does not produce sparse solutions. We show that despite this lack of sparsity, by relying on a different, new property of the solutions, it is still possible to provide dimension free generalization guarantees. Consequently, these bounds can explain generalization in non sparse real experimental situations. We illustrate the studied phenomena on the MNIST and 20newsgroups datasets.

Submitted 7 February, 2021; originally announced February 2021.

arXiv:2007.13232 [pdf, other]

The Pendulum Arrangement: Maximizing the Escape Time of Heterogeneous Random Walks

Authors: Asaf Cassel, Shie Mannor, Guy Tennenholtz

Abstract: We identify a fundamental phenomenon of heterogeneous one dimensional random walks: the escape (traversal) time is maximized when the heterogeneity in transition probabilities forms a pyramid-like potential barrier. This barrier corresponds to a distinct arrangement of transition probabilities, sometimes referred to as the pendulum arrangement. We reduce this problem to a sum over products, combinatorial optimization problem, proving that this unique structure always maximizes the escape time. This general property may influence studies in epidemiology, biology, and computer science to better understand escape time behavior and construct intruder-resilient networks.

Submitted 28 July, 2020; v1 submitted 26 July, 2020; originally announced July 2020.

Comments: Names ordered alphabetically

arXiv:2003.02894 [pdf, ps, other]

Distributional Robustness and Regularization in Reinforcement Learning

Authors: Esther Derman, Shie Mannor

Abstract: Distributionally Robust Optimization (DRO) has made it possible to prove the equivalence between robustness and regularization in classification and regression, thus providing an analytical reason why regularization generalizes well in statistical learning. Although DRO's extension to sequential decision-making overcomes $\textit{external uncertainty}$ through the robust Markov Decision Process (MDP) setting, the resulting formulation is hard to solve, especially on large domains. On the other hand, existing regularization methods in reinforcement learning only address $\textit{internal uncertainty}$ due to stochasticity. Our study aims to facilitate robust reinforcement learning by establishing a dual relation between robust MDPs and regularization. We introduce Wasserstein distributionally robust MDPs and prove that they hold out-of-sample performance guarantees. Then, we introduce a new regularizer for empirical value functions and show that it lower bounds the Wasserstein distributionally robust value function. We extend the result to linear value function approximation for large state spaces. Our approach provides an alternative formulation of robustness with guaranteed finite-sample performance. Moreover, it suggests using regularization as a practical tool for dealing with $\textit{external uncertainty}$ in reinforcement learning methods.

Submitted 14 July, 2020; v1 submitted 5 March, 2020; originally announced March 2020.

Comments: Accepted at the "Theoretical Foundations of Reinforcement Learning" Workshop - ICML 2020

arXiv:1909.04236 [pdf, other]

Online Planning with Lookahead Policies

Authors: Yonathan Efroni, Mohammad Ghavamzadeh, Shie Mannor

Abstract: Real Time Dynamic Programming (RTDP) is an online algorithm based on Dynamic Programming (DP) that acts by 1-step greedy planning. Unlike DP, RTDP does not require access to the entire state space, i.e., it explicitly handles the exploration. This fact makes RTDP particularly appealing when the state space is large and it is not possible to update all states simultaneously. In this work we devise a multi-step greedy RTDP algorithm, which we call $h$-RTDP, that replaces the 1-step greedy policy with a $h$-step lookahead policy. We analyze $h$-RTDP in its exact form and establish that increasing the lookahead horizon, $h$, results in an improved sample complexity, with the cost of additional computations. This is the first work that proves improved sample complexity as a result of {\em increasing} the lookahead horizon in online planning. We then analyze the performance of $h$-RTDP in three approximate settings: approximate model, approximate value updates, and approximate state representation. For these cases, we prove that the asymptotic performance of $h$-RTDP remains the same as that of a corresponding approximate DP algorithm, the best one can hope for without further assumptions on the approximation errors.

Submitted 12 October, 2020; v1 submitted 9 September, 2019; originally announced September 2019.

Comments: NeurIPS 2020

arXiv:1909.02769 [pdf, ps, other]

Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs

Authors: Lior Shani, Yonathan Efroni, Shie Mannor

Abstract: Trust region policy optimization (TRPO) is a popular and empirically successful policy search algorithm in Reinforcement Learning (RL) in which a surrogate problem, that restricts consecutive policies to be 'close' to one another, is iteratively solved. Nevertheless, TRPO has been considered a heuristic algorithm inspired by Conservative Policy Iteration (CPI). We show that the adaptive scaling mechanism used in TRPO is in fact the natural "RL version" of traditional trust-region methods from convex analysis. We first analyze TRPO in the planning setting, in which we have access to the model and the entire state space. Then, we consider sample-based TRPO and establish $\tilde O(1/\sqrt{N})$ convergence rate to the global optimum. Importantly, the adaptive scaling mechanism allows us to analyze TRPO in regularized MDPs for which we prove fast rates of $\tilde O(1/N)$, much like results in convex optimization. This is the first result in RL of better rates when regularizing the instantaneous cost or reward.

Submitted 12 December, 2019; v1 submitted 6 September, 2019; originally announced September 2019.

Comments: Published at AAAI-2020; 58 pages

arXiv:1902.04376 [pdf, ps, other]

An adaptive stochastic optimization algorithm for resource allocation

Authors: Xavier Fontaine, Shie Mannor, Vianney Perchet

Abstract: We consider the classical problem of sequential resource allocation where a decision maker must repeatedly divide a budget between several resources, each with diminishing returns. This can be recast as a specific stochastic optimization problem where the objective is to maximize the cumulative reward, or equivalently to minimize the regret. We construct an algorithm that is {\em adaptive} to the complexity of the problem, expressed in terms of the regularity of the returns of the resources, measured by the exponent in the Łojasiewicz inequality (or by their universal concavity parameter). Our parameter-independent algorithm recovers the optimal rates for strongly-concave functions and the classical fast rates of multi-armed bandit (for linear reward functions). Moreover, the algorithm improves existing results on stochastic optimization in this regret minimization setting for intermediate cases.

Submitted 16 January, 2020; v1 submitted 12 February, 2019; originally announced February 2019.

Comments: ALT2020, 45 pages, 9 figures

Journal ref: Proceedings of Machine Learning Research (PMLR), volume 117, 2020

arXiv:1809.05870 [pdf, other]

doi 10.1609/aaai.v33i01.33014098

On-Line Learning of Linear Dynamical Systems: Exponential Forgetting in Kalman Filters

Authors: Mark Kozdoba, Jakub Marecek, Tigran Tchrakian, Shie Mannor

Abstract: Kalman filter is a key tool for time-series forecasting and analysis. We show that the dependence of a prediction of Kalman filter on the past is decaying exponentially, whenever the process noise is non-degenerate. Therefore, Kalman filter may be approximated by regression on a few recent observations. Surprisingly, we also show that having some process noise is essential for the exponential decay. With no process noise, it may happen that the forecast depends on all of the past uniformly, which makes forecasting more difficult. Based on this insight, we devise an on-line algorithm for improper learning of a linear dynamical system (LDS), which considers only a few most recent observations. We use our decay results to provide the first regret bounds w.r.t. Kalman filters within learning an LDS. That is, we compare the results of our algorithm to the best, in hindsight, Kalman filter for a given signal. Also, the algorithm is practical: its per-update run-time is linear in the regression depth.

Submitted 16 September, 2018; originally announced September 2018.

Journal ref: Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019. Pages: 4098-4105

arXiv:1506.02188 [pdf, other]

Risk-Sensitive and Robust Decision-Making: a CVaR Optimization Approach

Authors: Yinlam Chow, Aviv Tamar, Shie Mannor, Marco Pavone

Abstract: In this paper we address the problem of decision making within a Markov decision process (MDP) framework where risk and modeling errors are taken into account. Our approach is to minimize a risk-sensitive conditional-value-at-risk (CVaR) objective, as opposed to a standard risk-neutral expectation. We refer to such a problem as a CVaR MDP. Our first contribution is to show that a CVaR objective, besides capturing risk sensitivity, has an alternative interpretation as expected cost under worst-case modeling errors, for a given error budget. This result, which is of independent interest, motivates CVaR MDPs as a unifying framework for risk-sensitive and robust decision making. Our second contribution is to present an approximate value-iteration algorithm for CVaR MDPs and analyze its convergence rate. To our knowledge, this is the first solution algorithm for CVaR MDPs that enjoys error guarantees. Finally, we present results from numerical experiments that corroborate our theoretical findings and show the practicality of our approach.

Submitted 6 June, 2015; originally announced June 2015.

Comments: Submitted to NIPS 15

arXiv:1402.6361 [pdf, ps, other]

Oracle-Based Robust Optimization via Online Learning

Authors: Aharon Ben-Tal, Elad Hazan, Tomer Koren, Shie Mannor

Abstract: Robust optimization is a common framework in optimization under uncertainty when the problem parameters are not known, but it is rather known that the parameters belong to some given uncertainty set. In the robust optimization framework the problem solved is a min-max problem where a solution is judged according to its performance on the worst possible realization of the parameters. In many cases, a straightforward solution of the robust optimization problem of a certain type requires solving an optimization problem of a more complicated type, and in some cases even NP-hard. For example, solving a robust conic quadratic program, such as those arising in robust SVM, ellipsoidal uncertainty leads in general to a semidefinite program. In this paper we develop a method for approximately solving a robust optimization problem using tools from online convex optimization, where in every stage a standard (non-robust) optimization program is solved. Our algorithms find an approximate robust solution using a number of calls to an oracle that solves the original (non-robust) problem that is inversely proportional to the square of the target accuracy.

Submitted 25 February, 2014; originally announced February 2014.

arXiv:1402.2043 [pdf, other]

Approachability in unknown games: Online learning meets multi-objective optimization

Authors: Shie Mannor, Vianney Perchet, Gilles Stoltz

Abstract: In the standard setting of approachability there are two players and a target set. The players play repeatedly a known vector-valued game where the first player wants to have the average vector-valued payoff converge to the target set, while the other player tries to exclude it from this set. We revisit this setting in the spirit of online learning and do not assume that the first player knows the game structure: she receives an arbitrary vector-valued reward vector at every round. She wishes to approach the smallest ("best") possible set given the observed average payoffs in hindsight. This extension of the standard setting has implications even when the original target set is not approachable and when it is not obvious which expansion of it should be approached instead. We show that it is impossible, in general, to approach the best target set in hindsight and propose achievable though ambitious alternative goals. We further propose a concrete strategy to approach these goals. Our method does not require projection onto a target set and amounts to switching between scalar regret minimization algorithms that are performed in episodes. Applications to global cost minimization and to approachability under sample path constraints are considered.

Submitted 17 June, 2016; v1 submitted 10 February, 2014; originally announced February 2014.

arXiv:1305.5399 [pdf, other]

A Primal Condition for Approachability with Partial Monitoring

Authors: Shie Mannor, Vianney Perchet, Gilles Stoltz

Abstract: In approachability with full monitoring there are two types of conditions that are known to be equivalent for convex sets: a primal and a dual condition. The primal one is of the form: a set C is approachable if and only if all containing half-spaces are approachable in the one-shot game; while the dual one is of the form: a convex set C is approachable if and only if it intersects all payoff sets of a certain form. We consider approachability in games with partial monitoring. In previous works (Perchet 2011; Mannor et al. 2011) we provided a dual characterization of approachable convex sets; we also exhibited efficient strategies in the case where C is a polytope. In this paper we provide primal conditions on a convex set to be approachable with partial monitoring. They depend on a modified reward function and lead to approachability strategies, based on modified payoff functions, that proceed by projections similarly to Blackwell's (1956) strategy; this is in contrast with previously studied strategies in this context that relied mostly on the signaling structure and aimed at estimating well the distributions of the signals received. Our results generalize classical results by Kohlberg 1975 (see also Mertens et al. 1994) and apply to games with arbitrary signaling structure as well as to arbitrary convex sets.

Submitted 23 May, 2013; originally announced May 2013.

arXiv:1301.2725 [pdf, other]

Robust High Dimensional Sparse Regression and Matching Pursuit

Authors: Yudong Chen, Constantine Caramanis, Shie Mannor

Abstract: We consider high dimensional sparse regression, and develop strategies able to deal with arbitrary -- possibly, severe or coordinated -- errors in the covariance matrix $X$. These may come from corrupted data, persistent experimental errors, or malicious respondents in surveys/recommender systems, etc. Such non-stochastic error-in-variables problems are notoriously difficult to treat, and as we demonstrate, the problem is particularly pronounced in high-dimensional settings where the primary goal is {\em support recovery} of the sparse regressor. We develop algorithms for support recovery in sparse regression, when some number $n_1$ out of $n+n_1$ total covariate/response pairs are {\it arbitrarily (possibly maliciously) corrupted}. We are interested in understanding how many outliers, $n_1$, we can tolerate, while identifying the correct support. To the best of our knowledge, neither standard outlier rejection techniques, nor recently developed robust regression algorithms (that focus only on corrupted response variables), nor recent algorithms for dealing with stochastic noise or erasures, can provide guarantees on support recovery. Perhaps surprisingly, we also show that the natural brute force algorithm that searches over all subsets of $n$ covariate/response pairs, and all subsets of possible support coordinates in order to minimize regression error, is remarkably poor, unable to correctly identify the support with even $n_1 = O(n/k)$ corrupted points, where $k$ is the sparsity.
This is true even in the basic setting we consider, where all authentic measurements and noise are independent and sub-Gaussian. In this setting, we provide a simple algorithm -- no more computationally taxing than OMP -- that gives stronger performance guarantees, recovering the support with up to $n_1 = O(n/(\sqrt{k} \log p))$ corrupted points, where $p$ is the dimension of the signal to be recovered.

Submitted 12 January, 2013; originally announced January 2013.

arXiv:1206.6404 [pdf]

Policy Gradients with Variance Related Risk Criteria

Authors: Dotan Di Castro, Aviv Tamar, Shie Mannor

Abstract: Managing risk in dynamic decision problems is of cardinal importance in many fields such as finance and process control. The most common approach to defining risk is through various variance related criteria such as the Sharpe Ratio or the standard deviation adjusted reward. It is known that optimizing many of the variance related risk criteria is NP-hard. In this paper we devise a framework for local policy gradient style algorithms for reinforcement learning for variance related criteria. Our starting point is a new formula for the variance of the cost-to-go in episodic tasks. Using this formula we develop policy gradient algorithms for criteria that involve both the expected cost and the variance of the cost. We prove the convergence of these algorithms to local minima and demonstrate their applicability in a portfolio planning problem.

Submitted 27 June, 2012; originally announced June 2012.

Comments: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012)

arXiv:1203.1072 [pdf, other]

Go Viral, or Not: Rate-Optimal Control for Resource-Constrained Branching Processes

Authors: Shie Mannor, Kuang Xu

Abstract: We propose and analyze a new class of controlled multi-type branching processes with a per-step linear resource constraint, motivated by potential applications in viral marketing and cancer treatment. We show that the optimal exponential growth rate of the population can be achieved by maintaining a fixed proportion among the species, for both deterministic and stochastic branching processes. In the special case of a two-type population and with a symmetric reward structure, the optimal proportion is obtained in closed-form. In addition to revealing structural properties of controlled branching processes, our results are intended to provide the practitioners with an easy-to-interpret benchmark for best practices, if not exact policies. As a proof of concept, the methodology is applied to the linkage structure of the 2004 US Presidential Election blogosphere, where the optimal growth rate demonstrates sizable gains over a uniform selection strategy, and to a two-compartment cell-cycle kinetics model for cancer growth, with realistic parameters, where the robust estimate for minimal treatment intensity under a worst-case growth rate is noticeably more conservative compared to that obtained using more optimistic assumptions.

Submitted 8 January, 2013; v1 submitted 5 March, 2012; originally announced March 2012.

arXiv:1109.3151 [pdf, other]

Regulation, Volatility and Efficiency in Continuous-Time Markets

Authors: Arman C. Kizilkale, Shie Mannor

Abstract: We analyze the efficiency of markets with friction, particularly power markets. We model the market as a dynamic system with $(d_t;\,t\geq 0)$ the demand process and $(s_t;\,t\geq 0)$ the supply process. Using stochastic differential equations to model the dynamics with friction, we investigate the efficiency of the market under an integrated expected undiscounted cost function solving the optimal control problem. Then, we extend the setup to a game theoretic model where multiple suppliers and consumers interact continuously by setting prices in a dynamic market with friction. We investigate the equilibrium, and analyze the efficiency of the market under an integrated expected social cost function. We provide an intriguing efficiency-volatility no-free-lunch trade-off theorem.

Submitted 14 September, 2011; originally announced September 2011.

arXiv:1105.4995 [pdf, ps, other]

Robust approachability and regret minimization in games with partial monitoring

Authors: Shie Mannor, Vianney Perchet, Gilles Stoltz

Abstract: Approachability has become a standard tool in analyzing learning algorithms in the adversarial online learning setup. We develop a variant of approachability for games where there is ambiguity in the obtained reward that belongs to a set, rather than being a single vector. Using this variant we tackle the problem of approachability in games with partial monitoring and develop simple and efficient algorithms (i.e., with constant per-step complexity) for this setup. We finally consider external regret and internal regret in repeated games with partial monitoring and derive regret-minimizing strategies based on approachability theory.

Submitted 15 February, 2012; v1 submitted 25 May, 2011; originally announced May 2011.

arXiv:math/0701419 [pdf, ps, other]

Strategies for prediction under imperfect monitoring

Authors: Gabor Lugosi, Shie Mannor, Gilles Stoltz

Abstract: We propose simple randomized strategies for sequential prediction under imperfect monitoring, that is, when the forecaster does not have access to the past outcomes but rather to a feedback signal. The proposed strategies are consistent in the sense that they achieve, asymptotically, the best possible average reward. It was Rustichini (1999) who first proved the existence of such consistent predictors. The forecasters presented here offer the first constructive proof of consistency. Moreover, the proposed algorithms are computationally efficient. We also establish upper bounds for the rates of convergence. In the case of deterministic feedback, these rates are optimal up to logarithmic terms.

Submitted 7 January, 2008; v1 submitted 15 January, 2007; originally announced January 2007.

Comments: Journal version of a COLT conference paper

MSC Class: 91A20; 62L12; 68Q32

Journal ref: Mathematics of Operations Research (2008), to appear

Showing 1–19 of 19 results for author: Mannor, S