Search | arXiv e-print repository

The Parametric Cost Function Approximation: A new approach for multistage stochastic programming

Abstract: The most common approaches for solving multistage stochastic programming problems in the research literature have been to either use value functions ("dynamic programming") or scenario trees ("stochastic programming") to approximate the impact of a decision now on the future. By contrast, common industry practice is to use a deterministic approximation of the future which is easier to understand and solve, but which is criticized for ignoring uncertainty. We show that a parameterized version of a deterministic optimization model can be an effective way of handling uncertainty without the complexity of either stochastic programming or dynamic programming. We present the idea of a parameterized deterministic optimization model, and in particular a deterministic lookahead model, as a powerful strategy for many complex stochastic decision problems. This approach can handle complex, high-dimensional state variables, and avoids the usual approximations associated with scenario trees or value function approximations. Instead, it introduces the offline challenge of designing and tuning the parameterization. We illustrate the idea by using a series of application settings, and demonstrate its use in a nonstationary energy storage problem with rolling forecasts.

Submitted 1 January, 2022; originally announced January 2022.

Comments: 3 figures

MSC Class: 68 ACM Class: F.2; I.2

arXiv:2004.05417

Optimal Learning for Sequential Decisions in Laboratory Experimentation

Authors: Kristopher Reyes, Warren B Powell

Abstract: The process of discovery in the physical, biological and medical sciences can be painstakingly slow. Most experiments fail, and the time from initiation of research until a new advance reaches commercial production can span 20 years. This tutorial aims to provide experimental scientists with a foundation in the science of making decisions. Using numerical examples drawn from the experiences of the authors, the article describes the fundamental elements of any experimental learning problem. It emphasizes the important role of belief models, which include not only the best estimate of relationships provided by prior research, previous experiments and scientific expertise, but also the uncertainty in these relationships. We introduce the concept of a learning policy, and review the major categories of policies. We then introduce a policy, known as the knowledge gradient, that maximizes the value of information from each experiment. We bring out the importance of reducing uncertainty, and illustrate this process for different belief models.

Submitted 13 April, 2020; v1 submitted 11 April, 2020; originally announced April 2020.
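
The knowledge gradient mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the simplest belief model, which the abstract does not specify: independent normal beliefs about each candidate experiment and a known measurement noise. The function name `knowledge_gradient` and all numbers below are hypothetical and are not taken from the tutorial.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def knowledge_gradient(means, stds, noise_std):
    """Expected one-step improvement in the best estimate from measuring each alternative.

    Assumes independent normal beliefs N(means[x], stds[x]**2), a known
    measurement noise standard deviation, and at least two alternatives.
    """
    values = []
    for x, (mu, sigma) in enumerate(zip(means, stds)):
        best_other = max(m for i, m in enumerate(means) if i != x)
        # Standard deviation of the change in the estimate after one measurement.
        sigma_tilde = sigma ** 2 / math.sqrt(sigma ** 2 + noise_std ** 2)
        if sigma_tilde == 0.0:
            values.append(0.0)  # nothing left to learn about this alternative
            continue
        zeta = -abs(mu - best_other) / sigma_tilde
        values.append(sigma_tilde * (zeta * normal_cdf(zeta) + normal_pdf(zeta)))
    return values


# Hypothetical beliefs about three candidate experiments.
means = [1.0, 1.2, 0.8]
stds = [0.5, 0.1, 1.0]
kg = knowledge_gradient(means, stds, noise_std=0.5)
next_experiment = max(range(len(kg)), key=kg.__getitem__)
```

In this made-up example the third alternative has the lowest estimate but the largest uncertainty, so it is the one selected: the policy values information, not only the current best guess.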

arXiv:2002.06238

On State Variables, Bandit Problems and POMDPs

Authors: Warren B Powell

Abstract: State variables are easily the most subtle dimension of sequential decision problems. This is especially true in the context of active learning problems ("bandit problems") where decisions affect what we observe and learn. We describe our canonical framework that models any sequential decision problem, and present our definition of state variables that allows us to claim: Any properly modeled sequential decision problem is Markovian. We then present a novel two-agent perspective of partially observable Markov decision problems (POMDPs) that allows us to then claim: Any model of a real decision problem is (possibly) non-Markovian. We illustrate these perspectives using the context of observing and treating flu in a population, and provide examples of all four classes of policies in this setting. We close with an indication of how to extend this thinking to multiagent problems.

Submitted 14 February, 2020; originally announced February 2020.

arXiv:1912.09484

Zeroth-order Stochastic Compositional Algorithms for Risk-Aware Learning

Authors: Dionysios S. Kalogerias, Warren B. Powell

Abstract: We present $\textit{Free-MESSAGE}^{p}$, the first zeroth-order algorithm for (weakly-)convex mean-semideviation-based risk-aware learning, which is also the first three-level zeroth-order compositional stochastic optimization algorithm whatsoever. Using a non-trivial extension of Nesterov's classical results on Gaussian smoothing, we develop the $\textit{Free-MESSAGE}^{p}$ algorithm from first principles, and show that it essentially solves a smoothed surrogate to the original problem, the former being a uniform approximation of the latter, in a useful, convenient sense. We then present a complete analysis of the $\textit{Free-MESSAGE}^{p}$ algorithm, which establishes convergence in a user-tunable neighborhood of the optimal solutions of the original problem for convex costs, as well as explicit convergence rates for convex, weakly convex, and strongly convex costs, and in a unified way. Orderwise, and for fixed problem parameters, our results demonstrate no sacrifice in convergence speed as compared to existing first-order methods, while striking a certain balance among the condition of the problem, its dimensionality, as well as the accuracy of the obtained results, naturally extending previous results in zeroth-order risk-neutral learning.

Submitted 13 December, 2021; v1 submitted 19 December, 2019; originally announced December 2019.

Comments: 31 pages, major revision of the first version
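
For readers unfamiliar with Gaussian smoothing, the sketch below shows the classical risk-neutral zeroth-order gradient estimator that the abstract above builds on. It is not the $\textit{Free-MESSAGE}^{p}$ algorithm, which handles a three-level compositional objective. The function name `zeroth_order_gradient`, the smoothing parameter `mu`, and the test function are assumptions chosen for illustration.

```python
import random


def zeroth_order_gradient(f, x, mu=1e-3, num_samples=1000, rng=None):
    """Estimate the gradient of f at x from function evaluations only.

    Averages (f(x + mu * u) - f(x)) / mu * u over Gaussian directions u,
    which is an unbiased estimate of the gradient of the smoothed surrogate
    f_mu(x) = E[f(x + mu * u)], with u drawn from N(0, I).
    """
    rng = rng or random.Random(0)
    d = len(x)
    fx = f(x)
    grad = [0.0] * d
    for _ in range(num_samples):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        shifted = [xi + mu * ui for xi, ui in zip(x, u)]
        scale = (f(shifted) - fx) / mu  # finite difference along direction u
        for i in range(d):
            grad[i] += scale * u[i]
    return [g / num_samples for g in grad]


def quadratic(x):
    """Hypothetical smooth test function with gradient 2 * x."""
    return sum(xi * xi for xi in x)


# The true gradient at (1, -2) is (2, -4); the estimate is noisy but close.
estimate = zeroth_order_gradient(quadratic, [1.0, -2.0], num_samples=20000)
```

The estimator never evaluates a derivative, which is the sense in which such methods are "zeroth-order"; the price is sampling noise that shrinks only as the number of random directions grows.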

arXiv:1912.03513

From Reinforcement Learning to Optimal Control: A unified framework for sequential decisions

Authors: Warren B Powell

Abstract: There are over 15 distinct communities that work in the general area of sequential decisions and information, often referred to as decisions under uncertainty or stochastic optimization. We focus on two of the most important fields: stochastic optimal control, with its roots in deterministic optimal control, and reinforcement learning, with its roots in Markov decision processes. Building on prior work, we describe a unified framework that covers all 15 different communities, and note the strong parallels with the modeling framework of stochastic optimal control. By contrast, we make the case that the modeling framework of reinforcement learning, inherited from discrete Markov decision processes, is quite limited. Our framework (and that of stochastic control) is based on the core problem of optimizing over policies. We describe four classes of policies that we claim are universal, and show that each of these two fields has, in its own way, evolved to include examples of each of these four classes.

Submitted 18 December, 2019; v1 submitted 7 December, 2019; originally announced December 2019.

Comments: 47 pages, 6 figures

arXiv:1810.08124

Approximate Dynamic Programming for Planning a Ride-Sharing System using Autonomous Fleets of Electric Vehicles

Authors: Lina Al-Kanj, Juliana Nascimento, Warren B. Powell

Abstract: Within a decade, almost every major auto company, along with fleet operators such as Uber, has announced plans to put autonomous vehicles on the road. At the same time, electric vehicles are quickly emerging as a next-generation technology that is cost effective, in addition to offering the benefits of reducing the carbon footprint. The combination of a centrally managed fleet of driverless vehicles, along with the operating characteristics of electric vehicles, is creating a transformative new technology that offers significant cost savings with high service levels. This problem involves a dispatch problem for assigning riders to cars, a surge pricing problem for deciding on the price per trip and a planning problem for deciding on the fleet size. We use approximate dynamic programming to develop high-quality operational dispatch strategies to determine which car is best for a particular trip, when a car should be recharged, and when it should be re-positioned to a different zone which offers a higher density of trips. We prove that the value functions are monotone in the battery and time dimensions and use hierarchical aggregation to get better estimates of the value functions with a small number of observations. Then, surge pricing is discussed using an adaptive learning approach to decide on the price for each trip. Finally, we discuss the fleet size problem which depends on the previous two problems.

Submitted 11 December, 2018; v1 submitted 18 October, 2018; originally announced October 2018.

arXiv:1704.05963

Monte Carlo Tree Search with Sampled Information Relaxation Dual Bounds

Authors: Daniel R. Jiang, Lina Al-Kanj, Warren B. Powell

Abstract: Monte Carlo Tree Search (MCTS), most famously used in game-play artificial intelligence (e.g., the game of Go), is a well-known strategy for constructing approximate solutions to sequential decision problems. Its primary innovation is the use of a heuristic, known as a default policy, to obtain Monte Carlo estimates of downstream values for states in a decision tree. This information is used to iteratively expand the tree towards regions of states and actions that an optimal policy might visit. However, to guarantee convergence to the optimal action, MCTS requires the entire tree to be expanded asymptotically. In this paper, we propose a new technique called Primal-Dual MCTS that utilizes sampled information relaxation upper bounds on potential actions, creating the possibility of "ignoring" parts of the tree that stem from highly suboptimal choices. This allows us to prove that despite converging to a partial decision tree in the limit, the recommended action from Primal-Dual MCTS is optimal. The new approach shows significant promise when used to optimize the behavior of a single driver navigating a graph while operating on a ride-sharing platform. Numerical experiments on a real dataset of 7,000 trips in New Jersey suggest that Primal-Dual MCTS improves upon standard MCTS by producing deeper decision trees and exhibits a reduced sensitivity to the size of the action space.

Submitted 19 April, 2017; originally announced April 2017.

Comments: 33 pages, 6 figures

arXiv:1605.05711

The Information-Collecting Vehicle Routing Problem: Stochastic Optimization for Emergency Storm Response

Authors: Lina Al-Kanj, Warren B. Powell, Belgacem Bouzaiene-Ayari

Abstract: Utilities face the challenge of responding to power outages due to storms and ice damage, but most power grids are not equipped with sensors to pinpoint the precise location of the faults causing the outage. Instead, utilities have to depend primarily on phone calls (trouble calls) from customers who have lost power to guide the dispatching of utility trucks. In this paper, we develop a policy that routes a utility truck to restore outages in the power grid as quickly as possible, using phone calls to create beliefs about outages, but also using utility trucks as a mechanism for collecting additional information. This means that routing decisions change not only the physical state of the truck (as it moves from one location to another) and the grid (as the truck performs repairs), but also our belief about the network, creating the first stochastic vehicle routing problem that explicitly models information collection and belief modeling. We address the problem of managing a single utility truck, which we start by formulating as a sequential stochastic optimization model which captures our belief about the state of the grid. We propose a stochastic lookahead policy, and use Monte Carlo tree search (MCTS) to produce a practical policy that is asymptotically optimal. Simulation results show that the developed policy restores the power grid much faster compared to standard industry heuristics.

Submitted 18 May, 2016; originally announced May 2016.

arXiv:1509.01920

Risk-Averse Approximate Dynamic Programming with Quantile-Based Risk Measures

Authors: Daniel R. Jiang, Warren B. Powell

Abstract: In this paper, we consider a finite-horizon Markov decision process (MDP) for which the objective at each stage is to minimize a quantile-based risk measure (QBRM) of the sequence of future costs; we call the overall objective a dynamic quantile-based risk measure (DQBRM). In particular, we consider optimizing dynamic risk measures where the one-step risk measures are QBRMs, a class of risk measures that includes the popular value at risk (VaR) and the conditional value at risk (CVaR). Although there is considerable theoretical development of risk-averse MDPs in the literature, the computational challenges have not been explored as thoroughly. We propose data-driven and simulation-based approximate dynamic programming (ADP) algorithms to solve the risk-averse sequential decision problem. We address the issue of inefficient sampling for risk applications in simulated settings and present a procedure, based on importance sampling, to direct samples toward the "risky region" as the ADP algorithm progresses. Finally, we show numerical results of our algorithms in the context of an application involving risk-averse bidding for energy storage.

Submitted 8 May, 2017; v1 submitted 7 September, 2015; originally announced September 2015.

Comments: 39 pages, 7 figures
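
As a small illustration of the two quantile-based risk measures named in the abstract above, the sketch below computes empirical VaR and CVaR from a sample of costs. It follows one common convention (VaR as the empirical alpha-quantile, CVaR as the average of the costs at or above it); the paper's own definitions may differ, and the function name `empirical_var_cvar` and the sample are hypothetical.

```python
import math


def empirical_var_cvar(losses, alpha):
    """Return (VaR, CVaR) of a sample of losses at confidence level alpha.

    Convention assumed here: VaR is the smallest sample value whose
    empirical CDF is at least alpha, and CVaR is the mean of the sample
    values at or above that VaR.
    """
    if not losses or not 0.0 < alpha < 1.0:
        raise ValueError("need a non-empty sample and 0 < alpha < 1")
    ordered = sorted(losses)
    k = max(0, math.ceil(alpha * len(ordered)) - 1)  # index of the alpha-quantile
    var = ordered[k]
    tail = ordered[k:]
    return var, sum(tail) / len(tail)


# Ten equally likely costs 1, 2, ..., 10 at the 50% level.
var, cvar = empirical_var_cvar([float(v) for v in range(1, 11)], alpha=0.5)
# var is 5.0 and cvar is 7.5 (the mean of 5, 6, ..., 10)
```

Because CVaR depends only on the tail of the sample, few simulated outcomes contribute to it when alpha is close to 1, which is the sampling inefficiency the abstract addresses with importance sampling.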

arXiv:1407.2676

A New Optimal Stepsize For Approximate Dynamic Programming

Authors: Ilya O. Ryzhov, Peter I. Frazier, Warren B. Powell

Abstract: Approximate dynamic programming (ADP) has proven itself in a wide range of applications spanning large-scale transportation problems, health care, revenue management, and energy systems. The design of effective ADP algorithms has many dimensions, but one crucial factor is the stepsize rule used to update a value function approximation. Many operations research applications are computationally intensive, and it is important to obtain good results quickly. Furthermore, the most popular stepsize formulas use tunable parameters and can produce very poor results if tuned improperly. We derive a new stepsize rule that optimizes the prediction error in order to improve the short-term performance of an ADP algorithm. With only one, relatively insensitive tunable parameter, the new rule adapts to the level of noise in the problem and produces faster convergence in numerical experiments.

Submitted 13 July, 2014; v1 submitted 9 July, 2014; originally announced July 2014.

Comments: Matlab files are included with the paper source

arXiv:1401.0843

Least Squares Policy Iteration with Instrumental Variables vs. Direct Policy Search: Comparison Against Optimal Benchmarks Using Energy Storage

Authors: Warren R. Scott, Warren B. Powell, Somayeh Moazehi

Abstract: This paper studies approximate policy iteration (API) methods which use least-squares Bellman error minimization for policy evaluation. We address several of its enhancements, namely, Bellman error minimization using instrumental variables, least-squares projected Bellman error minimization, and projected Bellman error minimization using instrumental variables. We prove that for a general discrete-time stochastic control problem, Bellman error minimization using instrumental variables is equivalent to both variants of projected Bellman error minimization. An alternative to these API methods is direct policy search based on knowledge gradient. The practical performance of these three approximate dynamic programming methods is then investigated in the context of an application in energy storage, integrated with an intermittent wind energy supply to fully serve a stochastic time-varying electricity demand. We create a library of test problems using real-world data and apply value iteration to find their optimal policies. These benchmarks are then used to compare the developed policies. Our analysis indicates that API with instrumental variables Bellman error minimization prominently outperforms API with least-squares Bellman error minimization. However, these approaches underperform our direct policy search implementation.

Submitted 4 January, 2014; originally announced January 2014.

Comments: 37 pages, 9 figures

Showing 1–11 of 11 results for author: Powell, W B