-
The Stochastic Dynamic Post-Disaster Inventory Allocation Problem with Trucks and UAVs
Authors:
Robert van Steenbergen,
Wouter van Heeswijk,
Martijn Mes
Abstract:
Humanitarian logistics operations face increasing difficulties due to rising demands for aid in disaster areas. This paper investigates the dynamic allocation of scarce relief supplies across multiple affected districts over time. It introduces a novel stochastic dynamic post-disaster inventory allocation problem with trucks and unmanned aerial vehicles (UAVs) delivering relief goods under uncertain supply and demand. The relevance of this humanitarian logistics problem lies in the importance of considering the inter-temporal social impact of deliveries. We achieve this by incorporating deprivation costs when allocating scarce supplies. Furthermore, we consider the inherent uncertainties of disaster areas and the potential use of cargo UAVs to enhance operational efficiency. This study proposes two anticipatory solution methods based on approximate dynamic programming, specifically decomposed linear value function approximation (DL-VFA) and neural network value function approximation (NN-VFA), to effectively manage uncertainties in the dynamic allocation process. We compare DL-VFA and NN-VFA with various state-of-the-art methods (exact re-optimization, PPO), and results show a 6-8% improvement compared to the best benchmarks. NN-VFA provides the best performance and captures nonlinearities in the problem, whereas DL-VFA shows excellent scalability at the cost of a minor performance loss. The experiments reveal that consideration of deprivation costs results in improved allocation of scarce supplies both across affected districts and over time. Finally, results show that deploying UAVs can play a crucial role in the allocation of relief goods, especially in the first stages after a disaster. The use of UAVs reduces combined transportation and deprivation costs by 16-20% and reduces maximum deprivation times by 19-40%, while maintaining similar levels of demand coverage, showcasing efficient and effective operations.
Submitted 30 November, 2023;
originally announced December 2023.
-
Dynamic Neighborhood Construction for Structured Large Discrete Action Spaces
Authors:
Fabian Akkerman,
Julius Luy,
Wouter van Heeswijk,
Maximilian Schiffer
Abstract:
Large discrete action spaces (LDAS) remain a central challenge in reinforcement learning. Existing solution approaches can handle unstructured LDAS with up to a few million actions. However, many real-world applications in logistics, production, and transportation systems have combinatorial action spaces, whose size grows well beyond millions of actions, even on small instances. Fortunately, such action spaces exhibit structure, e.g., equally spaced discrete resource units. With this work, we focus on handling structured LDAS (SLDAS) with sizes that cannot be handled by current benchmarks: we propose Dynamic Neighborhood Construction (DNC), a novel exploitation paradigm for SLDAS. We present a scalable neighborhood exploration heuristic that utilizes this paradigm and efficiently explores the discrete neighborhood around the continuous proxy action in structured action spaces with up to $10^{73}$ actions. We demonstrate the performance of our method by benchmarking it against three state-of-the-art approaches designed for large discrete action spaces across two distinct environments. Our results show that DNC matches or outperforms state-of-the-art approaches while being computationally more efficient. Furthermore, our method scales to action spaces that so far remained computationally intractable for existing methodologies.
Submitted 27 February, 2024; v1 submitted 31 May, 2023;
originally announced May 2023.
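To make the idea of exploring a discrete neighborhood around a continuous proxy action concrete, here is a minimal Python sketch. It is not the DNC algorithm from the paper: the function name `best_neighbor`, the unit-step neighborhood, the bounds, and the toy value function are all assumptions chosen for illustration.

```python
import numpy as np

def best_neighbor(proxy, q_value, low, high):
    """Round a continuous proxy action and search its unit-step neighborhood.

    Hypothetical helper: assumes integer actions in [low, high] per dimension
    and a caller-supplied value estimate q_value(action).
    """
    base = np.clip(np.rint(proxy), low, high).astype(int)
    candidates = [base]
    for i in range(len(base)):
        for step in (-1, 1):
            cand = base.copy()
            cand[i] += step
            # Keep only neighbors that stay inside the action bounds
            if low <= cand[i] <= high:
                candidates.append(cand)
    # Return the candidate with the highest estimated value
    return max(candidates, key=q_value)

# Toy value function (assumption): prefers actions close to a target vector
target = np.array([3, 1])
best = best_neighbor(
    np.array([2.4, 1.2]),
    lambda a: -float(np.sum((a - target) ** 2)),
    low=0,
    high=5,
)
```

The number of candidates grows linearly with the action dimension here, which hints at why neighborhood search can remain tractable where full enumeration of the action space is not.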
-
Natural Policy Gradients In Reinforcement Learning Explained
Authors:
W. J. A. van Heeswijk
Abstract:
Traditional policy gradient methods are fundamentally flawed. Natural gradients converge quicker and better, forming the foundation of contemporary Reinforcement Learning such as Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO). This lecture note aims to clarify the intuition behind natural policy gradients, focusing on the thought process and the key mathematical constructs.
Submitted 5 September, 2022;
originally announced September 2022.
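As a rough illustration of the difference between a vanilla and a natural policy gradient, the sketch below computes both for a softmax policy over a handful of actions. It is not taken from the lecture note: the one-step (bandit) setting, the function names, and the damping term are assumptions. The natural gradient is obtained by preconditioning the vanilla gradient with the inverse Fisher information matrix.

```python
import numpy as np

def softmax(theta):
    z = theta - theta.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def policy_gradients(theta, rewards, damping=1e-3):
    """Return (vanilla, natural) gradients of expected reward.

    Assumes a one-step problem with a softmax policy pi = softmax(theta)
    and known per-action rewards; damping is an assumption needed because
    the softmax Fisher matrix is singular.
    """
    n = len(theta)
    pi = softmax(theta)
    # Score function: row a holds grad_theta log pi(a) = e_a - pi
    scores = np.eye(n) - pi
    # Vanilla gradient: E_a[ r(a) * grad log pi(a) ]
    vanilla = (pi * rewards) @ scores
    # Fisher information matrix: E_a[ score(a) score(a)^T ]
    fisher = (scores.T * pi) @ scores
    # Natural gradient: solve (F + damping * I) x = vanilla
    natural = np.linalg.solve(fisher + damping * np.eye(n), vanilla)
    return vanilla, natural

rewards = np.array([1.0, 2.0, 4.0])
vanilla, natural = policy_gradients(np.zeros(3), rewards)
```

In this toy setting the natural gradient is approximately the vector of centered rewards, independent of how the current policy weights the actions, whereas the vanilla gradient scales each component by the action probability.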
-
Strategic bidding in freight transport using deep reinforcement learning
Authors:
Wouter van Heeswijk
Abstract:
This paper presents a multi-agent reinforcement learning algorithm to represent strategic bidding behavior in freight transport markets. Using this algorithm, we investigate whether feasible market equilibria arise without any central control or communication between agents. Studying behavior in such environments may serve as a stepping stone towards self-organizing logistics systems like the Physical Internet. We model an agent-based environment in which a shipper and a carrier actively learn bidding strategies using policy gradient methods, posing bid and ask prices at the individual container level. Both agents aim to learn the best response given the expected behavior of the opposing agent. A neutral broker allocates jobs based on bid-ask spreads.
Our game-theoretical analysis and numerical experiments focus on behavioral insights. To evaluate system performance, we measure adherence to Nash equilibria, fairness of reward division and utilization of transport capacity. We observe good performance both in predictable, deterministic settings (~95% adherence to Nash equilibria) and highly stochastic environments (~85% adherence). Risk-seeking behavior may increase an agent's reward share, as long as the strategies are not overly aggressive. The results suggest a potential for full automation and decentralization of freight transport markets.
Submitted 18 February, 2021;
originally announced February 2021.
-
A Gentle Lecture Note on Filtrations in Reinforcement Learning
Authors:
W. J. A. van Heeswijk
Abstract:
This note aims to provide a basic intuition on the concept of filtrations as used in the context of reinforcement learning (RL). Filtrations are often used to formally define RL problems, yet their implications might not be evident for those without a background in measure theory. Essentially, a filtration is a construct that captures partial knowledge up to time $t$, without revealing any future information that has already been simulated but not yet revealed to the decision-maker. We illustrate this with simple examples from the finance domain on both discrete and continuous outcome spaces. Furthermore, we show that the notion of filtration is not needed, as basing decisions solely on the current problem state (which is possible due to the Markovian property) suffices to eliminate future knowledge from the decision-making process.
Submitted 6 August, 2020;
originally announced August 2020.
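The notion of "partial knowledge up to time $t$" can be sketched on a small discrete outcome space. The example below is an illustration, not taken from the note: it assumes a three-period up/down price path and represents the information at each time as a partition of the sample space (on a finite space, each partition generates the corresponding sigma-algebra). The helper name `partition_at` is hypothetical.

```python
from itertools import product

def partition_at(outcomes, t):
    """Group outcomes that cannot be told apart after observing t moves."""
    blocks = {}
    for omega in outcomes:
        # Outcomes sharing the same first t moves fall into the same block
        blocks.setdefault(omega[:t], []).append(omega)
    return list(blocks.values())

# Sample space (assumption): all up/down price paths over 3 periods
outcomes = list(product("UD", repeat=3))

# One partition per time step t = 0, 1, 2, 3; each refines the previous one
filtration = [partition_at(outcomes, t) for t in range(4)]
block_counts = [len(partition) for partition in filtration]
```

At $t = 0$ all paths sit in a single block (nothing is known), and at $t = 3$ every path is its own block (everything is revealed). A decision at time $t$ may only depend on which block the true path lies in, never on which path within the block.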
-
Smart Containers With Bidding Capacity: A Policy Gradient Algorithm for Semi-Cooperative Learning
Authors:
Wouter van Heeswijk
Abstract:
Smart modular freight containers -- as propagated in the Physical Internet paradigm -- are equipped with sensors, data storage capability and intelligence that enable them to route themselves from origin to destination without manual intervention or central governance. In this self-organizing setting, containers can autonomously place bids on transport services in a spot market. However, for individual containers it may be difficult to learn good bidding policies due to limited observations. By sharing information and costs with one another, smart containers can jointly learn bidding policies, even while competing for the same transport capacity. We replicate this behavior by learning stochastic bidding policies in a semi-cooperative multi-agent setting. To this end, we develop a reinforcement learning algorithm based on the policy gradient framework. Numerical experiments show that sharing solely bids and acceptance decisions leads to stable bidding policies. Additional system information only marginally improves performance; individual job properties suffice to place appropriate bids. Furthermore, we find that carriers may have incentives not to share information with the smart containers. The experiments give rise to several directions for follow-up research, in particular the interaction between smart containers and transport services in self-organizing logistics.
Submitted 1 May, 2020;
originally announced May 2020.
-
Donald Duck Holiday Game: A numerical analysis of a Game of the Goose role-playing variant
Authors:
W. J. A. van Heeswijk
Abstract:
The 1996 Donald Duck Holiday Game is a role-playing variant of the historical Game of the Goose, involving characters with unique attributes, event squares, and random event cards. The objective of the game is to reach the camping before any other player does. We develop a Monte Carlo simulation model that automatically plays the game and enables analyzing its key characteristics. We assess the game on various metrics relevant to its playability. Numerical analysis shows that, on average, the game takes between 69 and 123 rounds to complete, depending on the number of players. However, durations over one hour (translated to human play time) occur in over 25% of the games, which might reduce the quality of the gaming experience. Furthermore, we show that two characters are about 30% more likely to win than the other three, primarily due to being exposed to fewer random events. We argue that the richer narrative of role-playing games may extend the duration for which the game remains enjoyable, such that the metrics cannot directly be compared to those of the traditional Game of the Goose. Based on our analysis, we provide several suggestions to improve the game balance with only slight modifications. In a broader sense, we demonstrate that a basic Monte Carlo simulation suffices to analyze Game of the Goose role-playing variants, verify how they score on criteria that contribute to an enjoyable game, and detect possible anomalies.
Submitted 13 January, 2020;
originally announced January 2020.
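A Monte Carlo analysis of this kind can be sketched in a few lines of Python. The code below does not reproduce the paper's model: it simulates a stripped-down race game with no characters, event squares, or event cards, and the board length, dice rule, and function names are assumptions made for illustration.

```python
import random

def play_game(n_players, board_length=63, rng=None):
    """Play one simplified game; return (rounds played, index of the winner)."""
    rng = rng or random.Random()
    positions = [0] * n_players
    rounds = 0
    while True:
        rounds += 1
        for player in range(n_players):
            # Assumed rule: move forward by the sum of two six-sided dice
            positions[player] += rng.randint(1, 6) + rng.randint(1, 6)
            if positions[player] >= board_length:
                return rounds, player

def simulate(n_games, n_players, seed=0):
    """Estimate mean game length and per-player win share over many games."""
    rng = random.Random(seed)
    results = [play_game(n_players, rng=rng) for _ in range(n_games)]
    mean_rounds = sum(r for r, _ in results) / n_games
    win_share = [
        sum(1 for _, winner in results if winner == p) / n_games
        for p in range(n_players)
    ]
    return mean_rounds, win_share

mean_rounds, win_share = simulate(n_games=200, n_players=3)
```

Extending `play_game` with character attributes and random events would be the natural next step for studying balance questions such as unequal win probabilities, at the cost of encoding the actual game rules.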
-
Approximate Dynamic Programming with Neural Networks in Linear Discrete Action Spaces
Authors:
Wouter van Heeswijk,
Han La Poutré
Abstract:
Real-world problems of operations research are typically high-dimensional and combinatorial. Linear programs are generally used to formulate and efficiently solve these large decision problems. However, in multi-period decision problems, we must often compute expected downstream values corresponding to current decisions. When applying stochastic methods to approximate these values, linear programs become restrictive for designing value function approximations (VFAs). In particular, the manual design of a polynomial VFA is challenging.
This paper presents an integrated approach for complex optimization problems, focusing on applications in the domain of operations research. It develops a hybrid solution method that combines linear programming and neural networks as part of approximate dynamic programming.
Our proposed solution method embeds neural network VFAs into linear decision problems, combining the nonlinear expressive power of neural networks with the efficiency of solving linear programs. As a proof of concept, we perform numerical experiments on a transportation problem. The neural network VFAs consistently outperform polynomial VFAs, with limited design and tuning effort.
Submitted 26 February, 2019;
originally announced February 2019.