-
Markov flow policy -- deep MC
Authors:
Nitsan Soffair,
Gilad Katz
Abstract:
Discounted algorithms often encounter evaluation errors due to their reliance on short-term estimations, which can impede their efficacy in addressing simple, short-term tasks and impose undesired temporal discounts (\(γ\)). Interestingly, these algorithms are often tested without applying a discount, a phenomenon we refer as the \textit{train-test bias}. In response to these challenges, we propos…
▽ More
Discounted algorithms often encounter evaluation errors due to their reliance on short-term estimations, which can impede their efficacy in addressing simple, short-term tasks and impose undesired temporal discounts (\(γ\)). Interestingly, these algorithms are often tested without applying a discount, a phenomenon we refer as the \textit{train-test bias}. In response to these challenges, we propose the Markov Flow Policy, which utilizes a non-negative neural network flow to enable comprehensive forward-view predictions. Through integration into the TD7 codebase and evaluation using the MuJoCo benchmark, we observe significant performance improvements, positioning MFP as a straightforward, practical, and easily implementable solution within the domain of average rewards algorithms.
△ Less
Submitted 2 June, 2024; v1 submitted 1 May, 2024;
originally announced May 2024.
-
Conservative DDPG -- Pessimistic RL without Ensemble
Authors:
Nitsan Soffair,
Shie Mannor
Abstract:
DDPG is hindered by the overestimation bias problem, wherein its $Q$-estimates tend to overstate the actual $Q$-values. Traditional solutions to this bias involve ensemble-based methods, which require significant computational resources, or complex log-policy-based approaches, which are difficult to understand and implement. In contrast, we propose a straightforward solution using a $Q$-target and…
▽ More
DDPG is hindered by the overestimation bias problem, wherein its $Q$-estimates tend to overstate the actual $Q$-values. Traditional solutions to this bias involve ensemble-based methods, which require significant computational resources, or complex log-policy-based approaches, which are difficult to understand and implement. In contrast, we propose a straightforward solution using a $Q$-target and incorporating a behavioral cloning (BC) loss penalty. This solution, acting as an uncertainty measure, can be easily implemented with minimal code and without the need for an ensemble. Our empirical findings strongly support the superiority of Conservative DDPG over DDPG across various MuJoCo and Bullet tasks. We consistently observe better performance in all evaluated tasks and even competitive or superior performance compared to TD3 and TD7, all achieved with significantly reduced computational requirements.
△ Less
Submitted 2 June, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
MinMaxMin $Q$-learning
Authors:
Nitsan Soffair,
Shie Mannor
Abstract:
MinMaxMin $Q$-learning is a novel optimistic Actor-Critic algorithm that addresses the problem of overestimation bias ($Q$-estimations are overestimating the real $Q$-values) inherent in conservative RL algorithms. Its core formula relies on the disagreement among $Q$-networks in the form of the min-batch MaxMin $Q$-networks distance which is added to the $Q$-target and used as the priority experi…
▽ More
MinMaxMin $Q$-learning is a novel optimistic Actor-Critic algorithm that addresses the problem of overestimation bias ($Q$-estimations are overestimating the real $Q$-values) inherent in conservative RL algorithms. Its core formula relies on the disagreement among $Q$-networks in the form of the min-batch MaxMin $Q$-networks distance which is added to the $Q$-target and used as the priority experience replay sampling-rule. We implement MinMaxMin on top of TD3 and TD7, subjecting it to rigorous testing against state-of-the-art continuous-space algorithms-DDPG, TD3, and TD7-across popular MuJoCo and Bullet environments. The results show a consistent performance improvement of MinMaxMin over DDPG, TD3, and TD7 across all tested tasks.
△ Less
Submitted 2 June, 2024; v1 submitted 3 February, 2024;
originally announced February 2024.
-
SQT -- std $Q$-target
Authors:
Nitsan Soffair,
Dotan Di-Castro,
Orly Avner,
Shie Mannor
Abstract:
Std $Q$-target is a conservative, actor-critic, ensemble, $Q$-learning-based algorithm, which is based on a single key $Q$-formula: $Q$-networks standard deviation, which is an "uncertainty penalty", and, serves as a minimalistic solution to the problem of overestimation bias. We implement SQT on top of TD3/TD7 code and test it against the state-of-the-art (SOTA) actor-critic algorithms, DDPG, TD3…
▽ More
Std $Q$-target is a conservative, actor-critic, ensemble, $Q$-learning-based algorithm, which is based on a single key $Q$-formula: $Q$-networks standard deviation, which is an "uncertainty penalty", and, serves as a minimalistic solution to the problem of overestimation bias. We implement SQT on top of TD3/TD7 code and test it against the state-of-the-art (SOTA) actor-critic algorithms, DDPG, TD3 and TD7 on seven popular MuJoCo and Bullet tasks. Our results demonstrate SQT's $Q$-target formula superiority over TD3's $Q$-target formula as a conservative solution to overestimation bias in RL, while SQT shows a clear performance advantage on a wide margin over DDPG, TD3, and TD7 on all tasks.
△ Less
Submitted 2 June, 2024; v1 submitted 3 February, 2024;
originally announced February 2024.
-
Optimizing Agent Collaboration through Heuristic Multi-Agent Planning
Authors:
Nitsan Soffair
Abstract:
The SOTA algorithms for addressing QDec-POMDP issues, QDec-FP and QDec-FPS, are unable to effectively tackle problems that involve different types of sensing agents. We propose a new algorithm that addresses this issue by requiring agents to adopt the same plan if one agent is unable to take a sensing action but the other can. Our algorithm performs significantly better than both QDec-FP and QDec-…
▽ More
The SOTA algorithms for addressing QDec-POMDP issues, QDec-FP and QDec-FPS, are unable to effectively tackle problems that involve different types of sensing agents. We propose a new algorithm that addresses this issue by requiring agents to adopt the same plan if one agent is unable to take a sensing action but the other can. Our algorithm performs significantly better than both QDec-FP and QDec-FPS in these types of situations.
△ Less
Submitted 2 June, 2024; v1 submitted 3 January, 2023;
originally announced January 2023.
-
Solving Collaborative Dec-POMDPs with Deep Reinforcement Learning Heuristics
Authors:
Nitsan Soffair
Abstract:
WQMIX, QMIX, QTRAN, and VDN are SOTA algorithms for Dec-POMDP. All of them cannot solve complex agents' cooperation domains. We give an algorithm to solve such problems. In the first stage, we solve a single-agent problem and get a policy. In the second stage, we solve the multi-agent problem with the single-agent policy. SA2MA has a clear advantage over all competitors in complex agents' cooperat…
▽ More
WQMIX, QMIX, QTRAN, and VDN are SOTA algorithms for Dec-POMDP. All of them cannot solve complex agents' cooperation domains. We give an algorithm to solve such problems. In the first stage, we solve a single-agent problem and get a policy. In the second stage, we solve the multi-agent problem with the single-agent policy. SA2MA has a clear advantage over all competitors in complex agents' cooperative domains.
△ Less
Submitted 2 June, 2024; v1 submitted 9 November, 2022;
originally announced November 2022.