Search | arXiv e-print repository

Markov flow policy -- deep MC

Abstract: Discounted algorithms often encounter evaluation errors due to their reliance on short-term estimations, which can impede their efficacy in addressing simple, short-term tasks and impose undesired temporal discounts ($γ$). Interestingly, these algorithms are often tested without applying a discount, a phenomenon we refer as the \textit{train-test bias}. In response to these challenges, we propos… ▽ More Discounted algorithms often encounter evaluation errors due to their reliance on short-term estimations, which can impede their efficacy in addressing simple, short-term tasks and impose undesired temporal discounts ($γ$). Interestingly, these algorithms are often tested without applying a discount, a phenomenon we refer as the \textit{train-test bias}. In response to these challenges, we propose the Markov Flow Policy, which utilizes a non-negative neural network flow to enable comprehensive forward-view predictions. Through integration into the TD7 codebase and evaluation using the MuJoCo benchmark, we observe significant performance improvements, positioning MFP as a straightforward, practical, and easily implementable solution within the domain of average rewards algorithms. △ Less

Submitted 2 June, 2024; v1 submitted 1 May, 2024; originally announced May 2024.

Comments: Paper do not ready

arXiv:2403.05732

Conservative DDPG -- Pessimistic RL without Ensemble

Authors: Nitsan Soffair, Shie Mannor

Abstract: DDPG is hindered by the overestimation bias problem, wherein its $Q$-estimates tend to overstate the actual $Q$-values. Traditional solutions to this bias involve ensemble-based methods, which require significant computational resources, or complex log-policy-based approaches, which are difficult to understand and implement. In contrast, we propose a straightforward solution using a $Q$-target and… ▽ More DDPG is hindered by the overestimation bias problem, wherein its $Q$-estimates tend to overstate the actual $Q$-values. Traditional solutions to this bias involve ensemble-based methods, which require significant computational resources, or complex log-policy-based approaches, which are difficult to understand and implement. In contrast, we propose a straightforward solution using a $Q$-target and incorporating a behavioral cloning (BC) loss penalty. This solution, acting as an uncertainty measure, can be easily implemented with minimal code and without the need for an ensemble. Our empirical findings strongly support the superiority of Conservative DDPG over DDPG across various MuJoCo and Bullet tasks. We consistently observe better performance in all evaluated tasks and even competitive or superior performance compared to TD3 and TD7, all achieved with significantly reduced computational requirements. △ Less

Submitted 2 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

Comments: Paper do not ready

arXiv:2402.05951

MinMaxMin $Q$-learning

Authors: Nitsan Soffair, Shie Mannor

Abstract: MinMaxMin $Q$-learning is a novel optimistic Actor-Critic algorithm that addresses the problem of overestimation bias ($Q$-estimations are overestimating the real $Q$-values) inherent in conservative RL algorithms. Its core formula relies on the disagreement among $Q$-networks in the form of the min-batch MaxMin $Q$-networks distance which is added to the $Q$-target and used as the priority experi… ▽ More MinMaxMin $Q$-learning is a novel optimistic Actor-Critic algorithm that addresses the problem of overestimation bias ($Q$-estimations are overestimating the real $Q$-values) inherent in conservative RL algorithms. Its core formula relies on the disagreement among $Q$-networks in the form of the min-batch MaxMin $Q$-networks distance which is added to the $Q$-target and used as the priority experience replay sampling-rule. We implement MinMaxMin on top of TD3 and TD7, subjecting it to rigorous testing against state-of-the-art continuous-space algorithms-DDPG, TD3, and TD7-across popular MuJoCo and Bullet environments. The results show a consistent performance improvement of MinMaxMin over DDPG, TD3, and TD7 across all tested tasks. △ Less

Submitted 2 June, 2024; v1 submitted 3 February, 2024; originally announced February 2024.

Comments: Paper do not ready

arXiv:2402.05950 [pdf, other]

SQT -- std $Q$-target

Authors: Nitsan Soffair, Dotan Di-Castro, Orly Avner, Shie Mannor

Abstract: Std $Q$-target is a conservative, actor-critic, ensemble, $Q$-learning-based algorithm, which is based on a single key $Q$-formula: $Q$-networks standard deviation, which is an "uncertainty penalty", and, serves as a minimalistic solution to the problem of overestimation bias. We implement SQT on top of TD3/TD7 code and test it against the state-of-the-art (SOTA) actor-critic algorithms, DDPG, TD3… ▽ More Std $Q$-target is a conservative, actor-critic, ensemble, $Q$-learning-based algorithm, which is based on a single key $Q$-formula: $Q$-networks standard deviation, which is an "uncertainty penalty", and, serves as a minimalistic solution to the problem of overestimation bias. We implement SQT on top of TD3/TD7 code and test it against the state-of-the-art (SOTA) actor-critic algorithms, DDPG, TD3 and TD7 on seven popular MuJoCo and Bullet tasks. Our results demonstrate SQT's $Q$-target formula superiority over TD3's $Q$-target formula as a conservative solution to overestimation bias in RL, while SQT shows a clear performance advantage on a wide margin over DDPG, TD3, and TD7 on all tasks. △ Less

Submitted 2 June, 2024; v1 submitted 3 February, 2024; originally announced February 2024.

arXiv:2301.01246

Optimizing Agent Collaboration through Heuristic Multi-Agent Planning

Authors: Nitsan Soffair

Abstract: The SOTA algorithms for addressing QDec-POMDP issues, QDec-FP and QDec-FPS, are unable to effectively tackle problems that involve different types of sensing agents. We propose a new algorithm that addresses this issue by requiring agents to adopt the same plan if one agent is unable to take a sensing action but the other can. Our algorithm performs significantly better than both QDec-FP and QDec-… ▽ More The SOTA algorithms for addressing QDec-POMDP issues, QDec-FP and QDec-FPS, are unable to effectively tackle problems that involve different types of sensing agents. We propose a new algorithm that addresses this issue by requiring agents to adopt the same plan if one agent is unable to take a sensing action but the other can. Our algorithm performs significantly better than both QDec-FP and QDec-FPS in these types of situations. △ Less

Submitted 2 June, 2024; v1 submitted 3 January, 2023; originally announced January 2023.

Comments: Paper do not ready

arXiv:2211.15411

Solving Collaborative Dec-POMDPs with Deep Reinforcement Learning Heuristics

Authors: Nitsan Soffair

Abstract: WQMIX, QMIX, QTRAN, and VDN are SOTA algorithms for Dec-POMDP. All of them cannot solve complex agents' cooperation domains. We give an algorithm to solve such problems. In the first stage, we solve a single-agent problem and get a policy. In the second stage, we solve the multi-agent problem with the single-agent policy. SA2MA has a clear advantage over all competitors in complex agents' cooperat… ▽ More WQMIX, QMIX, QTRAN, and VDN are SOTA algorithms for Dec-POMDP. All of them cannot solve complex agents' cooperation domains. We give an algorithm to solve such problems. In the first stage, we solve a single-agent problem and get a policy. In the second stage, we solve the multi-agent problem with the single-agent policy. SA2MA has a clear advantage over all competitors in complex agents' cooperative domains. △ Less

Submitted 2 June, 2024; v1 submitted 9 November, 2022; originally announced November 2022.

Comments: Paper do not ready

Showing 1–6 of 6 results for author: Soffair, N