Search | arXiv e-print repository

Adaptive Discounting of Training Time Attacks

Authors: Ridhima Bector, Abhay Aradhya, Chai Quek, Zinovi Rabinovich

Abstract: Among the most insidious attacks on Reinforcement Learning (RL) solutions are training-time attacks (TTAs) that create loopholes and backdoors in the learned behaviour. Not limited to a simple disruption, constructive TTAs (C-TTAs) are now available, where the attacker forces a specific, target behaviour upon a training RL agent (victim). However, even state-of-the-art C-TTAs focus on target behav… ▽ More Among the most insidious attacks on Reinforcement Learning (RL) solutions are training-time attacks (TTAs) that create loopholes and backdoors in the learned behaviour. Not limited to a simple disruption, constructive TTAs (C-TTAs) are now available, where the attacker forces a specific, target behaviour upon a training RL agent (victim). However, even state-of-the-art C-TTAs focus on target behaviours that could be naturally adopted by the victim if not for a particular feature of the environment dynamics, which C-TTAs exploit. In this work, we show that a C-TTA is possible even when the target behaviour is un-adoptable due to both environment dynamics as well as non-optimality with respect to the victim objective(s). To find efficient attacks in this context, we develop a specialised flavour of the DDPG algorithm, which we term gammaDDPG, that learns this stronger version of C-TTA. gammaDDPG dynamically alters the attack policy planning horizon based on the victim's current behaviour. This improves effort distribution throughout the attack timeline and reduces the effect of uncertainty the attacker has about the victim. To demonstrate the features of our method and better relate the results to prior research, we borrow a 3D grid domain from a state-of-the-art C-TTA for our experiments. Code is available at "bit.ly/github-rb-gDDPG". △ Less

Submitted 5 January, 2024; originally announced January 2024.

Comments: 19 pages, 7 figures

arXiv:2304.12151 [pdf, other]

Policy Resilience to Environment Poisoning Attacks on Reinforcement Learning

Authors: Hang Xu, Xinghua Qu, Zinovi Rabinovich

Abstract: This paper investigates policy resilience to training-environment poisoning attacks on reinforcement learning (RL) policies, with the goal of recovering the deployment performance of a poisoned RL policy. Due to the fact that the policy resilience is an add-on concern to RL algorithms, it should be resource-efficient, time-conserving, and widely applicable without compromising the performance of R… ▽ More This paper investigates policy resilience to training-environment poisoning attacks on reinforcement learning (RL) policies, with the goal of recovering the deployment performance of a poisoned RL policy. Due to the fact that the policy resilience is an add-on concern to RL algorithms, it should be resource-efficient, time-conserving, and widely applicable without compromising the performance of RL algorithms. This paper proposes such a policy-resilience mechanism based on an idea of knowledge sharing. We summarize the policy resilience as three stages: preparation, diagnosis, recovery. Specifically, we design the mechanism as a federated architecture coupled with a meta-learning manner, pursuing an efficient extraction and sharing of the environment knowledge. With the shared knowledge, a poisoned agent can quickly identify the deployment condition and accordingly recover its policy performance. We empirically evaluate the resilience mechanism for both model-based and model-free RL algorithms, showing its effectiveness and efficiency in restoring the deployment performance of a poisoned policy. △ Less

Submitted 24 April, 2023; originally announced April 2023.

arXiv:2302.03429 [pdf, other]

Towards Skilled Population Curriculum for Multi-Agent Reinforcement Learning

Authors: Rundong Wang, Longtao Zheng, Wei Qiu, Bowei He, Bo An, Zinovi Rabinovich, Yu**g Hu, Yingfeng Chen, Tangjie Lv, Changjie Fan

Abstract: Recent advances in multi-agent reinforcement learning (MARL) allow agents to coordinate their behaviors in complex environments. However, common MARL algorithms still suffer from scalability and sparse reward issues. One promising approach to resolving them is automatic curriculum learning (ACL). ACL involves a student (curriculum learner) training on tasks of increasing difficulty controlled by a… ▽ More Recent advances in multi-agent reinforcement learning (MARL) allow agents to coordinate their behaviors in complex environments. However, common MARL algorithms still suffer from scalability and sparse reward issues. One promising approach to resolving them is automatic curriculum learning (ACL). ACL involves a student (curriculum learner) training on tasks of increasing difficulty controlled by a teacher (curriculum generator). Despite its success, ACL's applicability is limited by (1) the lack of a general student framework for dealing with the varying number of agents across tasks and the sparse reward problem, and (2) the non-stationarity of the teacher's task due to ever-changing student strategies. As a remedy for ACL, we introduce a novel automatic curriculum learning framework, Skilled Population Curriculum (SPC), which adapts curriculum learning to multi-agent coordination. Specifically, we endow the student with population-invariant communication and a hierarchical skill set, allowing it to learn cooperation and behavior skills from distinct tasks with varying numbers of agents. In addition, we model the teacher as a contextual bandit conditioned by student policies, enabling a team of agents to change its size while still retaining previously acquired skills. We also analyze the inherent non-stationarity of this multi-agent automatic curriculum teaching problem and provide a corresponding regret bound. Empirical results show that our method improves the performance, scalability and sample efficiency in several MARL environments. △ Less

Submitted 7 February, 2023; originally announced February 2023.

arXiv:2205.13718 [pdf, other]

Off-Beat Multi-Agent Reinforcement Learning

Authors: Wei Qiu, Weixun Wang, Rundong Wang, Bo An, Yu**g Hu, Svetlana Obraztsova, Zinovi Rabinovich, Jianye Hao, Yingfeng Chen, Changjie Fan

Abstract: We investigate model-free multi-agent reinforcement learning (MARL) in environments where off-beat actions are prevalent, i.e., all actions have pre-set execution durations. During execution durations, the environment changes are influenced by, but not synchronised with, action execution. Such a setting is ubiquitous in many real-world problems. However, most MARL methods assume actions are execut… ▽ More We investigate model-free multi-agent reinforcement learning (MARL) in environments where off-beat actions are prevalent, i.e., all actions have pre-set execution durations. During execution durations, the environment changes are influenced by, but not synchronised with, action execution. Such a setting is ubiquitous in many real-world problems. However, most MARL methods assume actions are executed immediately after inference, which is often unrealistic and can lead to catastrophic failure for multi-agent coordination with off-beat actions. In order to fill this gap, we develop an algorithmic framework for MARL with off-beat actions. We then propose a novel episodic memory, LeGEM, for model-free MARL algorithms. LeGEM builds agents' episodic memories by utilizing agents' individual experiences. It boosts multi-agent learning by addressing the challenging temporal credit assignment problem raised by the off-beat actions via our novel reward redistribution scheme, alleviating the issue of non-Markovian reward. We evaluate LeGEM on various multi-agent scenarios with off-beat actions, including Stag-Hunter Game, Quarry Game, Afforestation Game, and StarCraft II micromanagement tasks. Empirical results show that LeGEM significantly boosts multi-agent coordination and achieves leading performance and improved sample efficiency. △ Less

Submitted 18 June, 2022; v1 submitted 26 May, 2022; originally announced May 2022.

Comments: Fix typos

arXiv:2108.03803 [pdf, other]

Mis-spoke or mis-lead: Achieving Robustness in Multi-Agent Communicative Reinforcement Learning

Authors: Wanqi Xue, Wei Qiu, Bo An, Zinovi Rabinovich, Svetlana Obraztsova, Chai Kiat Yeo

Abstract: Recent studies in multi-agent communicative reinforcement learning (MACRL) have demonstrated that multi-agent coordination can be greatly improved by allowing communication between agents. Meanwhile, adversarial machine learning (ML) has shown that ML models are vulnerable to attacks. Despite the increasing concern about the robustness of ML algorithms, how to achieve robust communication in multi… ▽ More Recent studies in multi-agent communicative reinforcement learning (MACRL) have demonstrated that multi-agent coordination can be greatly improved by allowing communication between agents. Meanwhile, adversarial machine learning (ML) has shown that ML models are vulnerable to attacks. Despite the increasing concern about the robustness of ML algorithms, how to achieve robust communication in multi-agent reinforcement learning has been largely neglected. In this paper, we systematically explore the problem of adversarial communication in MACRL. Our main contributions are threefold. First, we propose an effective method to perform attacks in MACRL, by learning a model to generate optimal malicious messages. Second, we develop a defence method based on message reconstruction, to maintain multi-agent coordination under message attacks. Third, we formulate the adversarial communication problem as a two-player zero-sum game and propose a game-theoretical method R-MACRL to improve the worst-case defending performance. Empirical results demonstrate that many state-of-the-art MACRL methods are vulnerable to message attacks, and our method can significantly improve their robustness. △ Less

Submitted 26 January, 2022; v1 submitted 9 August, 2021; originally announced August 2021.

Comments: Published as a conference paper in AAMAS 2022

arXiv:2102.08159 [pdf, other]

RMIX: Learning Risk-Sensitive Policies for Cooperative Reinforcement Learning Agents

Authors: Wei Qiu, Xinrun Wang, Runsheng Yu, Xu He, Rundong Wang, Bo An, Svetlana Obraztsova, Zinovi Rabinovich

Abstract: Current value-based multi-agent reinforcement learning methods optimize individual Q values to guide individuals' behaviours via centralized training with decentralized execution (CTDE). However, such expected, i.e., risk-neutral, Q value is not sufficient even with CTDE due to the randomness of rewards and the uncertainty in environments, which causes the failure of these methods to train coordin… ▽ More Current value-based multi-agent reinforcement learning methods optimize individual Q values to guide individuals' behaviours via centralized training with decentralized execution (CTDE). However, such expected, i.e., risk-neutral, Q value is not sufficient even with CTDE due to the randomness of rewards and the uncertainty in environments, which causes the failure of these methods to train coordinating agents in complex environments. To address these issues, we propose RMIX, a novel cooperative MARL method with the Conditional Value at Risk (CVaR) measure over the learned distributions of individuals' Q values. Specifically, we first learn the return distributions of individuals to analytically calculate CVaR for decentralized execution. Then, to handle the temporal nature of the stochastic outcomes during executions, we propose a dynamic risk level predictor for risk level tuning. Finally, we optimize the CVaR policies with CVaR values used to estimate the target in TD error during centralized training and the CVaR values are used as auxiliary local rewards to update the local distribution via Quantile Regression loss. Empirically, we show that our method significantly outperforms state-of-the-art methods on challenging StarCraft II tasks, demonstrating enhanced coordination and improved sample efficiency. △ Less

Submitted 22 March, 2021; v1 submitted 16 February, 2021; originally announced February 2021.

Comments: ICLR 2021 submission version: https://openreview.net/forum?id=1EVb8XRBDNr

arXiv:1911.12472 [pdf, other]

Manipulating Elections by Selecting Issues

Authors: Jasper Lu, David Kai Zhang, Zinovi Rabinovich, Svetlana Obraztsova, Yevgeniy Vorobeychik

Abstract: Constructive election control considers the problem of an adversary who seeks to sway the outcome of an electoral process in order to ensure that their favored candidate wins. We consider the computational problem of constructive election control via issue selection. In this problem, a party decides which political issues to focus on to ensure victory for the favored candidate. We also consider a… ▽ More Constructive election control considers the problem of an adversary who seeks to sway the outcome of an electoral process in order to ensure that their favored candidate wins. We consider the computational problem of constructive election control via issue selection. In this problem, a party decides which political issues to focus on to ensure victory for the favored candidate. We also consider a variation in which the goal is to maximize the number of voters supporting the favored candidate. We present strong negative results, showing, for example, that the latter problem is inapproximable for any constant factor. On the positive side, we show that when issues are binary, the problem becomes tractable in several cases, and admits a 2-approximation in the two-candidate case. Finally, we develop integer programming and heuristic methods for these problems. △ Less

Submitted 27 November, 2019; originally announced November 2019.

Comments: Published at AAMAS 2019

arXiv:1911.06992 [pdf, other]

Learning Efficient Multi-agent Communication: An Information Bottleneck Approach

Authors: Rundong Wang, Xu He, Runsheng Yu, Wei Qiu, Bo An, Zinovi Rabinovich

Abstract: We consider the problem of the limited-bandwidth communication for multi-agent reinforcement learning, where agents cooperate with the assistance of a communication protocol and a scheduler. The protocol and scheduler jointly determine which agent is communicating what message and to whom. Under the limited bandwidth constraint, a communication protocol is required to generate informative messages… ▽ More We consider the problem of the limited-bandwidth communication for multi-agent reinforcement learning, where agents cooperate with the assistance of a communication protocol and a scheduler. The protocol and scheduler jointly determine which agent is communicating what message and to whom. Under the limited bandwidth constraint, a communication protocol is required to generate informative messages. Meanwhile, an unnecessary communication connection should not be established because it occupies limited resources in vain. In this paper, we develop an Informative Multi-Agent Communication (IMAC) method to learn efficient communication protocols as well as scheduling. First, from the perspective of communication theory, we prove that the limited bandwidth constraint requires low-entropy messages throughout the transmission. Then inspired by the information bottleneck principle, we learn a valuable and compact communication protocol and a weight-based scheduler. To demonstrate the efficiency of our method, we conduct extensive experiments in various cooperative and competitive multi-agent tasks with different numbers of agents and different bandwidths. We show that IMAC converges faster and leads to efficient communication among agents under the limited bandwidth as compared to many baseline methods. △ Less

Submitted 23 June, 2020; v1 submitted 16 November, 2019; originally announced November 2019.

Comments: ICML 2020

arXiv:1906.07071 [pdf, ps, other]

Protecting Elections by Recounting Ballots

Authors: Edith Elkind, Jiarui Gan, Svetlana Obraztsova, Zinovi Rabinovich, Alexandros A. Voudouris

Abstract: Complexity of voting manipulation is a prominent topic in computational social choice. In this work, we consider a two-stage voting manipulation scenario. First, a malicious party (an attacker) attempts to manipulate the election outcome in favor of a preferred candidate by changing the vote counts in some of the voting districts. Afterwards, another party (a defender), which cares about the voter… ▽ More Complexity of voting manipulation is a prominent topic in computational social choice. In this work, we consider a two-stage voting manipulation scenario. First, a malicious party (an attacker) attempts to manipulate the election outcome in favor of a preferred candidate by changing the vote counts in some of the voting districts. Afterwards, another party (a defender), which cares about the voters' wishes, demands a recount in a subset of the manipulated districts, restoring their vote counts to their original values. We investigate the resulting Stackelberg game for the case where votes are aggregated using two variants of the Plurality rule, and obtain an almost complete picture of the complexity landscape, both from the attacker's and from the defender's perspective. △ Less

Submitted 17 June, 2019; originally announced June 2019.

arXiv:1905.13275 [pdf, ps, other]

doi 10.5555/3398761.3398823

New Algorithms for Functional Distributed Constraint Optimization Problems

Authors: Khoi D. Hoang, William Yeoh, Makoto Yokoo, Zinovi Rabinovich

Abstract: The Distributed Constraint Optimization Problem (DCOP) formulation is a powerful tool to model multi-agent coordination problems that are distributed by nature. The formulation is suitable for problems where variables are discrete and constraint utilities are represented in tabular form. However, many real-world applications have variables that are continuous and tabular forms thus cannot accurate… ▽ More The Distributed Constraint Optimization Problem (DCOP) formulation is a powerful tool to model multi-agent coordination problems that are distributed by nature. The formulation is suitable for problems where variables are discrete and constraint utilities are represented in tabular form. However, many real-world applications have variables that are continuous and tabular forms thus cannot accurately represent constraint utilities. To overcome this limitation, researchers have proposed the Functional DCOP (F-DCOP) model, which are DCOPs with continuous variables. But existing approaches usually come with some restrictions on the form of constraint utilities and are without quality guarantees. Therefore, in this paper, we (i) propose exact algorithms to solve a specific subclass of F-DCOPs; (ii) propose approximation methods with quality guarantees to solve general F-DCOPs; and (iii) empirically show that our algorithms outperform existing state-of-the-art F-DCOP algorithms on randomly generated instances when given the same communication limitations. △ Less

Submitted 30 May, 2019; originally announced May 2019.

Journal ref: Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, 2020

arXiv:1905.07173 [pdf, other]

doi 10.1007/s10458-020-09490-7

Reaching Consensus Under a Deadline

Authors: Marina Bannikova, Lihi Dery, Svetlana Obraztsova, Zinovi Rabinovich, Jeffrey S. Rosenschein

Abstract: Committee decisions are complicated by a deadline, e.g., the next start of a budget, or the beginning of a semester. In committee hiring decisions, it may be that if no candidate is supported by a strong majority, the default is to hire no one - an option that may cost dearly. As a result, committee members might prefer to agree on a reasonable, if not necessarily the best, candidate, to avoid unf… ▽ More Committee decisions are complicated by a deadline, e.g., the next start of a budget, or the beginning of a semester. In committee hiring decisions, it may be that if no candidate is supported by a strong majority, the default is to hire no one - an option that may cost dearly. As a result, committee members might prefer to agree on a reasonable, if not necessarily the best, candidate, to avoid unfilled positions. In this paper, we propose a model for the above scenario - Consensus Under a Deadline (CUD)- based on a time-bounded iterative voting process. We provide convergence guarantees and an analysis of the quality of the final decision. An extensive experimental study demonstrates more subtle features of CUDs, e.g., the difference between two simple types of committee member behavior, lazy vs.~proactive voters. Finally, a user study examines the differences between the behavior of rational voting bots and real voters, concluding that it may often be best to have bots play on the voters' behalf. △ Less

Submitted 26 January, 2021; v1 submitted 17 May, 2019; originally announced May 2019.

Journal ref: Autonomous Agents and Multi-Agent Systems, 35(1), 1-42 (2021)

arXiv:1905.04933 [pdf, other]

doi 10.1007/s10726-019-09637-2

Lie on the Fly: Strategic Voting in an Iterative Preference Elicitation Process

Authors: Lihi Dery, Svetlana Obraztsova, Zinovi Rabinovich, Meir Kalech

Abstract: A voting center is in charge of collecting and aggregating voter preferences. In an iterative process, the center sends comparison queries to voters, requesting them to submit their preference between two items. Voters might discuss the candidates among themselves, figuring out during the elicitation process which candidates stand a chance of winning and which do not. Consequently, strategic voter… ▽ More A voting center is in charge of collecting and aggregating voter preferences. In an iterative process, the center sends comparison queries to voters, requesting them to submit their preference between two items. Voters might discuss the candidates among themselves, figuring out during the elicitation process which candidates stand a chance of winning and which do not. Consequently, strategic voters might attempt to manipulate by deviating from their true preferences and instead submit a different response in order to attempt to maximize their profit. We provide a practical algorithm for strategic voters which computes the best manipulative vote and maximizes the voter's selfish outcome when such a vote exists. We also provide a careful voting center which is aware of the possible manipulations and avoids manipulative queries when possible. In an empirical study on four real-world domains, we show that in practice manipulation occurs in a low percentage of settings and has a low impact on the final outcome. The careful voting center reduces manipulation even further, thus allowing for a non-distorted group decision process to take place. We thus provide a core technology study of a voting process that can be adopted in opinion or information aggregation systems and in crowdsourcing applications, e.g., peer grading in Massive Open Online Courses (MOOCs). △ Less

Submitted 13 May, 2019; originally announced May 2019.

arXiv:1903.02917 [pdf, ps, other]

Imitative Follower Deception in Stackelberg Games

Authors: Jiarui Gan, Haifeng Xu, Qingyu Guo, Long Tran-Thanh, Zinovi Rabinovich, Michael Wooldridge

Abstract: Information uncertainty is one of the major challenges facing applications of game theory. In the context of Stackelberg games, various approaches have been proposed to deal with the leader's incomplete knowledge about the follower's payoffs, typically by gathering information from the leader's interaction with the follower. Unfortunately, these approaches rely crucially on the assumption that the… ▽ More Information uncertainty is one of the major challenges facing applications of game theory. In the context of Stackelberg games, various approaches have been proposed to deal with the leader's incomplete knowledge about the follower's payoffs, typically by gathering information from the leader's interaction with the follower. Unfortunately, these approaches rely crucially on the assumption that the follower will not strategically exploit this information asymmetry, i.e., the follower behaves truthfully during the interaction according to their actual payoffs. As we show in this paper, the follower may have strong incentives to deceitfully imitate the behavior of a different follower type and, in doing this, benefit significantly from inducing the leader into choosing a highly suboptimal strategy. This raises a fundamental question: how to design a leader strategy in the presence of a deceitful follower? To answer this question, we put forward a basic model of Stackelberg games with (imitative) follower deception and show that the leader is indeed able to reduce the loss due to follower deception with carefully designed policies. We then provide a systematic study of the problem of computing the optimal leader policy and draw a relatively complete picture of the complexity landscape; essentially matching positive and negative complexity results are provided for natural variants of the model. Our intractability results are in sharp contrast to the situation with no deception, where the leader's optimal strategy can be computed in polynomial time, and thus illustrate the intrinsic difficulty of handling follower deception. Through simulations we also examine the benefit of considering follower deception in randomly generated games. △ Less

Submitted 20 May, 2019; v1 submitted 7 March, 2019; originally announced March 2019.

arXiv:1504.06058 [pdf, other]

Security Games with Information Leakage: Modeling and Computation

Authors: Haifeng Xu, Albert X. Jiang, Arunesh Sinha, Zinovi Rabinovich, Shaddin Dughmi, Milind Tambe

Abstract: Most models of Stackelberg security games assume that the attacker only knows the defender's mixed strategy, but is not able to observe (even partially) the instantiated pure strategy. Such partial observation of the deployed pure strategy -- an issue we refer to as information leakage -- is a significant concern in practical applications. While previous research on patrolling games has considered… ▽ More Most models of Stackelberg security games assume that the attacker only knows the defender's mixed strategy, but is not able to observe (even partially) the instantiated pure strategy. Such partial observation of the deployed pure strategy -- an issue we refer to as information leakage -- is a significant concern in practical applications. While previous research on patrolling games has considered the attacker's real-time surveillance, our settings, therefore models and techniques, are fundamentally different. More specifically, after describing the information leakage model, we start with an LP formulation to compute the defender's optimal strategy in the presence of leakage. Perhaps surprisingly, we show that a key subproblem to solve this LP (more precisely, the defender oracle) is NP-hard even for the simplest of security game models. We then approach the problem from three possible directions: efficient algorithms for restricted cases, approximation algorithms, and heuristic algorithms for sampling that improves upon the status quo. Our experiments confirm the necessity of handling information leakage and the advantage of our algorithms. △ Less

Submitted 4 May, 2015; v1 submitted 23 April, 2015; originally announced April 2015.

Showing 1–14 of 14 results for author: Rabinovich, Z