Towards Principled Superhuman AI for Multiplayer Symmetric Games

Jiawei Ge* (Department of Operations Research and Financial Engineering, Princeton University), Yuanhao Wang* (Department of Computer Science, Princeton University), Wenzhe Li (Department of Electrical and Computer Engineering, Princeton University), Chi Jin (Department of Electrical and Computer Engineering, Princeton University). * Equal contribution.

Abstract

Multiplayer games, when the number of players exceeds two, present unique challenges that fundamentally distinguish them from the extensively studied two-player zero-sum games. These challenges arise from the non-uniqueness of equilibria and the risk of agents performing highly suboptimally when adopting equilibrium strategies. While a line of recent works developed learning systems that achieve human-level or even superhuman performance in popular multiplayer games such as Mahjong, Poker, and Diplomacy, two critical questions remain unaddressed: (1) What is the correct solution concept that AI agents should find? and (2) What is the general algorithmic framework that provably solves all games within this class? This paper takes the first step towards solving these unique challenges of multiplayer games by provably addressing both questions in multiplayer symmetric normal-form games. We also demonstrate that many meta-algorithms developed in prior practical systems for multiplayer games can fail to achieve even the basic goal of obtaining the agent’s equal share of the total reward.

1 Introduction

In recent years, AI has achieved remarkable success in multi-agent decision-making problems, particularly in a wide range of strategic games. These include, but are not limited to, Go (Silver et al., 2016), Mahjong (Li et al., 2020), Poker (Moravčík et al., 2017; Brown and Sandholm, 2018, 2019), Starcraft 2 (Vinyals et al., 2019), DOTA 2 (Berner et al., 2019), League of Legends (Ye et al., 2020), and Diplomacy (Gray et al., 2020; Bakhtin et al., 2022; FAIR et al., 2022). The majority of these games are two-player zero-sum games (footnote 1: games such as DOTA and League of Legends, despite involving two teams, can mostly be considered similar to two-player zero-sum games in terms of their game structures and solutions), where Nash equilibria always exist and can be computed efficiently. Nash equilibria in two-player zero-sum games are also non-exploitable: an agent employing a Nash equilibrium policy will not lose even when facing an adversarial opponent who seeks to exploit the agent’s weaknesses. Although such equilibrium strategies do not necessarily capitalize on opponents’ weaknesses or guarantee large-margin victories, human players often adopt suboptimal policies that deviate significantly from equilibria in complex games with large state spaces. Consequently, AI agents who adopt equilibrium strategies often outperform humans in practice for two-player zero-sum games.

In contrast, multiplayer games (defined here as those with more than two players) exhibit fundamentally different game structures compared to two-player zero-sum games. This distinction introduces several unique challenges. Firstly, Nash equilibria are no longer efficiently computable (Daskalakis et al., 2009; Chen and Deng, 2005). Moreover, there may exist multiple Nash equilibria with distinct values. Such non-uniqueness in equilibria raises a critical concern about the adoption of equilibrium strategies in multiplayer settings: if an AI agent adopts an equilibrium that is different from the one adopted by the other players, then collectively they are not playing any single equilibrium, which undermines the equilibrium property that dissuades agents from changing strategies as long as others maintain theirs. Finally, in multiplayer games, equilibrium strategies are no longer non-exploitable. Although the introduction of alternative equilibrium notions such as Correlated Equilibria (CE) or Coarse Correlated Equilibria (CCE) alleviates computational hardness, issues of non-uniqueness and potential exploitability remain. This leads to the first critical question:

What is the correct solution concept for an AI agent to learn in multiplayer games?

While not answering this question directly, several practical systems for multiplayer games have already achieved human-level or even superhuman performance in several popular games including Mahjong (Li et al., 2020) (4 players), Poker (Brown and Sandholm, 2019) (6 players), and Diplomacy (Bakhtin et al., 2022; FAIR et al., 2022) (7 players). These systems focus on developing algorithmic frameworks capable of learning effective strategies that excel in competitive settings, such as ladders, online gaming platforms, or tournaments against human players. Generally, most of these systems rely on a basic self-play framework, starting from scratch or from human policies acquired through behavior cloning, with or without regularizations. Since all these systems demand substantial computational power, significant human expertise, and extensive engineering efforts, the general applicability of the algorithmic frameworks developed in these studies remains to be tested in a wide range of multiplayer games beyond Mahjong, Poker, or Diplomacy. This leads to the second critical question:

What is the general algorithmic framework that can provably solve all multiplayer games within certain rich classes?

In this paper, we consider an important subclass of multiplayer games: symmetric constant-sum games, which are prevalent in games involving more than two players. Examples include previously discussed multiplayer games like Mahjong, Poker, Diplomacy, as well as a variety of board games such as Avalon (Light et al., 2023), Mafia, and Catan (footnote 2: all these examples are symmetric games up to randomization of the seating). Symmetry brings fairness among players, and a natural baseline for symmetric constant-sum games is that the AI agent should, at a minimum, secure an equal share of the total reward. For instance, in a four-player game with only one winner, the AI should aim to win at least one-fourth of the games. This paper also focuses on the formulation of normal-form games, which are not only foundational but also expressive enough to encapsulate sequential games like extensive-form games or Markov games as special cases (footnote 3: Extensive-Form Games (EFGs) or Markov Games (MGs) can be viewed as special cases of normal-form games, where each action in normal-form games corresponds to a policy in EFGs or MGs, although such representations may not always be efficient).

This paper takes an initial step towards addressing the grand challenges of general multiplayer games by tackling the two highlighted questions within the context of multiplayer symmetric normal-form games. Regarding the first question of solution concepts, we discuss the inadequacy of classical equilibrium solutions in meeting the baseline requirement of achieving an equal share in symmetric games. We further establish two important claims: (1) as long as opponents can deploy different individual policies, even playing a best response can fail to achieve an equal share; (2) there may not exist a non-exploitable strategy even when all opponents are restricted to play the same strategy (see Section 4 for details). This unfortunately leaves us with the only viable option that the agent must learn strategies that adapt to the identical strategy played by the opponents. Solution concepts that fall into this category, such as the best response to the identical strategy of the opponents, always meet the baseline requirement. While assuming that all opponents play an identical strategy appears restrictive, we argue that it has already been implicitly adopted in many modern AI systems for multiplayer games. We also prove that such an assumption approximately holds in multiplayer gaming platforms with a large player base.

For the second question concerning general algorithms, this paper illustrates how we can leverage existing no-regret and no-dynamic regret learning tools appropriately to meet the baseline requirement of equal share with regret guarantees, particularly when human policies are either stationary or evolving slowly. We also present matching lower bounds that show these regret guarantees cannot be significantly improved in the worst case. Our experimental results demonstrate that while our algorithmic solutions consistently outperform static human policies, the self-play meta-algorithms developed in previous state-of-the-art systems for multiplayer games like Mahjong, Poker, and Diplomacy may converge to suboptimal solutions, resulting in the AI agent consistently losing the game even against non-adaptive human players.

1.1 Related work

Human-level or superhuman AI in practice

Building superhuman AI has long been a goal in various games. A large body of work in this line focuses on two-player or two-team zero-sum games like Chess (Campbell et al., 2002), Go (Silver et al., 2016), Heads-Up Texas Hold’em (Moravčík et al., 2017; Brown and Sandholm, 2018), Starcraft 2 (Vinyals et al., 2019), DOTA 2 (Berner et al., 2019) and League of Legends (Ye et al., 2020). Most of them are based on finding equilibria via self-play, fictitious play, league training, etc. There is comparatively little work on games with more than two players, whose game structures are fundamentally different from two-player zero-sum games. Several remarkable multiplayer successes include Poker (Brown and Sandholm, 2019), Mahjong (Li et al., 2020), Doudizhu (Zha et al., 2021) and Diplomacy (Bakhtin et al., 2022; FAIR et al., 2022). Despite lacking a clearly formulated learning objective, these works typically design meta-algorithms, which include initially training the model using behavior cloning from human players, then enhancing it through self-play, and finally applying adaptations based on the game’s specific structure or human expertise. It remains unclear whether such a recipe is generally effective for a wide range of multiplayer games.

Existing results on symmetric games

Von Neumann and Morgenstern (1947) gave the first definition of symmetric games and used the three-player majority-vote example to showcase the stark difference between symmetric three-player zero-sum games and symmetric two-player zero-sum games. In his seminal paper that introduced Nash equilibrium, Nash proved that a symmetric finite multi-player game must have a symmetric Nash equilibrium (Nash, 1951). However, this existence result might mean little from an individual standpoint, as there is no reason a priori to assume that other players are indeed playing according to this symmetric equilibrium. Papadimitriou and Roughgarden (2005) studied the computational complexity of finding a Nash equilibrium in symmetric multi-player games when the number of actions available is much smaller than the number of players, and gave a polynomial-time algorithm for the problem. In this case, symmetry greatly reduces the computational complexity (as computing a Nash equilibrium in general is PPAD-hard). Daskalakis et al. (Daskalakis, 2009) proposed anonymous games, a generalization of symmetric games.

No-regret learning in games

There is a rich literature on applying no-regret learning algorithms to learning equilibria in games. It is well-known that if all agents have no-regret, the resulting empirical average would be an approximate Coarse Correlated Equilibrium (CCE) (Young, 2004), while if all agents have no swap-regret, the resulting empirical average would be an $\epsilon$-Correlated Equilibrium (CE) (Hart and Mas-Colell, 2000; Cesa-Bianchi and Lugosi, 2006). Later work continuing this line of research includes those with faster convergence rates (Syrgkanis et al., 2015; Chen and Peng, 2020; Daskalakis et al., 2021), last-iterate convergence guarantees (Daskalakis and Panageas, 2018; Wei et al., 2020), and extension to extensive-form games (Celli et al., 2020; Bai et al., 2022b; Bai et al., 2022a; Song et al., 2022) and Markov games (Song et al., 2021; Jin et al., 2021).

2 Preliminaries

Notation.

For any set $\mathcal{A}$, its cardinality is represented by $|\mathcal{A}|$, and $\Delta(\mathcal{A})$ denotes the set of probability distributions over $\mathcal{A}$. We employ $\mathcal{A}^{\otimes n}$ to denote the Cartesian product of $n$ instances of $\mathcal{A}$. Given a distribution $x$ over $\mathcal{A}$, $x^{\otimes n}$ represents the joint distribution of $n$ independent copies of $x$, forming a distribution over $\mathcal{A}^{\otimes n}$. For a function $f:\mathcal{A}\rightarrow\mathbb{R}$, we denote $\|f\|_{\infty}:=\max_{a\in\mathcal{A}}|f(a)|$. We use $[n]$ to denote the set $\{1,\ldots,n\}$. In this paper, we use $C$ to denote universal constants, which may vary from line to line.

Normal-form game

An $n$-player normal-form game consists of a finite set of $n$ players, where each player $i$ has an action space $\mathcal{A}_{i}$ and a corresponding payoff function $U_{i}:\mathcal{A}_{1}\times\cdots\times\mathcal{A}_{n}\to[-1,1]$, where $U_{i}(a_{1},\ldots,a_{n})$ denotes the payoff received by the $i$-th player when the $n$ players take the joint action $(a_{1},\ldots,a_{n})$. This paper focuses on symmetric and zero-sum normal-form games (footnote 4: our work trivially extends to symmetric and constant-sum normal-form games, since one can convert constant-sum games to zero-sum games by offsetting all players’ payoff functions by a constant).

Definition 2.1 (Symmetric zero-sum normal-form game).

For an $n$ -player normal-form game with an action space $\mathcal{A}_{i}$ and a payoff $U_{i}$ for player $i$ , we say the game is symmetric if (1) $\mathcal{A}_{i}=\mathcal{A}$ , for all $i\in[n]$ ; (2) for any permutation $\pi$ : $U_{i}(a_{1},\cdots,a_{n})=U_{\pi^{-1}(i)}(a_{\pi(1)},\cdots,a_{\pi(n)})$ . We say the game is zero-sum if $\sum^{n}_{i=1}U_{i}(a)=0$ for all $a\in\mathcal{A}_{1}\times\cdots\times\mathcal{A}_{n}$ .

Briefly speaking, in a symmetric game, the payoffs for employing a specific action are determined solely by the actions used by others, agnostic of the identities of the players using them. Thus, the payoff function of the first player, denoted as $U_{1}$, is sufficient to encapsulate the entire game. For $i\in[n]$, $x_{i}\in\Delta(\mathcal{A})$ denotes a mixed strategy of the $i$-th player and $x_{-i}\in\Delta(\mathcal{A}^{\otimes n-1})$ denotes a mixed strategy of the other players. Correspondingly, $a_{i}\in\mathcal{A}$ denotes the action of the $i$-th player and $a_{-i}\in\mathcal{A}^{\otimes n-1}$ denotes the actions of the other players. For any $a_{i}\in\mathcal{A}$, we denote $U_{i}(a_{i},x_{-i}):=\mathbb{E}_{a_{-i}\sim x_{-i}}\left[U_{i}(a_{i},a_{-i})\right]$.

Best response

Given a mixed strategy $x_{-i}$ of the other $n-1$ players, the best response set ${\rm BR}_{i}(x_{-i})$ of the $i$ -th player is defined as ${\rm BR}_{i}(x_{-i}):={\arg\max}_{a_{i}\in\mathcal{A}_{i}}U_{i}(a_{i},x_{-i})$ .

Equilibrium

Nash equilibrium is the most commonly used solution concept for games: a mixed strategy $x\in\Delta(\mathcal{A}_{1}\times\cdots\times\mathcal{A}_{n})$ of all players is said to be a Nash Equilibrium (NE) if $x$ is a product distribution and no player can gain by deviating from her own strategy while all other players’ strategies are held fixed. That is, for all $i\in[n]$ and $a^{\prime}_{i}\in\mathcal{A}_{i}$, $\mathbb{E}_{a\sim x}[U_{i}(a_{i},a_{-i})]\geq\mathbb{E}_{a\sim x}[U_{i}(a^{\prime}_{i},a_{-i})]$.

There are also two equilibrium notions that relax Nash equilibrium by no longer requiring $x$ to be a product distribution. They allow a general joint distribution $x$, which describes correlated policies among players. In particular, (1) $x$ is a Correlated Equilibrium (CE) if for all $i\in[n]$ and $a^{\prime}_{i}\in\mathcal{A}_{i}$, $\mathbb{E}_{a\sim x}[U_{i}(a_{i},a_{-i})\mid a_{i}]\geq\mathbb{E}_{a\sim x}[U_{i}(a^{\prime}_{i},a_{-i})\mid a_{i}]$, and (2) $x$ is a Coarse Correlated Equilibrium (CCE) if for all $i\in[n]$ and $a^{\prime}_{i}\in\mathcal{A}_{i}$, $\mathbb{E}_{a\sim x}[U_{i}(a_{i},a_{-i})]\geq\mathbb{E}_{a\sim x}[U_{i}(a^{\prime}_{i},a_{-i})]$. The major difference between the two notions is whether an agent who deviates from her current strategy is still allowed to observe the randomness used in drawing actions from the correlated policy. The relationship among the various equilibrium concepts is encapsulated by ${\rm NE}\subset{\rm CE}\subset{\rm CCE}$.

We note that in two-player zero-sum games, a Nash equilibrium is non-exploitable. Formally, if $(\mu^{\star},\nu^{\star})$ is a Nash equilibrium of a symmetric two-player zero-sum game, we have $\min_{\nu}U_{1}(\mu^{\star},\nu)=\max_{\mu}\min_{\nu}U_{1}(\mu,\nu)=0$.

No-regret learning

No-regret learning is a commonly adopted strategy in game theory to find equilibrium solutions. We consider a $T$-step learning procedure, where in each round $t\in[T]$: (1) the agent picks a mixed strategy $\mu^{t}$ over $\mathcal{A}$, and (2) the environment picks an adversarial loss $\ell_{t}\in[0,1]^{|\mathcal{A}|}$. The expected utility for the $t$-th round is defined as $-\langle\mu^{t},\ell_{t}\rangle$. To measure the performance of a particular algorithm, a common approach is to consider the regret, where the algorithm’s performance is compared against the single best action in hindsight. Specifically, for the policy sequence $(\mu^{1},\ldots,\mu^{T})$ taken by an algorithm, we define

\text{Reg}(T):=\sum_{t=1}^{T}\langle\mu^{t},\ell_{t}\rangle-\min_{a\in\mathcal{A}}\sum^{T}_{t=1}\ell_{t}(a).

We say that the algorithm is a no-regret algorithm if $\text{Reg}(T)=o(T)$. One such no-regret learning algorithm is the Hedge algorithm, which performs the following exponential weight updates:

\mu^{t+1}(a)\propto\mu^{t}(a)e^{-\eta_{t}\ell_{t}(a)},\quad\text{for all }a\in\mathcal{A},\qquad(1)

where $\eta_{t}$ is the learning rate.
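To make the update in (1) concrete, the following is a minimal sketch of Hedge together with the regret it incurs. It assumes a constant learning rate $\eta$ rather than a round-dependent $\eta_{t}$, and the alternating loss sequence is a hypothetical input chosen only for illustration; the function names `hedge` and `regret` are not from the paper.

```python
import math

def hedge(losses, eta):
    """Run Hedge with a constant learning rate on loss vectors with entries in [0, 1].

    Returns the mixed strategies mu^1, ..., mu^T that were played.
    """
    num_actions = len(losses[0])
    log_w = [0.0] * num_actions  # log-weights, so mu^1 is uniform
    played = []
    for loss in losses:
        top = max(log_w)
        w = [math.exp(v - top) for v in log_w]  # shift by the max for numerical stability
        total = sum(w)
        played.append([x / total for x in w])
        # exponential-weights update: mu^{t+1}(a) proportional to mu^t(a) * exp(-eta * loss_t(a))
        log_w = [v - eta * l for v, l in zip(log_w, loss)]
    return played

def regret(played, losses):
    """Reg(T): cumulative expected loss minus the loss of the best fixed action."""
    incurred = sum(sum(p * l for p, l in zip(mu, loss)) for mu, loss in zip(played, losses))
    best_fixed = min(sum(loss[a] for loss in losses) for a in range(len(losses[0])))
    return incurred - best_fixed

# Hypothetical loss sequence over three actions that alternates every round.
T = 200
losses = [[1.0, 0.0, 0.5] if t % 2 == 0 else [0.0, 1.0, 0.5] for t in range(T)]
eta = math.sqrt(8 * math.log(3) / T)  # a standard tuning for a known horizon T
played = hedge(losses, eta)
print(regret(played, losses))
```

With this tuning, the classical analysis of exponential weights bounds the regret by $\sqrt{T\log|\mathcal{A}|/2}$ for any loss sequence in $[0,1]$, which is sublinear in $T$.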

Relation between no-regret learning and equilibria

It is known that in the normal-form game setup, if each player independently runs a no-regret learning algorithm, treating all other players as an adversarial environment, and we denote by $x^{t}=x_{1}^{t}\times\cdots\times x_{n}^{t}$ the joint strategy of all players at round $t$, then the average policy $(1/T)\sum_{t=1}^{T}x^{t}$ converges to the set of CCEs as $T$ grows.

3 Equilibria and Self-play are Insufficient for Multiplayer Games

In a multiplayer symmetric zero-sum game, every player is created equal, and the equal share for each player is zero. Intuitively, one might expect a good player to receive a non-negative expected payoff, making this a minimal baseline for a well-designed AI agent. However, in this section, we demonstrate that both the existing equilibrium notions and the algorithmic framework of self-play from scratch, which is widely used in achieving superhuman performance in practical games, are insufficient to secure this seemingly straightforward goal.

To illustrate this, we consider the following $3$ -player majority vote game involving an AI agent and two other players.

Example 1 (Three-player majority vote game).

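In each round, every player chooses either $0$ or $1$, with the rule that being in the majority yields a positive payoff of $1$, while being in the minority results in a negative payoff of $-2$. When all three players choose the same action, every player receives $0$, consistent with the zero-sum property.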
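The example can be encoded in a few lines and checked against Definition 2.1. The sketch below assumes that a unanimous vote pays $0$ to every player, which is what the zero-sum property requires and what the payoff formula used in the proof of Claim 4.2 implies; the names `majority_payoff`, `is_zero_sum` and `is_symmetric` are illustrative only.

```python
from itertools import permutations, product

def majority_payoff(i, actions):
    """Payoff of player i (0-indexed) in the three-player majority vote game."""
    same = sum(1 for a in actions if a == actions[i])  # players choosing i's action
    if same == 3:
        return 0  # assumption: a unanimous vote pays 0 to everyone
    return 1 if same == 2 else -2  # majority gets +1, the lone minority gets -2

profiles = list(product((0, 1), repeat=3))

# Zero-sum: the payoffs add up to 0 in every pure action profile.
is_zero_sum = all(sum(majority_payoff(i, a) for i in range(3)) == 0 for a in profiles)

def is_symmetric():
    """Check U_i(a) = U_{pi^{-1}(i)}(a_{pi(1)}, ..., a_{pi(n)}) for every permutation pi."""
    for a in profiles:
        for pi in permutations(range(3)):
            permuted = tuple(a[pi[j]] for j in range(3))
            for j in range(3):
                # player j in the permuted profile plays the action of player pi[j]
                if majority_payoff(j, permuted) != majority_payoff(pi[j], a):
                    return False
    return True

print(is_zero_sum, is_symmetric())
```

Because the payoff depends only on a player's own action and on how many players share it, relabelling the players cannot change anyone's payoff, which is exactly the symmetry condition.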

Limitation of equilibria

In this setup, both strategies $(0,0,0)$ and $(1,1,1)$ constitute NE. However, the existence of multiple NEs creates a predicament for the AI agent. It must choose which equilibrium to follow, yet there is always the risk that the other two players coordinate on the opposing NE, leading to a negative payoff for the AI agent. In other words, adhering to a single NE does not reliably secure an equal share when multiple equilibria exist. Since ${\rm NE}\subset{\rm CE}\subset{\rm CCE}$, the same limitation also holds for CE and CCE.
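The two pure equilibria and the cost of miscoordination can be verified by brute force. This is a small sketch under the assumption that a unanimous vote pays $0$ to every player; `payoff` and `is_pure_nash` are hypothetical helper names.

```python
from itertools import product

def payoff(i, actions):
    """Payoff of player i in the three-player majority vote game."""
    same = sum(1 for a in actions if a == actions[i])
    if same == 3:
        return 0  # assumption: a unanimous vote pays 0 to everyone
    return 1 if same == 2 else -2

def is_pure_nash(profile):
    """True if no single player can strictly gain by switching her own action."""
    for i in range(3):
        for deviation in (0, 1):
            deviated = list(profile)
            deviated[i] = deviation
            if payoff(i, tuple(deviated)) > payoff(i, profile):
                return False
    return True

pure_nash = [p for p in product((0, 1), repeat=3) if is_pure_nash(p)]
print(pure_nash)

# The agent (player 0) follows the equilibrium (0, 0, 0) while the
# other two players coordinate on (1, 1, 1).
miscoordinated = payoff(0, (0, 1, 1))
print(miscoordinated)
```

Every profile with a lone dissenter fails the check because the dissenter can move from $-2$ to $0$ by joining the others, so only the two unanimous profiles survive.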

Limitation of self-play from scratch

Self-play is a training method in which an AI agent improves its performance by repeatedly playing against copies of itself without human supervision. Typical self-play is carried out by optimizing one agent’s policy (denoted as the main agent) while fixing all other agents’ policies to be the current policy or the earlier versions of the main agent. When no-regret learning algorithms such as Hedge are adopted as the optimizer, the self-play algorithm is guaranteed to converge to a CCE. Moreover, it may converge to a certain “good” subclass of CCEs depending on the choice of specific optimizer and initialization.

Here, we argue that self-play from scratch (Brown and Sandholm, 2019) cannot always achieve a non-negative expected payoff in multiplayer symmetric zero-sum games. The example is again the three-player majority vote game. Imagine two parallel worlds, with the two other players always playing 1 in one world and 0 in the other. The learner must play the exact same action as the other two players to receive a non-negative expected payoff. Since self-play from scratch never inspects the other players’ actual policies, the learner cannot tell which world it is in, and thus loses in at least one of the two worlds.

We note that several recent systems (Li et al., 2020; Jacob et al., 2022) combine self-play with human policy modeling, which avoids the drawback of self-play from scratch. However, our experiments demonstrate that many of these solutions remain ineffective even in some simple normal-form games (see Section 6).

4 New Solution Concepts

The limitation of standard equilibria leads us to the following question: What solution concept should be adopted to secure an equal share (non-negative expected payoff) in multiplayer symmetric games? In this section, we provide a clear answer to this question.

Diverse strategies by opponents

We start by considering the general cases where opponents can possibly deploy different strategies.

Claim 4.1.

There exist symmetric zero-sum games such that no policy achieves non-negative expected payoff if the other players, even without collusion, are permitted to adopt different strategies.

Here, we define “no collusion” by requiring the policies of other players to be statistically independent without any shared randomness.

Proof.

We consider a $3$-player minority game involving an AI agent and two other players. The rule of the game is that being in the minority yields a positive payoff of $2$, while being in the majority results in a negative payoff of $-1$. If the other two players play $0$ and $1$, respectively, then whichever action the AI agent picks, it matches exactly one opponent and lands in the majority. The AI agent therefore receives a payoff of $-1$, which is strictly less than $0$. Since the other two players are simply playing deterministic policies, they are independent, i.e., without collusion. ∎
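The argument in the proof can be checked numerically. The sketch below assumes that a unanimous profile pays $0$ to everyone (the proof does not need this case); `minority_payoff` and `agent_value` are illustrative names.

```python
from itertools import product

def minority_payoff(i, actions):
    """Payoff of player i (0-indexed) in the three-player minority game."""
    same = sum(1 for a in actions if a == actions[i])
    if same == 3:
        return 0  # assumption: a unanimous profile pays 0 to everyone
    return 2 if same == 1 else -1  # lone minority gets +2, majority gets -1

def agent_value(beta):
    """Expected agent payoff when it plays 0 w.p. beta and the opponents play 0 and 1."""
    return beta * minority_payoff(0, (0, 0, 1)) + (1 - beta) * minority_payoff(0, (1, 0, 1))

zero_sum = all(
    sum(minority_payoff(i, a) for i in range(3)) == 0
    for a in product((0, 1), repeat=3)
)
values = [agent_value(k / 10) for k in range(11)]
print(zero_sum, values)
```

Both pure actions give the agent $-1$, so every mixture of them does as well, and no policy reaches a non-negative expected payoff against this pair of opponents.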

This observation underscores a crucial aspect of the multiplayer symmetric game dynamic where diverse strategies by opponents can drastically affect the possibility of achieving an equal share.

Identical strategy by opponents

Claim 4.1 leaves us no choice but to consider the setting where opponents must deploy an identical strategy. While it may seem limiting to assume that opponents employ identical strategies, we argue that this assumption is actually already implicit in the majority of practical systems (Li et al., 2020; Brown and Sandholm, 2019; Bakhtin et al., 2022; FAIR et al., 2022). These systems utilize a self-play framework, equating the strategies of all opponents with those of the learner. The identical-strategy assumption is also naturally satisfied in modern multiplayer gaming platforms with a large player base (see Section 4.1).

After making the strategies of opponents equal, one may wonder whether we can find a fixed non-exploitable policy similar to the case of two-player zero-sum games.

Claim 4.2.

There exist symmetric zero-sum games such that no fixed strategy guarantees a non-negative expected payoff against adversarial strategies by opponents, even under the constraint that they adhere to identical strategies.

Proof.

Imagine an AI agent is involved in a $3$-player majority game and plays a mixed strategy $(\beta,1-\beta)$ (i.e., plays $0$ w.p. $\beta$ and $1$ w.p. $1-\beta$), while the other two players adopt an identical mixed strategy $(p,1-p)$ (i.e., play $0$ w.p. $p$ and $1$ w.p. $1-p$). Then, we can calculate the payoff of the AI agent as $U_{1}(\beta,p,p)=\beta(-2(1-p)^{2}+2p(1-p))+(1-\beta)(-2p^{2}+2p(1-p))$. It then follows that $\max_{\beta\in[0,1]}\min_{p\in[0,1]}U_{1}(\beta,p,p)\leq\max_{\beta\in[0,1]}\min_{p\in\{0,1\}}U_{1}(\beta,p,p)=-1,$ which is strictly less than $0$. ∎
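For completeness, the value $-1$ in the last step can be obtained by evaluating the payoff formula above at the two pure opponent strategies. The following worked computation only substitutes $p=1$ and $p=0$ into that formula.

```latex
\begin{aligned}
U_{1}(\beta,p,p)\big|_{p=1} &= \beta\cdot 0+(1-\beta)\cdot(-2) = -2(1-\beta),\\
U_{1}(\beta,p,p)\big|_{p=0} &= \beta\cdot(-2)+(1-\beta)\cdot 0 = -2\beta,\\
\min_{p\in\{0,1\}}U_{1}(\beta,p,p) &= -2\max\{\beta,1-\beta\}\leq -1 .
\end{aligned}
```

The bound is attained at $\beta=1/2$, so the maximum over $\beta$ of the restricted minimum equals $-1$.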

This observation again makes multiplayer games significantly different from two-player zero-sum games. This leads to the following conclusion:

To reliably secure an equal share in symmetric games, an agent must be in a scenario where: (a) opponents employ the identical strategy, and (b) the AI agent adapts its strategy to the opponents’ strategies, which involves opponent modeling.

It is not difficult to find solution concepts—such as (1) behavior cloning for the strategy by opponents; and (2) best response to the identical strategy by opponents—that satisfy both criteria, and are guaranteed to secure an equal share.

Finally, we rigorously summarize our discussion using mathematical language, detailing the relationships between various minimax quantities.

Proposition 4.3.

For symmetric zero-sum games, we have

$\displaystyle\max_{x_{1}}\min_{x_{2},\cdots,x_{n}}U_{1}(x_{1},\cdots,x_{n})\leq\min_{x_{2},\cdots,x_{n}}\max_{x_{1}}U_{1}(x_{1},\cdots,x_{n})\leq\min_{x}\max_{x_{1}}U_{1}(x_{1},x^{\otimes n-1})=0,$
$\displaystyle\max_{x_{1}}\min_{x}U_{1}(x_{1},x^{\otimes n-1})\leq\min_{x}\max_{x_{1}}U_{1}(x_{1},x^{\otimes n-1})=0,$

where all the inequalities can be made strict in certain games.

4.1 Multiplayer games with a large player base

We argue that the assumption of all opponents playing an identical strategy is well-justified in modern multiplayer gaming platforms with a large player base. Imagine a casino hosting $N$ players who randomly join poker tables, or an online Mahjong match-making platform with $N$ users. Let $\{x_{i}\}^{N}_{i=1}$ be the strategy set for these $N$ players. We can then define the meta-strategy of the population as $\bar{x}=(1/N)\sum_{i=1}^{N}x_{i}$. The following proposition claims that, for $n$-player symmetric zero-sum games, as long as $N\gg(n-2)^{2}$, the payoff achieved by playing against $n-1$ players drawn uniformly from the player pool is almost the same as playing against $n-1$ players who all adopt the same meta-strategy $\bar{x}$.

Proposition 4.4.

Let $\mathbb{E}_{x_{-1}}$ be the expectation over the randomness of sampling $n-1$ strategies uniformly from the set $\{x_{i}\}^{N}_{i=1}$ without replacement. Then for any policy $z\in\Delta(\mathcal{A})$, we have $|\mathbb{E}_{x_{-1}}[U_{1}(z,x_{-1})]-U_{1}(z,\bar{x}^{\otimes n-1})|\leq 2(n-2)^{2}/N$.
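The proposition can be illustrated on a small instance with $n=3$. The sketch below assumes the three-player majority game with payoffs rescaled by $1/2$ so that they lie in $[-1,1]$, a hypothetical population of $N=50$ random mixed strategies, and an arbitrary agent policy $z$; it enumerates all ordered pairs of distinct opponents, so the sampling expectation is exact.

```python
import random
from itertools import permutations

def u1(a, b, c):
    """Agent payoff in the majority game, rescaled by 1/2 to lie in [-1, 1]."""
    if a == b == c:
        return 0.0  # assumption: a unanimous vote pays 0
    return 0.5 if (a == b or a == c) else -1.0

def expected_u1(z, y2, y3):
    """Expected agent payoff when actions are drawn independently from z, y2 and y3."""
    return sum(z[a] * y2[b] * y3[c] * u1(a, b, c)
               for a in (0, 1) for b in (0, 1) for c in (0, 1))

def population_gap(z, population):
    """|E[U_1(z, x_{-1})] - U_1(z, x_bar, x_bar)| for two opponents drawn without replacement."""
    size = len(population)
    pairs = list(permutations(range(size), 2))  # ordered pairs of distinct players
    sampled = sum(expected_u1(z, population[i], population[j]) for i, j in pairs) / len(pairs)
    x_bar = [sum(x[a] for x in population) / size for a in (0, 1)]
    return abs(sampled - expected_u1(z, x_bar, x_bar))

rng = random.Random(0)  # fixed seed for reproducibility
N = 50
population = []
for _ in range(N):
    p = rng.random()
    population.append([p, 1.0 - p])

z = [0.3, 0.7]  # hypothetical agent policy
gap = population_gap(z, population)
bound = 2 * (3 - 2) ** 2 / N  # the bound 2 (n - 2)^2 / N with n = 3
print(gap, bound)
```

For $n=3$ the payoff is bilinear in the two opponent strategies, so the gap comes only from the missing diagonal terms $U_{1}(z,x_{i},x_{i})$ in the sampling average, and that contribution shrinks at rate $1/N$.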

5 Efficient Algorithms

In this section, we explore efficient algorithms that are able to secure an average payoff within a framework where all opponents adopt a common strategy $y^{t}$ for each round $t\in[T]$ . Correspondingly, the payoff function of the AI agent is given by $u^{t}(\cdot):=U_{1}(\cdot,(y^{t})^{\otimes n-1})$ . Given that the rules of the game are known to the player, the player is actually aware of $U_{1}$ . We approach this exploration through two distinct scenarios: (i) Section 5.1 examines scenarios with a stationary opponent, wherein $y^{t}$ remains the same over time; (ii) Section 5.2 transitions to considering slowly changing adversarial opponents, limiting their power by imposing a variation budget.

5.1 Fixed opponents

We begin by exploring the simple stationary scenario, where the common strategy adopted by the opponent remains constant over time, denoted as $y^{t}=y$ for all $t\in[T]$ .

Notably, in this particular scenario, the payoff function $u^{t}(\cdot)$ remains constant over time. Additionally, by symmetry and the zero-sum property, $U_{1}(y,y^{\otimes n-1})=0$, and since the best pure action does at least as well as the mixture $y$, we have $\max_{a\in\mathcal{A}}U_{1}(a,y^{\otimes n-1})\geq 0$. This indicates the existence of at least one action that yields a non-negative expected payoff in every round. Thus, any no-regret algorithm achieves an asymptotically non-negative average payoff in this setting. For instance, we deploy the Hedge algorithm, which provides the following guarantee:

Theorem 5.1 (Stationary opponents).

Let $\{x^{t}\}^{T}_{t=1}$ be the strategy sequence implemented by the Hedge algorithm against stationary opponents. Then, with probability at least $1-\delta$ , we have

\textstyle(1/T)\sum^{T}_{t=1}u^{t}(x^{t})\geq u^{\star}-C\sqrt{\log(A/\delta)/T},

for some absolute constant $C>0$, where $A:=|\mathcal{A}|$ and $u^{\star}:=\max_{a\in\mathcal{A}}U_{1}(a,y^{\otimes n-1})\geq 0$.

Theorem 5.1 claims that with stationary opponents, the Hedge algorithm is capable of achieving approximately an equal share when $T$ is large, demonstrating its effectiveness in the long term.
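A small simulation illustrates the stationary case. The sketch below makes several simplifying assumptions that are not part of the theorem: it uses the three-player majority game, a hypothetical opponent strategy $y=(0.8,0.2)$, a constant learning rate, and it feeds Hedge the exact expected payoff vector $u(\cdot)=U_{1}(\cdot,y,y)$ each round instead of noisy realized payoffs.

```python
import math

def u1(a, b, c):
    """Agent payoff in the three-player majority game."""
    if a == b == c:
        return 0.0  # assumption: a unanimous vote pays 0
    return 1.0 if (a == b or a == c) else -2.0

def payoff_vector(y):
    """u(a) = U_1(a, y, y) for each pure action a of the agent."""
    return [sum(y[b] * y[c] * u1(a, b, c) for b in (0, 1) for c in (0, 1)) for a in (0, 1)]

def hedge_average_payoff(y, T):
    """Average payoff of Hedge against two stationary opponents who both play y."""
    u = payoff_vector(y)
    losses = [(1.0 - ua) / 3.0 for ua in u]  # affine map from payoffs in [-2, 1] to losses in [0, 1]
    eta = math.sqrt(8 * math.log(2) / T)
    cumulative = [0.0, 0.0]
    total = 0.0
    for _ in range(T):
        low = min(cumulative)
        w = [math.exp(-eta * (c - low)) for c in cumulative]  # shifted for numerical stability
        norm = sum(w)
        x = [wi / norm for wi in w]
        total += sum(xa * ua for xa, ua in zip(x, u))
        cumulative = [c + l for c, l in zip(cumulative, losses)]
    return total / T, max(u)

average, u_star = hedge_average_payoff([0.8, 0.2], T=500)
print(average, u_star)
```

In this instance the better action is to vote with the opponents' more likely choice, and Hedge concentrates on it after a short initial phase, so the average payoff approaches $u^{\star}$ from below.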

5.2 Adaptive opponents

In practical scenarios, encountering a fixed opponent strategy is relatively uncommon. More often, opponents adapt and modify their strategies over time, responding to the game’s dynamics and the actions of other players. Thus, in this section, we shift our focus to the non-stationary scenario, where the common strategy $y^{t}$ adopted by the opponent varies over time.

Challenges when facing fast adapting opponents

In multiplayer games, a significant challenge arises if opponents can change their strategies arbitrarily fast as demonstrated by the following fact.

Fact 5.2.

For a $3$-player majority game, there exists a stochastic evolution of the opponents’ common strategy such that the expected total payoff of any algorithm over the $T$ rounds is at most $-T$.
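One evolution consistent with this fact, offered here as an illustrative assumption rather than as the construction used in the source, is the following: in every round both opponents play the same action $c^{t}$, drawn uniformly from $\{0,1\}$ independently of the past. If the agent plays $0$ with probability $\beta_{t}$, chosen before $c^{t}$ is revealed, then

```latex
\mathbb{E}\left[u^{t}(x^{t})\right]
=\tfrac{1}{2}\,U_{1}(x^{t},0,0)+\tfrac{1}{2}\,U_{1}(x^{t},1,1)
=\tfrac{1}{2}\bigl(-2(1-\beta_{t})\bigr)+\tfrac{1}{2}\bigl(-2\beta_{t}\bigr)=-1,
\qquad\text{so}\qquad
\mathbb{E}\Bigl[\sum_{t=1}^{T}u^{t}(x^{t})\Bigr]=-T .
```

The value does not depend on $\beta_{t}$, so no choice of strategy, adaptive or not, improves on $-1$ per round in expectation.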

In light of this negative result, it is clear that securing an equal share is impossible if arbitrary evolution is allowed. Thus we introduce a constraint on this evolution by positing a variation budget $V_{T}$, which bounds the total variation of the payoff function across the time horizon. Specifically, we assume the payoff function sequence belongs to $\mathcal{U}$, which is defined as

\textstyle\mathcal{U}:=\left\{\{u^{t}\}^{T}_{t=1}\,\Bigg|\,\sum_{t=1}^{T-1}\|u^{t+1}-u^{t}\|_{\infty}\leq V_{T}\right\}.\qquad(2)

Furthermore, we denote $\mathcal{G}(n,A,V_{T})$ as the set of symmetric zero-sum games involving $n$ players, $A$ actions, and the payoff function $\{u^{t}\}^{T}_{t=1}\in\mathcal{U}$ . This constraint effectively moderates the power of the opponents compared to a fully adversarial setup.

Slowly adapting opponents

Minimizing standard regret is only effective in the stationary environment when a fixed action consistently yields a satisfactory payoff. In non-stationary environments, we turn our attention to dynamic regret, defined as:

\displaystyle\textstyle{\text{D-Reg}(T)}:=\sum_{t=1}^{T}\max_{a\in\mathcal{A}}% u^{t}(a)-\sum^{T}_{t=1}u^{t}(x^{t}).

This measures a strategy’s performance against the best action at each time step (dynamic oracle), providing a more relevant benchmark in changing environments.
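The difference between the two benchmarks, and the variation quantity bounded by $V_{T}$, can be seen on a toy sequence. The payoff vectors and strategies below are hypothetical numbers chosen for illustration, and the helper names are not from the paper.

```python
def variation(payoffs):
    """Sum over t of ||u^{t+1} - u^t||_inf for a list of payoff vectors."""
    return sum(
        max(abs(b - a) for a, b in zip(u, u_next))
        for u, u_next in zip(payoffs, payoffs[1:])
    )

def expected_payoff(x, u):
    """Expected payoff of mixed strategy x under payoff vector u."""
    return sum(xa * ua for xa, ua in zip(x, u))

def dynamic_regret(payoffs, strategies):
    """D-Reg(T): payoff of the per-round best action minus the payoff actually earned."""
    oracle = sum(max(u) for u in payoffs)
    earned = sum(expected_payoff(x, u) for x, u in zip(strategies, payoffs))
    return oracle - earned

def static_regret(payoffs, strategies):
    """Standard regret in payoff form: best fixed action in hindsight minus payoff earned."""
    best_fixed = max(sum(u[a] for u in payoffs) for a in range(len(payoffs[0])))
    earned = sum(expected_payoff(x, u) for x, u in zip(strategies, payoffs))
    return best_fixed - earned

# Hypothetical three-round example in which the better action switches in the last round.
payoffs = [[0.2, -0.9], [0.1, -0.5], [-0.5, 0.1]]
strategies = [[1.0, 0.0]] * 3  # the learner always plays the first action
print(variation(payoffs), static_regret(payoffs, strategies), dynamic_regret(payoffs, strategies))
```

Here the learner matches the best fixed action, so its standard regret is zero, yet it still falls short of the dynamic oracle because the best action changes in the final round.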

Note that in the setting of symmetric games, the dynamic oracle is always assured to secure an equal share, i.e., $\sum_{t=1}^{T}\max_{a\in\mathcal{A}}u^{t}(a)\geq 0$. Thus, if an algorithm achieves no-dynamic-regret, then it is guaranteed to achieve an asymptotically non-negative average payoff even as the scenario evolves over time. To this end, we adapt a no-dynamic-regret algorithm, the Strongly Adaptive Online Learner with Hedge $\mathcal{H}$ as a blackbox algorithm ($\mathrm{SAOL}^{\mathcal{H}}$) proposed by Daniely et al. (2015), to our setting and achieve the following guarantee:

Theorem 5.3.

Suppose that $n\geq 3$, $A\geq 2$, and $V_{T}\in[1,T]$. Then there exists some absolute constant $C>0$ such that, for any fixed game in the set $\mathcal{G}(n,A,V_{T})$, with probability at least $1-\delta$, $\mathrm{SAOL}^{\mathcal{H}}$ satisfies ${\text{D-Reg}(T)}\leq CV^{1/3}_{T}T^{2/3}\left(\sqrt{\log(A/\delta)}+\log T\right)$.

Theorem 5.3 implies that the payoff of $\mathrm{SAOL}^{\mathcal{H}}$ achieves an average payoff $(1/T)\sum^{T}_{t=1}u^{t}(x^{t})\geq\sum_{t=1}^{T}\max_{a\in\mathcal{A}}u^{t}(a% )-\tilde{O}(V_{T}^{1/3}T^{-1/3})$ . Therefore, if $V_{T}$ is sublinear in $T$ , $\mathrm{SAOL}^{\mathcal{H}}$ is capable of approximately achieving zero expected payoff over an extended duration.

Middle regime

Interestingly, there is a middle regime in which opponents change reasonably fast while the value of the dynamic oracle $\sum_{t=1}^{T}\max_{a\in\mathcal{A}}u^{t}(a)$ is not high. In this case, a favorable policy for the learner may be simple behavior cloning, i.e., mimicking the opponents' strategies.

Formally, the behavior cloning algorithm has the learner play in round $t$ the action taken by Player 2 in round $t-1$ (see Algorithm 1). Behavior cloning achieves the following:

Theorem 5.4.

Suppose that $n\geq 3$, $A\geq 2$, and $V_{T}\in[1,T]$. Then behavior cloning guarantees that $\inf_{\mathcal{G}(n,A,V_{T})}\mathbb{E}\left[\sum^{T}_{t=1}u^{t}(x^{t})\right]\geq-V_{T}-1$, where the expectation is taken over the noisy actions.

Theorem 5.4 shows that the expected average payoff achieved by behavior cloning is at least $-(V_{T}+1)/T=-O(V_{T}/T)$. We note that this can be better than the guarantee of $\mathrm{SAOL}^{\mathcal{H}}$ when the variation budget $V_{T}$ is large and the value of the dynamic oracle $\sum_{t=1}^{T}\max_{a\in\mathcal{A}}u^{t}(a)$ is small.
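The bound of Theorem 5.4 can be illustrated numerically. The sketch below assumes the 3-player majority vote payoffs of Section 6 (0.25 for each majority member, -0.5 for the lone minority player) and assumes both opponents play the same mixed strategy, with $y^{t}$ the probability of action 0 in round $t$. The drifting sequence `ys` is hypothetical. Expected payoffs are computed exactly rather than by sampling.

```python
def payoff_vector(y):
    """u(a) = U_1(a, y, y): learner's expected payoff per action in 3-player MV."""
    split = 0.25 * 2 * y * (1 - y)        # opponents disagree, learner joins the majority
    return [split - 0.5 * (1 - y) ** 2,   # action 0: alone if both opponents play 1
            split - 0.5 * y ** 2]         # action 1: alone if both opponents play 0


def behavior_cloning_value(ys):
    """Exact expected total payoff when the learner copies Player 2's last action."""
    us = [payoff_vector(y) for y in ys]
    total = 0.5 * (us[0][0] + us[0][1])   # round 1: uniformly random action
    for t in range(1, len(ys)):
        prev = ys[t - 1]                  # the cloned action is distributed as y^{t-1}
        total += prev * us[t][0] + (1 - prev) * us[t][1]
    return total


def variation(ys):
    """Variation of the induced payoff vectors, as in the definition of V_T."""
    us = [payoff_vector(y) for y in ys]
    return sum(max(abs(b - a) for a, b in zip(u, v)) for u, v in zip(us, us[1:]))


T = 200
ys = [0.1 + 0.8 * t / (T - 1) for t in range(T)]  # hypothetical slow drift
value = behavior_cloning_value(ys)
budget = variation(ys)
stationary_value = behavior_cloning_value([0.3] * 50)
```

Against stationary opponents, only the first (uniform) round costs anything, because cloning a sample from $y$ yields $U_{1}(y,y,y)=0$ in expectation.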

Fundamental limit

Finally, we also complement our upper bounds by matching lower bounds showing that $\mathrm{SAOL}^{\mathcal{H}}$ and behavior cloning are already the near-optimal algorithms in terms of rates when competing with dynamic oracles and zero payoff respectively. The techniques are based on adapting existing hard instances for a more general setup to the symmetric zero-sum game setting. Please see more discussion in Appendix D.

Theorem 5.5.

There exists an absolute constant $C>0$ such that for any $n\geq 3$, $A\geq 2$, and $V_{T}\in[1,T]$, it holds that $\inf_{\text{Alg}}\sup_{\mathcal{G}(n,A,V_{T})}\mathbb{E}_{\text{Alg}}[\text{D-Reg}(T)]\geq CV^{1/3}_{T}T^{2/3}$, where the expectation $\mathbb{E}_{\text{Alg}}[\cdot]$ is taken over the noisy actions and the intrinsic randomness in the algorithm.

Theorem 5.6.

There exists an absolute constant $C>0$ such that for any $n\geq 3$, $A\geq 2$, and $V_{T}\in[1,T]$, it holds that $\sup_{\text{Alg}}\inf_{\mathcal{G}(n,A,V_{T})}\mathbb{E}_{\text{Alg}}\left[\sum^{T}_{t=1}u^{t}(x^{t})\right]\leq-CV_{T}$, where $\{x^{t}\}^{T}_{t=1}$ is the policy sequence implemented by the algorithm, and the expectation $\mathbb{E}_{\text{Alg}}[\cdot]$ is taken over the noisy actions and the intrinsic randomness in the algorithm.

6 Experiments

In this section, we focus on the scenario where one AI agent competes against $n-1$ human players who play the identical meta-strategy. We aim to answer: (Q1) Can existing algorithmic frameworks in previous superhuman AI systems consistently secure an equal share when competing against human players in games? If not, what are the patterns of the failure cases? (Q2) Are these trained agents exploitable by adversarial opponents? We design the following two games to compare our proposed methods and other baselines based on self-play.

Majority Vote (MV).

We first consider the standard majority vote game (Example 1) with $3$ players. A payoff of 0.5 is evenly distributed among the majority, and a payoff of -0.5 is evenly distributed among the minority. Note that every player receives 0 if all players choose the same action. Here the mixed strategy of the human population is assumed to be $\pi_{\text{human}}=[0.49,0.51]$. It can be seen that [1, 0], [0, 1], and [1/2, 1/2] are all NE.
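The expected utilities in this game can be computed exactly by enumerating action profiles. The sketch below assumes the payoff rule exactly as stated above; the helper names `mv_payoff` and `utility` are hypothetical. An odd number of players is assumed so that ties cannot occur.

```python
from itertools import product


def mv_payoff(actions):
    """Payoff of the first player in the majority vote (odd number of players)."""
    me = actions[0]
    same = sum(1 for a in actions if a == me)   # players sharing my action, incl. me
    if same == len(actions):
        return 0.0                              # unanimous: everyone receives 0
    in_majority = same > len(actions) / 2
    return 0.5 / same if in_majority else -0.5 / same


def utility(agent, opponent, n=3):
    """U_1(agent, opponent^{n-1}) for mixed strategies over actions {0, 1}."""
    total = 0.0
    for profile in product(range(2), repeat=n):
        prob = agent[profile[0]]
        for a in profile[1:]:
            prob *= opponent[a]
        total += prob * mv_payoff(profile)
    return total


pi_human = [0.49, 0.51]
u_follow = utility([0.0, 1.0], pi_human)          # side with the likelier action
u_against = utility([1.0, 0.0], pi_human)         # side with the less likely action
worst_case = utility([0.0, 1.0], [1.0, 0.0])      # both opponents coordinate against
```

Under these assumptions, playing [0, 1] earns about $0.49\times 10^{-2}$ and playing [1, 0] about $-0.51\times 10^{-2}$ against $\pi_{\text{human}}$, while coordinated opponents hold either pure strategy to $-0.5$.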

Switch Dominance Game (SDG).

In each round, players simultaneously choose one of three actions $A$, $B$, or $C$. Let $N$ be the total number of players and $n_{A}$ be the number of players choosing action $A$. We define the game rule as:

$$\begin{cases}B\succ A\succ C&\text{if }n_{A}>0.2N,\\ C\succ B\succ A&\text{otherwise},\end{cases}$$

where the rule $i\succ j\succ k$ intuitively means that action $i$ dominates both $j$ and $k$ , and action $j$ dominates $k$ . SDG is designed so that $C$ is a dominated action when there is a reasonable number of players taking action $A$ , but a dominating action otherwise. Concretely, for $i\succ j\succ k$ , we use the payoff defined as:

$$\begin{aligned}
r_{i}&=\mathbb{I}[n_{j}+n_{k}>0],\\
r_{j}&=\mathbb{I}[n_{k}>0]-\mathbb{I}[n_{j}+n_{k}>0]\frac{n_{i}}{n_{j}+n_{k}},\\
r_{k}&=-\mathbb{I}[n_{j}+n_{k}>0]\frac{n_{i}}{n_{j}+n_{k}}-\mathbb{I}[n_{k}>0]\frac{n_{j}}{n_{k}}.
\end{aligned}$$

This payoff design guarantees that SDG is a zero-sum game. Throughout our experiments, we choose $N=30$ and the human meta-strategy $\pi_{\text{human}}=[0.399,0.6,0.001]$ (in the order $A$, $B$, $C$). Note that although [0, 0, 1] is an NE policy of this game, it earns a negative utility when all other players play $\pi_{\text{human}}$.
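The SDG payoff rule can be written down directly. The following is a minimal sketch under the definitions above, with the convention that an indicator multiplying an undefined ratio yields 0; the function name `sdg_payoffs` is hypothetical.

```python
def sdg_payoffs(n_a, n_b, n_c):
    """Per-player payoffs (r_A, r_B, r_C) given the number of players per action."""
    total = n_a + n_b + n_c
    counts = {"A": n_a, "B": n_b, "C": n_c}
    # n_A > 0.2 N is tested in integers (5 n_A > N) to avoid float rounding.
    i, j, k = ("B", "A", "C") if 5 * n_a > total else ("C", "B", "A")
    n_i, n_j, n_k = counts[i], counts[j], counts[k]

    below = n_j + n_k                                # players on dominated actions
    share = n_i / below if below > 0 else 0.0        # I[n_j+n_k>0] * n_i/(n_j+n_k)
    r = {
        i: 1.0 if below > 0 else 0.0,
        j: (1.0 if n_k > 0 else 0.0) - share,
        k: -share - (n_j / n_k if n_k > 0 else 0.0),
    }
    return r["A"], r["B"], r["C"]


# A lone B player facing 29 players on C, and a lone C player facing 29 on A.
lone_b = sdg_payoffs(0, 1, 29)[1]
lone_c = sdg_payoffs(29, 0, 1)[2]
```

Both lone players receive $-29$ in these two profiles, the value reported as exploitability for SDG in Table 2.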

6.1 Learning algorithms

To better focus on the game-theoretical part of the algorithms, we idealize the process of imitation learning by assuming that the agent has access to the human population policy $\pi_{\text{human}}$. We compare our solution concept, best response to behavior cloning (BR_BC), with three meta-algorithms previously adopted to build superhuman AI agents: self-play from scratch (SP_scratch) (Brown and Sandholm, 2019), self-play from behavior cloning (SP_BC) (Li et al., 2020), and self-play from behavior cloning with regularization towards human behavior (SP_BC_reg) (Jacob et al., 2022). While these AI systems further implement multi-step lookahead with a few additional techniques, many of these techniques apply only to sequential games rather than basic normal-form games. Here, we focus on comparing the high-level meta-algorithms.

Algorithm details.

We use the Hedge algorithm as the backbone for self-play. Whereas the vanilla SP_scratch is initialized with the uniform action distribution, SP_BC starts from the human policy $\pi_{\text{human}}$, and SP_BC_reg (Jacob et al., 2022) further regularizes the KL divergence between the learned policy and $\pi_{\text{human}}$ during training. The Hedge algorithm also provides a natural way to learn the best response in BR_BC. We choose the learning rate for the Hedge algorithm based on the theoretically optimal value, and choose the regularization parameter according to Jacob et al. (2022). We refer readers to Appendix E for more details.
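As an illustration of how Hedge can learn a best response in BR_BC, here is a simplified sketch on the majority vote game. It assumes full-information feedback with exact expected payoffs against $\pi_{\text{human}}$, and a textbook learning rate $\sqrt{\log A/T}$. Both are assumptions made for illustration and need not match the setup of Appendix E.

```python
import math


def expected_payoffs(pi_human):
    """Agent's expected payoff per action in 3-player MV against two human draws."""
    p0, p1 = pi_human
    split = 0.25 * 2 * p0 * p1                   # humans disagree, agent is in the majority
    return [split - 0.5 * p1 ** 2, split - 0.5 * p0 ** 2]


def hedge(payoffs, rounds, lr):
    """Full-information Hedge started from the uniform policy."""
    cumulative = [0.0] * len(payoffs)
    policy = [1.0 / len(payoffs)] * len(payoffs)
    for _ in range(rounds):
        cumulative = [c + u for c, u in zip(cumulative, payoffs)]
        top = max(cumulative)                    # subtract the max for numerical stability
        weights = [math.exp(lr * (c - top)) for c in cumulative]
        norm = sum(weights)
        policy = [w / norm for w in weights]
    return policy


rounds = 100_000
lr = math.sqrt(math.log(2) / rounds)             # assumed tuning, not the paper's value
br_policy = hedge(expected_payoffs([0.49, 0.51]), rounds, lr)
```

Because the payoff gap between the two actions is only about $10^{-2}$, the policy concentrates on the second action slowly, which hints at why many samples are needed near a basin boundary.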

6.2 Results

To answer Q1 and Q2, we evaluate the utility of the learned strategy against the human population, i.e., $U_{1}(\pi_{{\text{agent}}},\pi_{{\text{human}}}^{\otimes(n-1)})$ , as well as the exploitability of $\pi_{{\text{agent}}}$ , i.e., $\min_{\pi}U_{1}(\pi_{{\text{agent}}},\pi^{\otimes(n-1)})$ . To measure the utility, we evaluate the payoff of the agent’s converged policy by Monte Carlo methods with $3\times 10^{5}$ games, and report the mean and standard deviation of 10 runs. As for the exploitability, we pick the best exploiter policy within 100 runs and report the corresponding utility.

Convergence analysis.

Before comparing the utility and exploitability, we first check the convergence policy for each algorithm:

(1) MV: We notice that BR_BC consistently converges to the policy [0,1], while self-play variants approach different equilibria across different runs, as summarized in Table 1. Therefore, all self-play variants can converge to a suboptimal strategy [1,0] with a non-negligible probability and suffer a negative payoff against the human population policy [0.49,0.51]. Based on this analysis, we report the utility and exploitability of the convergence policy [0,1] for BR_BC, and the suboptimal policy [1,0] for algorithms based on self-play.

(2) SDG: We notice that self-play algorithms consistently converge to the policy [0, 0, 1], while BR_BC converges to the policy [0, 1, 0]. Based on this analysis, we report the utility and exploitability of the convergence policy [0, 0, 1] for algorithms based on self-play, and the policy [0, 1, 0] for BR_BC.

MV	SP_scratch	SP_BC	SP_BC_reg ( $\lambda=10^{-5}/10^{-4}/10^{-3}/10^{-2}$ )
$[1,0]$	52	48	48
$[0,1]$	48	52	52

Table 1: The distribution of the converged policy for self-play algorithms over 100 runs. For SP_BC_reg, we report results with different regularization coefficients $\lambda$.

MV	SP_scratch / SP_BC / SP_BC_reg	BR_BC
Utility ( $\times 10^{-2}$ )	-0.50 $\pm$ 0.05	0.52 $\pm$ 0.05
Exploitability	-0.50 $\pm$ 0.00	-0.50 $\pm$ 0.00

SDG	SP_scratch / SP_BC / SP_BC_reg	BR_BC
Utility	-12.67 $\pm$ 0.01	1.00 $\pm$ 0.00
Exploitability	-29.00 $\pm$ 0.00	-29.00 $\pm$ 0.00

Table 2: The utility and exploitability of each algorithm. Particularly, for MV, as self-play algorithms converge to different solutions, we evaluate the suboptimal one with the negative utility, i.e., [1, 0]. The results show that in two well-designed games, all algorithms except for BR_BC achieve negative payoffs even against non-adaptive humans, and all the learned policies can be exploited.

Utility and Exploitability.

We summarize the results in Table 2, which show that even in these two simple cases, none of the self-play algorithms can consistently secure an equal share in symmetric constant-sum games. This undesirable behavior persists even when humans make no adaptations! Moreover, based on these two games, we identify two potential failure modes of self-play algorithms: (1) For games with multiple NEs, self-play methods based on no-regret algorithms may converge to different NEs depending on the initial policy. When the human policy (the initial policy for SP_BC) lies close to the boundary between the convergence basins of two different NEs, self-play algorithms converge to each of them with non-zero probability due to the statistical randomness in the game, and it is likely that one of the two NEs is highly suboptimal against $\pi_{\text{human}}$. Admittedly, this statistical randomness can be mitigated to some extent by picking a sufficiently small learning rate. However, the closer the initial policy is to the boundary, the smaller the learning rate needs to be, which quickly makes the learning rate too small to converge within a reasonable number of samples in practice. (2) Even for a game with a single NE, a carefully designed game structure can make this NE yield a negative utility against a specific human policy, and hence defeat all self-play variants. These failure modes highlight a significant limitation in the ability of self-play variants to generalize to diverse and complex game scenarios. In contrast, the solution BR_BC indicated by our theory consistently outperforms the human policy and receives a much higher utility. On the exploitability side, all learned policies can be easily exploited by adversarial opponents.

7 Conclusion

This paper takes the initial step towards addressing unique challenges in multiplayer games by investigating symmetric zero-sum normal-form games, which commonly appear in our daily lives. We clarify a number of important solution concepts for multiplayer games, and investigate the behaviors and limitations of many existing meta-algorithms that were deployed by previous state-of-the-art AI systems for games. We hope our results promote further research in principled methodologies and algorithms that lead to general superhuman AI for a wide range of multiplayer games.

Acknowledgement

The authors would like to thank Haifeng Xu for helpful discussions. This work is supported by the Office of Naval Research N00014-22-1-2253.

References

Bai et al., (2022a) Bai, Y., Jin, C., Mei, S., Song, Z., and Yu, T. (2022a). Efficient $\Phi$-regret minimization in extensive-form games via online mirror descent. arXiv preprint arXiv:2205.15294.
Bai et al., (2022b) Bai, Y., Jin, C., Mei, S., and Yu, T. (2022b). Near-optimal learning of extensive-form games with imperfect information. arXiv preprint arXiv:2202.01752.
Bakhtin et al., (2022) Bakhtin, A., Wu, D. J., Lerer, A., Gray, J., Jacob, A. P., Farina, G., Miller, A. H., and Brown, N. (2022). Mastering the game of no-press diplomacy via human-regularized reinforcement learning and planning. arXiv preprint arXiv:2210.05492.
Berner et al., (2019) Berner, C., Brockman, G., Chan, B., Cheung, V., Debiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., et al. (2019). Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680.
Besbes et al., (2014) Besbes, O., Gur, Y., and Zeevi, A. (2014). Stochastic multi-armed-bandit problem with non-stationary rewards. Advances in neural information processing systems, 27.
Brown and Sandholm, (2018) Brown, N. and Sandholm, T. (2018). Superhuman ai for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424.
Brown and Sandholm, (2019) Brown, N. and Sandholm, T. (2019). Superhuman ai for multiplayer poker. Science, 365(6456):885–890.
Campbell et al., (2002) Campbell, M., Hoane Jr, A. J., and Hsu, F.-h. (2002). Deep blue. Artificial intelligence, 134(1-2):57–83.
Celli et al., (2020) Celli, A., Marchesi, A., Farina, G., and Gatti, N. (2020). No-regret learning dynamics for extensive-form correlated equilibrium. Advances in Neural Information Processing Systems, 33:7722–7732.
Cesa-Bianchi and Lugosi, (2006) Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, learning, and games. Cambridge university press.
Chen and Deng, (2005) Chen, X. and Deng, X. (2005). 3-nash is ppad-complete. In Electronic Colloquium on Computational Complexity, volume 134, pages 2–29. Citeseer.
Chen and Peng, (2020) Chen, X. and Peng, B. (2020). Hedging in games: Faster convergence of external and swap regrets. Advances in Neural Information Processing Systems, 33:18990–18999.
Daniely et al., (2015) Daniely, A., Gonen, A., and Shalev-Shwartz, S. (2015). Strongly adaptive online learning. In International Conference on Machine Learning, pages 1405–1411. PMLR.
Daskalakis, (2009) Daskalakis, C. (2009). Nash equilibria: Complexity, symmetries, and approximation. Computer Science Review, 3(2):87–100.
Daskalakis et al., (2021) Daskalakis, C., Fishelson, M., and Golowich, N. (2021). Near-optimal no-regret learning in general games. Advances in Neural Information Processing Systems, 34:27604–27616.
Daskalakis et al., (2009) Daskalakis, C., Goldberg, P. W., and Papadimitriou, C. H. (2009). The complexity of computing a nash equilibrium. SIAM Journal on Computing, 39(1):195–259.
Daskalakis and Panageas, (2018) Daskalakis, C. and Panageas, I. (2018). Last-iterate convergence: Zero-sum games and constrained min-max optimization. arXiv preprint arXiv:1807.04252.
FAIR et al., (2022) Meta Fundamental AI Research Diplomacy Team (FAIR), Bakhtin, A., Brown, N., Dinan, E., Farina, G., Flaherty, C., Fried, D., Goff, A., Gray, J., Hu, H., et al. (2022). Human-level play in the game of diplomacy by combining language models with strategic reasoning. Science, 378(6624):1067–1074.
Gray et al., (2020) Gray, J., Lerer, A., Bakhtin, A., and Brown, N. (2020). Human-level performance in no-press diplomacy via equilibrium search. arXiv preprint arXiv:2010.02923.
Hart and Mas-Colell, (2000) Hart, S. and Mas-Colell, A. (2000). A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150.
Jacob et al., (2022) Jacob, A. P., Wu, D. J., Farina, G., Lerer, A., Hu, H., Bakhtin, A., Andreas, J., and Brown, N. (2022). Modeling strong and human-like gameplay with kl-regularized search. In International Conference on Machine Learning, pages 9695–9728. PMLR.
Jin et al., (2021) Jin, C., Liu, Q., Wang, Y., and Yu, T. (2021). V-learning–a simple, efficient, decentralized algorithm for multiagent rl. arXiv preprint arXiv:2110.14555.
Li et al., (2020) Li, J., Koyamada, S., Ye, Q., Liu, G., Wang, C., Yang, R., Zhao, L., Qin, T., Liu, T.-Y., and Hon, H.-W. (2020). Suphx: Mastering mahjong with deep reinforcement learning. arXiv preprint arXiv:2003.13590.
Light et al., (2023) Light, J., Cai, M., Shen, S., and Hu, Z. (2023). Avalonbench: Evaluating llms playing the game of avalon. In NeurIPS 2023 Foundation Models for Decision Making Workshop.
Moravčík et al., (2017) Moravčík, M., Schmid, M., Burch, N., Lisỳ, V., Morrill, D., Bard, N., Davis, T., Waugh, K., Johanson, M., and Bowling, M. (2017). Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513.
Nash, (1951) Nash, J. (1951). Non-cooperative games. Annals of mathematics, pages 286–295.
Papadimitriou and Roughgarden, (2005) Papadimitriou, C. H. and Roughgarden, T. (2005). Computing equilibria in multi-player games. In SODA, volume 5, pages 82–91. Citeseer.
Silver et al., (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489.
Song et al., (2021) Song, Z., Mei, S., and Bai, Y. (2021). When can we learn general-sum markov games with a large number of players sample-efficiently? arXiv preprint arXiv:2110.04184.
Song et al., (2022) Song, Z., Mei, S., and Bai, Y. (2022). Sample-efficient learning of correlated equilibria in extensive-form games. arXiv preprint arXiv:2205.07223.
Syrgkanis et al., (2015) Syrgkanis, V., Agarwal, A., Luo, H., and Schapire, R. E. (2015). Fast convergence of regularized learning in games. Advances in Neural Information Processing Systems, 28.
Vinyals et al., (2019) Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al. (2019). Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354.
Von Neumann and Morgenstern, (1947) Von Neumann, J. and Morgenstern, O. (1947). Theory of games and economic behavior, 2nd rev.
Wei et al., (2020) Wei, C.-Y., Lee, C.-W., Zhang, M., and Luo, H. (2020). Linear last-iterate convergence in constrained saddle-point optimization. arXiv preprint arXiv:2006.09517.
Ye et al., (2020) Ye, D., Chen, G., Zhang, W., Chen, S., Yuan, B., Liu, B., Chen, J., Liu, Z., Qiu, F., Yu, H., et al. (2020). Towards playing full moba games with deep reinforcement learning. Advances in Neural Information Processing Systems, 33:621–632.
Young, (2004) Young, H. P. (2004). Strategic Learning and its Limits. Oxford University Press.
Zha et al., (2021) Zha, D., Xie, J., Ma, W., Zhang, S., Lian, X., Hu, X., and Liu, J. (2021). Douzero: Mastering doudizhu with self-play deep reinforcement learning. In international conference on machine learning, pages 12333–12344. PMLR.

Appendix A Proofs for Section 4

A.1 $3$ -player majority and minority game

In this paper, we define the 3-player majority game as a symmetric zero-sum game with action space $\mathcal{A}:=\{0,1\}$ and the payoff function given by:

	$\displaystyle U_{1}(0,0,0)=U_{1}(1,1,1)=0$
	$\displaystyle U_{1}(0,1,0)=U_{1}(0,0,1)=U_{1}(1,1,0)=U_{1}(1,0,1)=1$
	$\displaystyle U_{1}(0,1,1)=U_{1}(1,0,0)=-2.$

In other words, players receive a positive payoff if they are part of the majority and a negative payoff if they are in the minority. Correspondingly, we define the 3-player minority game as a symmetric zero-sum game with action space $\mathcal{A}:=\{0,1\}$ and the payoff function given by:

	$\displaystyle U_{1}(0,0,0)=U_{1}(1,1,1)=0$
	$\displaystyle U_{1}(0,1,0)=U_{1}(0,0,1)=U_{1}(1,1,0)=U_{1}(1,0,1)=-1$
	$\displaystyle U_{1}(0,1,1)=U_{1}(1,0,0)=2.$

In other words, players receive a positive payoff if they are part of the minority and a negative payoff if they are in the majority.
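The two payoff tables above can be encoded compactly by counting how many opponents share Player 1's action. The sketch below (function names are hypothetical) also checks the zero-sum property by using symmetry to obtain the other players' payoffs.

```python
from itertools import product


def majority_u1(a1, a2, a3):
    """Payoff of Player 1 in the 3-player majority game defined above."""
    allies = (a1 == a2) + (a1 == a3)      # number of opponents matching Player 1
    return {2: 0, 1: 1, 0: -2}[allies]


def minority_u1(a1, a2, a3):
    """The minority game flips the sign of every payoff."""
    return -majority_u1(a1, a2, a3)


def is_zero_sum(u1):
    """By symmetry, Player i's payoff is u1 with Player i's action listed first."""
    return all(
        u1(a, b, c) + u1(b, c, a) + u1(c, a, b) == 0
        for a, b, c in product((0, 1), repeat=3)
    )
```

For instance, `majority_u1(0, 1, 1)` returns `-2` and `minority_u1(1, 0, 0)` returns `2`, matching the tables.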

A.2 Proof of Proposition 4.3

Proof of Proposition 4.3.

First of all, we show that

$$\max_{x_{1}}\min_{x_{2},\cdots,x_{n}}U_{1}(x_{1},\cdots,x_{n})\leq\min_{x_{2},\cdots,x_{n}}\max_{x_{1}}U_{1}(x_{1},\cdots,x_{n})\leq\min_{x}\max_{x_{1}}U_{1}(x_{1},x^{\otimes n-1})=0,$$

where all the inequalities can be strict.

For the first inequality, note that for any $(x_{1},\ldots,x_{n})$ :

\displaystyle U_{1}(x_{1},\cdots,x_{n})\leq\max_{x_{1}}U_{1}(x_{1},\cdots,x_{n% }),

which implies

\displaystyle\min_{x_{2},\cdots,x_{n}}U_{1}(x_{1},\cdots,x_{n})\leq\min_{x_{2}% ,\cdots,x_{n}}\max_{x_{1}}U_{1}(x_{1},\cdots,x_{n}).

By further taking maximum over $x_{1}\in\Delta(\mathcal{A})$ , we prove that

\displaystyle\max_{x_{1}}\min_{x_{2},\cdots,x_{n}}U_{1}(x_{1},\cdots,x_{n})% \leq\min_{x_{2},\cdots,x_{n}}\max_{x_{1}}U_{1}(x_{1},\cdots,x_{n}).

The second inequality is straightforward, since it restricts the minimization to symmetric opponent profiles. In the sequel, we prove the equality by contradiction. Note that by choosing $x_{1}=x$, we can show that

$$\min_{x}\max_{x_{1}}U_{1}(x_{1},x,\cdots,x)\geq 0.$$

Suppose that for some game this inequality is strict. Then by definition

$$\forall x\in\Delta(\mathcal{A}),\ \exists x^{\prime}\in\Delta(\mathcal{A})\ \text{s.t.}\ U_{1}(x^{\prime},x,\cdots,x)>0.$$

Define the set-valued argmax function $\phi:\Delta({\mathcal{A}})\to 2^{\Delta({\mathcal{A}})}$ :

\displaystyle\phi(x):=\{x^{\prime}\in\Delta({\mathcal{A}})\mid U_{1}(x^{\prime% },x,\cdots,x)=\max_{x^{\prime\prime}}U_{1}(x^{\prime\prime},x,\cdots,x)\}.

We claim that the argmax function $\phi(x)$:

• is always non-empty and convex;
• has a closed graph.

The first property is obvious, so we focus on the second one. Suppose that sequences $\{x_{i}\}$ , $\{y_{i}\}$ satisfy $x_{i}\to x$ , $y_{i}\to y$ and $y_{i}\in\phi(x_{i})$ . Since the payoff function is (Lipschitz) continuous, $\max_{x^{\prime\prime}}U_{1}(x^{\prime\prime},\cdot)$ is continuous by Berge’s maximum theorem. Thus $\max_{x^{\prime\prime}}U_{1}(x^{\prime\prime},x_{i},\cdots,x_{i})$ converges to $\max_{x^{\prime\prime}}U_{1}(x^{\prime\prime},x,\cdots,x)$ . Meanwhile $U_{1}(y_{i},x_{i},\cdots,x_{i})$ converges to $U_{1}(y,x,\cdots,x)$ . Thus

$$U_{1}(y,x,\cdots,x)=\lim_{i\to\infty}U_{1}(y_{i},x_{i},\cdots,x_{i})=\lim_{i\to\infty}\max_{x^{\prime\prime}}U_{1}(x^{\prime\prime},x_{i},\cdots,x_{i})=\max_{x^{\prime\prime}}U_{1}(x^{\prime\prime},x,\cdots,x).$$

This implies $y\in\phi(x)$ , and that $\phi$ has a closed graph. Thus by Kakutani’s fixed point theorem, $\exists x^{*}:x^{*}\in\phi(x^{*})$ . Now we have

$$U_{1}(x^{*},\cdots,x^{*})=\max_{x^{\prime\prime}}U_{1}(x^{\prime\prime},x^{*},\cdots,x^{*})>0,$$

which contradicts the fact that $U_{1}(x^{*},\cdots,x^{*})=0$ in any symmetric zero-sum game. Consequently, the equality holds.

In Claim 4.1, we have proved that the second inequality can be strict. To show the first inequality can be strict, we consider the $3$ -player majority vote. Suppose $3$ players adopt the mixed strategies $(\alpha_{1},1-\alpha_{1})$ , $(\alpha_{2},1-\alpha_{2})$ and $(\alpha_{3},1-\alpha_{3})$ , respectively. It then holds that

$$\begin{aligned}
U_{1}(x_{1},x_{2},x_{3})&=U_{1}(\alpha_{1},\alpha_{2},\alpha_{3})\\
&=\alpha_{1}\left(-2(1-\alpha_{2})(1-\alpha_{3})+\alpha_{2}(1-\alpha_{3})+\alpha_{3}(1-\alpha_{2})\right)\\
&\quad+(1-\alpha_{1})\left(-2\alpha_{2}\alpha_{3}+\alpha_{2}(1-\alpha_{3})+\alpha_{3}(1-\alpha_{2})\right).
\end{aligned}$$

By choosing $\alpha_{2}=\alpha_{3}=0$ when $\alpha_{1}>1/2$ and $\alpha_{2}=\alpha_{3}=1$ when $\alpha_{1}\leq 1/2$ , it can be seen that

\displaystyle\max_{\alpha_{1}}\min_{\alpha_{2},\alpha_{3}}U_{1}(\alpha_{1},% \alpha_{2},\alpha_{3})\leq\max_{\alpha_{1}}\min\{-2\alpha_{1},-2(1-\alpha_{1})% \}=-1.

Note that

$$\begin{aligned}
&\min_{\alpha_{2},\alpha_{3}}\max_{\alpha_{1}}U_{1}(\alpha_{1},\alpha_{2},\alpha_{3})\\
&=\min_{\alpha_{2},\alpha_{3}}\max\{-2(1-\alpha_{2})(1-\alpha_{3})+\alpha_{2}(1-\alpha_{3})+\alpha_{3}(1-\alpha_{2}),\,-2\alpha_{2}\alpha_{3}+\alpha_{2}(1-\alpha_{3})+\alpha_{3}(1-\alpha_{2})\}\\
&=\min_{\alpha_{2},\alpha_{3}}\max\{3(\alpha_{2}+\alpha_{3})-4\alpha_{2}\alpha_{3}-2,\,\alpha_{2}+\alpha_{3}-4\alpha_{2}\alpha_{3}\}\\
&=0.
\end{aligned}$$

Thus, we show that $\max_{\alpha_{1}}\min_{\alpha_{2},\alpha_{3}}U_{1}(\alpha_{1},\alpha_{2},% \alpha_{3})<\min_{\alpha_{2},\alpha_{3}}\max_{\alpha_{1}}U_{1}(\alpha_{1},% \alpha_{2},\alpha_{3})$ .
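The strict gap can also be checked numerically. The sketch below evaluates the expected payoff on a grid, where $\alpha_{i}$ is the probability that player $i$ plays action 0. Since the payoff is linear in each $\alpha_{i}$ separately, the optima over $(\alpha_{2},\alpha_{3})$ and over $\alpha_{1}$ are attained at grid points that include the endpoints and $1/2$. The grid resolution is an arbitrary choice.

```python
def u1(a1, a2, a3):
    """Expected payoff of Player 1 in the 3-player majority game."""
    mixed = a2 * (1 - a3) + a3 * (1 - a2)        # opponents choose different actions
    return (a1 * (-2 * (1 - a2) * (1 - a3) + mixed)
            + (1 - a1) * (-2 * a2 * a3 + mixed))


grid = [i / 20 for i in range(21)]               # contains 0, 1/2, and 1

# Player 1 commits first and the opponents respond adversarially.
maxmin = max(min(u1(a1, a2, a3) for a2 in grid for a3 in grid) for a1 in grid)

# The opponents commit first and Player 1 best-responds.
minmax = min(max(u1(a1, a2, a3) for a1 in grid) for a2 in grid for a3 in grid)
```

On this grid the two values come out as $-1$ and $0$ up to floating-point error, in agreement with the derivation above.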

We then show that

\displaystyle\max_{x_{1}\in\Delta(\mathcal{A})}\min_{x\in\Delta(\mathcal{A})}U% _{1}(x_{1},x^{\otimes n-1})\leq\min_{x\in\Delta(\mathcal{A})}\max_{x_{1}\in% \Delta(\mathcal{A})}U_{1}(x_{1},x^{\otimes n-1})=0,

where the inequality can be strict.

Note that for any $x_{1},x\in\Delta(\mathcal{A})$ , we have

\displaystyle U_{1}(x_{1},x^{\otimes n-1})\leq\max_{x_{1}\in\Delta(\mathcal{A}% )}U_{1}(x_{1},x^{\otimes n-1}),

which implies for any $x_{1}\in\Delta(\mathcal{A})$

\displaystyle\min_{x\in\Delta(\mathcal{A})}U_{1}(x_{1},x^{\otimes n-1})\leq% \min_{x\in\Delta(\mathcal{A})}\max_{x_{1}\in\Delta(\mathcal{A})}U_{1}(x_{1},x^% {\otimes n-1}).

By further taking maximum over $x_{1}\in\Delta(\mathcal{A})$ , we show that

\displaystyle\max_{x_{1}\in\Delta(\mathcal{A})}\min_{x\in\Delta(\mathcal{A})}U% _{1}(x_{1},x^{\otimes n-1})\leq\min_{x\in\Delta(\mathcal{A})}\max_{x_{1}\in% \Delta(\mathcal{A})}U_{1}(x_{1},x^{\otimes n-1}).

In Claim 4.2, we have proved that the inequality can be strict. ∎

A.3 Proof of Proposition 4.4

Proof of Proposition 4.4.

Let $\mathbb{P}^{w/o}(i_{1},\ldots,i_{n-1})$ denote the probability of observing $(i_{1},\ldots,i_{n-1})$ when sampling $n-1$ points from $N$ without replacement, and let $\mathbb{P}^{w}(i_{1},\ldots,i_{n-1})$ denote the probability of observing $(i_{1},\ldots,i_{n-1})$ when sampling $n-1$ points from $N$ with replacement. For any $a$ , we then have

	$\displaystyle\mathbb{E}_{x_{-1}}\left[U_{1}(a,x_{-1})\right]=\sum_{(i_{1},% \ldots,i_{n-1})}\mathbb{P}^{w/o}(i_{1},\ldots,i_{n-1})U_{1}(a,x_{i_{1}},\ldots% ,x_{i_{n-1}})$
	$\displaystyle U_{1}(a,\bar{x}^{\otimes n-1})=\sum_{(i_{1},\ldots,i_{n-1})}% \mathbb{P}^{w}(i_{1},\ldots,i_{n-1})U_{1}(a,x_{i_{1}},\ldots,x_{i_{n-1}}).$

Note that $\|U_{1}\|_{\infty}\leq 1$ . Thus, we have

$$\begin{aligned}
&\left|\mathbb{E}_{x_{-1}}\left[U_{1}(a,x_{-1})\right]-U_{1}(a,\bar{x}^{\otimes n-1})\right|\\
&\leq\sum_{(i_{1},\ldots,i_{n-1})}\left|\mathbb{P}^{w/o}(i_{1},\ldots,i_{n-1})-\mathbb{P}^{w}(i_{1},\ldots,i_{n-1})\right|\\
&=\sum_{(i_{1},\ldots,i_{n-1})\text{ has repeated value}}\left(\mathbb{P}^{w}(i_{1},\ldots,i_{n-1})-\mathbb{P}^{w/o}(i_{1},\ldots,i_{n-1})\right)\\
&\quad+\sum_{(i_{1},\ldots,i_{n-1})\text{ no repeated value}}\left(\mathbb{P}^{w/o}(i_{1},\ldots,i_{n-1})-\mathbb{P}^{w}(i_{1},\ldots,i_{n-1})\right)\\
&=2\sum_{(i_{1},\ldots,i_{n-1})\text{ has repeated value}}\left(\mathbb{P}^{w}(i_{1},\ldots,i_{n-1})-\mathbb{P}^{w/o}(i_{1},\ldots,i_{n-1})\right)\\
&=2\left(1-\frac{N(N-1)\cdots(N-n+2)}{N^{n-1}}\right)\\
&=2\left(1-\left(1-\frac{1}{N}\right)\left(1-\frac{2}{N}\right)\cdots\left(1-\frac{n-2}{N}\right)\right)\\
&\leq 2\left(1-\left(1-\frac{n-2}{N}\right)^{n-2}\right)\\
&\leq\frac{2(n-2)^{2}}{N}.
\end{aligned}$$

∎
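The last three steps of the chain can be sanity-checked numerically. This sketch computes the exact gap $2\left(1-N(N-1)\cdots(N-n+2)/N^{n-1}\right)$ and compares it with the final bound; the function names and the tested ranges of $N$ and $n$ are arbitrary choices for illustration.

```python
from math import prod


def sampling_gap(N, n):
    """2 * P(some index repeats) when drawing n - 1 indices from N with replacement."""
    no_repeat = prod((N - k) / N for k in range(n - 1))   # N(N-1)...(N-n+2) / N^(n-1)
    return 2 * (1 - no_repeat)


def final_bound(N, n):
    """The closing bound 2 (n - 2)^2 / N of the proof."""
    return 2 * (n - 2) ** 2 / N


checks = [
    (N, n, sampling_gap(N, n), final_bound(N, n))
    for N in (10, 100, 1000)
    for n in range(3, 8)
]
```

For $n=3$ the two quantities coincide at $2/N$, so the bound is tight in that case.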

Appendix B Proofs for Section 5.1

B.1 Proof of Theorem 5.1

Proof of Theorem 5.1.

Let $a^{\star}\in{\arg\max}_{a\in\mathcal{A}}U_{1}(a,y^{\otimes n-1})$. We then have

$$\begin{aligned}
u^{\star}-\frac{1}{T}\sum^{T}_{t=1}u^{t}(x^{t})&=U_{1}(a^{\star},y^{\otimes n-1})-\frac{1}{T}\sum^{T}_{t=1}u^{t}(x^{t})\\
&=\underbrace{U_{1}(a^{\star},y^{\otimes n-1})-\frac{1}{T}\sum^{T}_{t=1}U_{1}(a^{\star},a^{t}_{-1})}_{\rm(i)}+\underbrace{\frac{1}{T}\sum^{T}_{t=1}U_{1}(a^{\star},a^{t}_{-1})-\frac{1}{T}\sum^{T}_{t=1}U_{1}(x^{t},a^{t}_{-1})}_{\rm(ii)}\\
&\quad+\underbrace{\frac{1}{T}\sum^{T}_{t=1}U_{1}(x^{t},a^{t}_{-1})-\frac{1}{T}\sum^{T}_{t=1}u^{t}(x^{t})}_{\rm(iii)}.
\end{aligned}$$

For (i), by Hoeffding’s inequality and union bound, we have with probability at least $1-\delta$ that

\displaystyle{\rm(i)}\leq O(\sqrt{\frac{\log(A/\delta)}{T}})

For (ii), by the regret guarantee of the Hedge algorithm, we have

\displaystyle{\rm(ii)}\leq O(\sqrt{\frac{\log(A)}{T}})

For (iii), note that $\{U_{1}(x^{t},a^{t}_{-1})-u^{t}(x^{t})\}^{T}_{t=1}$ is a martingale difference sequence, thus by Azuma–Hoeffding inequality, we have with probability at least $1-\delta$ that

\displaystyle{\rm(iii)}\leq O(\sqrt{\frac{\log(1/\delta)}{T}}).

Combining the above results, we have

\displaystyle u^{\star}-\frac{1}{T}\sum^{T}_{t=1}u^{t}(x^{t})\leq C\sqrt{\frac% {\log(A/\delta)}{T}}

for some absolute constant $C>0$ . Thus we finish the proof. ∎

Appendix C Proofs for Section 5.2

C.1 Proof of Fact 5.2

Proof of Fact 5.2.

The proof of Fact 5.2 follows directly from the construction in Theorem 5.6 with $n=3$ , $\mathcal{A}=\{0,1\}$ and $\Delta_{T}=1$ . ∎

C.2 Proof of Theorem 5.3

Proof of Theorem 5.3.

The basic idea behind $\mathrm{SAOL}^{\mathcal{H}}$ is to execute $\mathcal{H}$ in parallel over each interval within a carefully selected set. This algorithm dynamically adjusts the weight of each interval based on the previously observed regret. In each round, $\mathrm{SAOL}^{\mathcal{H}}$ selects an interval in proportion to its assigned weight, applies $\mathcal{H}$ to each time slot within this interval, and follows its advice. Through this mechanism, $\mathrm{SAOL}^{\mathcal{H}}$ achieves a near-optimal performance on every time interval. We will leverage the strong adaptivity of $\mathrm{SAOL}^{\mathcal{H}}$ in our proofs.

Let $\mathcal{I}$ be any fixed interval in $[0,T]$ , $a_{0}\in{\arg\max}_{a\in\mathcal{A}}\left\{\sum_{t\in\mathcal{I}}u^{t}(a)\right\}$ and $u^{t,\star}:=\max_{a\in\mathcal{A}}u^{t}(a)$ . It holds that

$$\begin{aligned}
\sum_{t\in\mathcal{I}}\left(u^{t,\star}-u^{t}(x^{t})\right)&=\underbrace{\sum_{t\in\mathcal{I}}\left(u^{t,\star}-u^{t}(a_{0})\right)}_{\rm(i)}+\sum_{t\in\mathcal{I}}\left(u^{t}(a_{0})-U_{1}(a_{0},a^{t}_{-1})\right)\\
&\quad+\underbrace{\sum_{t\in\mathcal{I}}\left(U_{1}(a_{0},a^{t}_{-1})-U_{1}(x^{t},a^{t}_{-1})\right)}_{\rm(ii)}+\sum_{t\in\mathcal{I}}\left(U_{1}(x^{t},a^{t}_{-1})-u^{t}(x^{t})\right).
\end{aligned}$$

For (i), it can be seen that

\displaystyle{\rm(i)}=\sum_{t\in\mathcal{I}}\left(u^{t,\star}-u^{t}(a_{0})% \right)\leq|\mathcal{I}|\max_{t\in\mathcal{I}}\left\{u^{t,\star}-u^{t}(a_{0})% \right\}\leq 2V_{\mathcal{I}}|\mathcal{I}|.

Here $V_{\mathcal{I}}$ denotes the variation of the payoff function within $\mathcal{I}$, and the last inequality follows by contradiction. Otherwise, there exists $t_{0}\in\mathcal{I}$ such that $u^{t_{0},\star}-u^{t_{0}}(a_{0})>2V_{\mathcal{I}}$. Let $a_{1}\in{\arg\max}_{a\in\mathcal{A}}u^{t_{0}}(a)$. For all $t\in\mathcal{I}$, it then holds that $u^{t}(a_{1})\geq u^{t_{0}}(a_{1})-V_{\mathcal{I}}=u^{t_{0},\star}-V_{\mathcal{I}}>u^{t_{0}}(a_{0})+V_{\mathcal{I}}\geq u^{t}(a_{0})$. Summing over $t\in\mathcal{I}$ gives $\sum_{t\in\mathcal{I}}u^{t}(a_{1})>\sum_{t\in\mathcal{I}}u^{t}(a_{0})$, which contradicts the definition of $a_{0}$.

For (ii), we have

\displaystyle{\rm(ii)}\leq\max_{a\in\mathcal{A}}\sum_{t\in\mathcal{I}}\left(U_% {1}(a,a^{t}_{-1})-U_{1}(x^{t},a^{t}_{-1})\right)\leq C(\sqrt{\log A}+\log T)% \sqrt{|\mathcal{I}|},

where the last inequality follows from Theorem 1 of Daniely et al. (2015).

Combining the upper bound of (i) and (ii), we have for any fixed interval $\mathcal{I}\subset[0,T]$ ,

	$\displaystyle\sum_{t\in\mathcal{I}}\left(u^{t,\star}-u^{t}(x^{t})\right)$
	$\displaystyle\leq 2V_{\mathcal{I}}\|\mathcal{I}\|+\sum_{t\in\mathcal{I}}\left(u^% {t}(a_{0})-U_{1}(a_{0},a^{t}_{-1})\right)+C(\sqrt{\log A}+\log T)\sqrt{\|% \mathcal{I}\|}+\sum_{t\in\mathcal{I}}\left(U_{1}(x^{t},a^{t}_{-1})-u^{t}(x^{t})% \right).$

We segment the time horizon $T$ into $T/|\mathcal{I}|$ batches $\{\mathcal{I}_{j}\}$ with each length $|\mathcal{I}|$ . It then holds for all $j$ that

	$\displaystyle\sum_{t\in\mathcal{I}_{j}}\left(u^{t,\star}-u^{t}(x^{t})\right)$
	$\displaystyle\leq 2V_{\mathcal{I}_{j}}\|\mathcal{I}\|+\sum_{t\in\mathcal{I}_{j}}% \left(u^{t}(a_{0})-U_{1}(a_{0},a^{t}_{-1})\right)+C(\sqrt{\log A}+\log T)\sqrt% {\|\mathcal{I}\|}+\sum_{t\in\mathcal{I}_{j}}\left(U_{1}(x^{t},a^{t}_{-1})-u^{t}(% x^{t})\right).$

Summing over $j$ gives

	$\displaystyle{\text{D-Reg}(T)}$
	$\displaystyle\leq 2V_{T}|\mathcal{I}|+\underbrace{\sum^{T}_{t=1}\left(u^{t}(a_{0})-U_{1}(a_{0},a^{t}_{-1})\right)}_{\rm(iii)}+C(T/\sqrt{|\mathcal{I}|})\cdot(\sqrt{\log A}+\log T)+\underbrace{\sum^{T}_{t=1}\left(U_{1}(x^{t},a^{t}_{-1})-u^{t}(x^{t})\right)}_{\rm(iv)}.$

For (iii), since $\{u^{t}(a)-U_{1}(a,a^{t}_{-1})\}^{T}_{t=1}$ is a martingale difference sequence for each fixed $a\in\mathcal{A}$, we have with probability at least $1-\delta$ that

\displaystyle{\rm(iii)}\leq\max_{a\in\mathcal{A}}\sum^{T}_{t=1}\left(u^{t}(a)-% U_{1}(a,a^{t}_{-1})\right)\leq O\left(\sqrt{T\log(A/\delta)}\right),

where the last inequality follows from the Azuma–Hoeffding inequality and a union bound over $a\in\mathcal{A}$.

For (iv), since $\{U_{1}(x^{t},a^{t}_{-1})-u^{t}(x^{t})\}^{T}_{t=1}$ is a martingale difference sequence, the Azuma–Hoeffding inequality gives, with probability at least $1-\delta$,

\displaystyle{\rm(iv)}\leq O\left(\sqrt{T\log(1/\delta)}\right).

Consequently we have with probability at least $1-\delta$ that

\displaystyle{\text{D-Reg}(T)}\leq 2V_{T}|\mathcal{I}|+C(T/\sqrt{|\mathcal{I}|% })\cdot(\sqrt{\log A}+\log T)+O\left(\sqrt{T\log(A/\delta)}\right).

Choosing $|\mathcal{I}|=(T/V_{T})^{2/3}$ , we have with probability at least $1-\delta$ that

\displaystyle{\text{D-Reg}(T)}\leq O\left(V^{1/3}_{T}T^{2/3}(\sqrt{\log(A/% \delta)}+\log T)\right).

∎

C.3 Proof of Theorem 5.4

Algorithm 1 Behavior Cloning

1: In the first round, play $a\sim\mathrm{Uniform}(\mathcal{A})$

2: for $t=2,\ldots,T$ do

3: Play $a^{t-1}_{2}$, i.e., the action played by Player 2 in the last round.

4: end for
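The rule in Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `behavior_cloning`, the list-based interface (Player 2's realized actions are passed in as `opponent_actions`), and the fixed random seed are all hypothetical choices made for the example.

```python
import random


def behavior_cloning(opponent_actions, num_actions, rng=None):
    """Return the learner's actions for rounds 1..T.

    opponent_actions[t] is the action Player 2 played in round t + 1
    (0-indexed list). Round 1 is uniform over the action space; every
    later round copies Player 2's action from the previous round.
    """
    if not opponent_actions:
        return []
    rng = rng or random.Random(0)  # fixed seed: an assumption for reproducibility
    plays = [rng.randrange(num_actions)]  # step 1: a ~ Uniform(A)
    for t in range(1, len(opponent_actions)):
        plays.append(opponent_actions[t - 1])  # step 3: play a_2^{t-1}
    return plays


if __name__ == "__main__":
    print(behavior_cloning([2, 0, 1, 1], num_actions=3))
```

Only the first entry of the returned list is random; all later entries are a one-round-delayed copy of Player 2's sequence.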

Proof of Theorem 5.4.

Note that

$\displaystyle\mathbb{E}\left[\sum^{T}_{t=1}u^{t}(x^{t})\right]$	$\displaystyle\geq-1+\mathbb{E}\left[\sum^{T}_{t=2}u^{t}(x^{t})\right]$
	$\displaystyle=-1+\mathbb{E}\left[\sum^{T}_{t=2}U_{1}(a^{t-1}_{2},(y^{t})^{% \otimes n-1})\right]$
	$\displaystyle\geq-1-V_{T}+\mathbb{E}\left[\sum^{T}_{t=2}U_{1}(a^{t-1}_{2},(y^{t-1})^{\otimes n-1})\right]$	(by the definition of $V_{T}$)
	$\displaystyle=-1-V_{T}+\mathbb{E}\left[\sum^{T}_{t=2}U_{1}(y^{t-1},(y^{t-1})^{\otimes n-1})\right]$	(since $a^{t-1}_{2}\sim y^{t-1}$)
	$\displaystyle=-1-V_{T}$	(since the game is symmetric and zero-sum)

Thus we finish the proof. ∎

Appendix D Fundamental limits

Upon examining Theorem 5.3 alongside Theorem 5.4, it becomes apparent that Theorem 5.3 benchmarks against a more stringent standard (i.e., the dynamic oracle) and incurs a larger regret of $V^{1/3}_{T}T^{2/3}$ , while Theorem 5.4 sets its comparison against a baseline metric (i.e., the average payoff) and attains a smaller regret of $V_{T}$ . Regarding this observation, one might aspire to devise an algorithm whose payoff satisfies: $\sum^{T}_{t=1}u^{t}(x^{t})\geq\sum_{t=1}^{T}\max_{a\in\mathcal{A}}u^{t}(a)-% \tilde{O}(V_{T}).$ However, Theorem 5.5 and Theorem 5.6 demonstrate that such a goal is unattainable, by exploring the fundamental limits faced when competing against non-stationary opponents.

Theorem 5.5 shows that, when contending with non-stationary opponents, any algorithm must incur a dynamic regret of order at least $V^{1/3}_{T}T^{2/3}$, which rules out the better $V_{T}$ rate. A similar lower bound for dynamic regret has already been established under broader conditions by Besbes et al., (2014). The distinction of Theorem 5.5 lies in further restricting the hard instances to symmetric games, implying that the structure of symmetric games offers no advantage in improving dynamic regret in the worst case. Comparing this lower bound with Theorem 5.3 shows that $\mathrm{SAOL}^{\mathcal{H}}$ is minimax optimal up to logarithmic factors.

Theorem 5.6 establishes the fundamental limit when comparing against the average payoff $0$. The guarantee of Theorem 5.4 cannot be improved in the worst case, so behavior cloning is optimal up to a constant factor.

D.1 Proof of Theorem 5.5

Proof of Theorem 5.5.

We define

\displaystyle U^{(3)}_{1}(a,b,c):=\begin{cases}\text{payoff for $3$-player majority game}&\text{if }a,b,c\in\{0,1\}\\ -2&\text{if }a\notin\{0,1\},b,c\in\{0,1\}\\ \text{defined by symmetry}&\text{o.w.}\end{cases}

which is essentially the payoff function of the $3$-player majority game with extra dummy actions. We then define

\displaystyle U^{(n)}_{1}(a,a_{2},\ldots,a_{n}):=\frac{1}{(n-1)(n-2)}\sum_{2% \leq i\neq j\leq n}U_{1}^{(3)}(a,a_{i},a_{j}).

We consider a game that evolves stochastically, with $n$ players, action space $\mathcal{A}=\{0,1,\ldots,A-1\}$ , and the payoff function of the first player given by $U_{1}^{(n)}$ . We segment the decision horizon $T$ into $T/\Delta_{T}$ batches $\{\mathcal{T}_{j}\}$ , with each batch comprising $\Delta_{T}$ episodes. We consider two distinct scenarios:

• Case 1: All the other players employ the mixed strategy $(1/2-\epsilon,1/2+\epsilon)$ (i.e., play 0 with probability $1/2-\epsilon$ and 1 with probability $1/2+\epsilon$);

• Case 2: All the other players employ the mixed strategy $(1/2+\epsilon,1/2-\epsilon)$ (i.e., play 0 with probability $1/2+\epsilon$ and 1 with probability $1/2-\epsilon$).

At the beginning of each batch, one of these scenarios is randomly selected (with equal probability) and remains constant throughout that batch.
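To make the construction concrete, the following Python sketch evaluates the payoff functions and the expected payoffs in the two cases. It is an illustration under stated assumptions: the numeric values of the $3$-player majority game (0 when all three agree, 1 for siding with the majority of a split, $-2$ for being the lone minority) are read off from the payoff vectors listed in the proof of Lemma D.1, the branch "defined by symmetry" is deliberately left unimplemented because only opponents in $\{0,1\}$ are needed here, and the names `u3`, `u_n`, and `expected_payoff` are hypothetical.

```python
from itertools import permutations, product


def u3(a, b, c):
    """Player 1's payoff U_1^(3)(a, b, c) when both opponents play in {0, 1}."""
    if b not in (0, 1) or c not in (0, 1):
        raise NotImplementedError("remaining cases are defined by symmetry")
    if a not in (0, 1):
        return -2.0  # dummy action
    if b == c:
        return 0.0 if a == b else -2.0  # unanimous, or lone minority
    return 1.0  # opponents split, so Player 1 is in the majority


def u_n(a, others):
    """U_1^(n): average of u3 over all ordered pairs of distinct opponents."""
    pairs = list(permutations(others, 2))  # (n-1)(n-2) ordered pairs
    return sum(u3(a, b, c) for b, c in pairs) / len(pairs)


def expected_payoff(a, p1):
    """Expected payoff of action a when opponents i.i.d. play 1 with prob. p1.

    With i.i.d. opponents every pair has the same distribution, so the
    n-player expectation equals the 3-player one.
    """
    prob = {0: 1.0 - p1, 1: p1}
    return sum(prob[b] * prob[c] * u3(a, b, c)
               for b, c in product((0, 1), repeat=2))


if __name__ == "__main__":
    eps = 0.1
    print(expected_payoff(1, 0.5 + eps))  # Case 1, optimal action
    print(expected_payoff(0, 0.5 + eps))  # Case 1, suboptimal action
```

In Case 1 the two values agree with $2\epsilon-4\epsilon^{2}$ and $-2\epsilon-4\epsilon^{2}$, the expressions used later in the proof; Case 2 is obtained by swapping the roles of actions 0 and 1.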

Let $m=T/\Delta_{T}$ denote the total number of batches. We fix an algorithm and a batch $j\in\{1,\ldots,m\}$. Let $\delta_{j}\in\{1,2\}$ indicate whether batch $j$ belongs to Case 1 or Case 2. We denote by $\mathbb{P}^{j}_{\delta_{j}}$ the probability distribution conditioned on batch $j$ belonging to Case $\delta_{j}$, and by $\mathbb{P}_{0}$ the probability distribution when all the other players employ the mixed strategy $(1/2,1/2)$. We further denote by $\mathbb{E}^{j}_{\delta_{j}}[\cdot]$ and $\mathbb{E}_{0}[\cdot]$ the corresponding expectations, and by $N^{j}_{a}$ the number of times action $a$ is played in batch $j$. If batch $j$ belongs to Case $\delta_{j}$, then the optimal action in the batch is $-\delta_{j}+2$ (action 1 in Case 1 and action 0 in Case 2). We first present a useful lemma.

Lemma D.1.

Let $f:\{-2,0,1\}^{|\mathcal{T}_{j}|\times A}\rightarrow[0,M]$ be any bounded real function defined on the payoff matrices $R$ . Then, for any $\delta_{j}\in\{1,2\}$ , ${\epsilon}\leq 1/4$ :

\displaystyle\mathbb{E}^{j}_{\delta_{j}}[f(R)]-\mathbb{E}_{0}[f(R)]\leq\frac{M% }{2}\sqrt{-2|\mathcal{T}_{j}|\ln\left(1-4{\epsilon}^{2}\right)}\leq 2M{% \epsilon}\sqrt{\Delta_{T}}.

By Lemma D.1 with $f=N^{j}_{-\delta_{j}+2}$ , we have

\displaystyle\mathbb{E}^{j}_{\delta_{j}}[N^{j}_{-\delta_{j}+2}]-\mathbb{E}_{0}% [N^{j}_{-\delta_{j}+2}]\leq 2{\epsilon}|\mathcal{T}_{j}|\sqrt{\Delta_{T}}.

(3)

Note that

	$\displaystyle\mathbb{E}^{j}_{\delta_{j}}[u^{t}(x^{t})]$	$\displaystyle=-2\mathbb{P}^{j}_{\delta_{j}}(x^{t}\notin\{0,1\})+(-2{\epsilon}-% 4{\epsilon}^{2})\mathbb{P}^{j}_{\delta_{j}}(x^{t}=\delta_{j}-1)+(2{\epsilon}-4% {\epsilon}^{2})\mathbb{P}^{j}_{\delta_{j}}(x^{t}=-\delta_{j}+2)$
		$\displaystyle\leq(-2{\epsilon}-4{\epsilon}^{2})\mathbb{P}^{j}_{\delta_{j}}(x^{% t}\neq-\delta_{j}+2)+(2{\epsilon}-4{\epsilon}^{2})\mathbb{P}^{j}_{\delta_{j}}(% x^{t}=-\delta_{j}+2)$
		$\displaystyle=-2{\epsilon}-4{\epsilon}^{2}+4{\epsilon}\cdot\mathbb{P}^{j}_{% \delta_{j}}(x^{t}=-\delta_{j}+2),$

therefore,

	$\displaystyle\mathbb{E}^{j}_{\delta_{j}}\left[\sum_{t\in\mathcal{T}_{j}}u^{t}(x^{t})\right]$	$\displaystyle\leq(-2{\epsilon}-4{\epsilon}^{2})|\mathcal{T}_{j}|+4{\epsilon}\cdot\mathbb{E}^{j}_{\delta_{j}}[N^{j}_{-\delta_{j}+2}]$
		$\displaystyle\leq(-2{\epsilon}-4{\epsilon}^{2})|\mathcal{T}_{j}|+4{\epsilon}\cdot\mathbb{E}_{0}[N^{j}_{-\delta_{j}+2}]+8{\epsilon}^{2}|\mathcal{T}_{j}|\sqrt{\Delta_{T}}.$		(by (3))

Consequently, we have

\displaystyle\frac{1}{2}\mathbb{E}^{j}_{1}\left[\sum_{t\in\mathcal{T}_{j}}u^{t% }(x^{t})\right]+\frac{1}{2}\mathbb{E}^{j}_{2}\left[\sum_{t\in\mathcal{T}_{j}}u% ^{t}(x^{t})\right]\leq(-2{\epsilon}-4{\epsilon}^{2})|\mathcal{T}_{j}|+2{% \epsilon}|\mathcal{T}_{j}|+8{\epsilon}^{2}|\mathcal{T}_{j}|\sqrt{\Delta_{T}}

(4)

It then holds that

	$\displaystyle\mathbb{E}_{{\text{Alg}}}\left[\sum^{m}_{j=1}\sum_{t\in\mathcal{T% }_{j}}u^{t}(x^{t})\right]$	$\displaystyle=\sum^{m}_{j=1}\mathbb{E}_{{\text{Alg}}}\left[\sum_{t\in\mathcal{% T}_{j}}u^{t}(x^{t})\right]$
		$\displaystyle=\sum^{m}_{j=1}\mathbb{E}_{{\text{Alg}}}\left[\frac{1}{2}\mathbb{% E}^{j}_{1}\left[\sum_{t\in\mathcal{T}_{j}}u^{t}(x^{t})\right]+\frac{1}{2}% \mathbb{E}^{j}_{2}\left[\sum_{t\in\mathcal{T}_{j}}u^{t}(x^{t})\right]\right]$
		$\displaystyle\leq\sum^{m}_{j=1}((-2{\epsilon}-4{\epsilon}^{2})|\mathcal{T}_{j}|+2{\epsilon}|\mathcal{T}_{j}|+8{\epsilon}^{2}|\mathcal{T}_{j}|\sqrt{\Delta_{T}})$
		$\displaystyle=-4{\epsilon}^{2}T+8{\epsilon}^{2}T\sqrt{\Delta_{T}}.$

Set ${\epsilon}=\min\{1/(8\sqrt{\Delta_{T}}),V_{T}\Delta_{T}/T\}$ . We then have

	$\displaystyle\mathbb{E}_{{\text{Alg}}}[{\text{D-Reg}(T)}]$	$\displaystyle=(2{\epsilon}-4{\epsilon}^{2})T-\mathbb{E}_{{\text{Alg}}}\left[% \sum^{T}_{t=1}u^{t}(x^{t})\right]$
		$\displaystyle\geq(2{\epsilon}-4{\epsilon}^{2})T-(-4{\epsilon}^{2}T+8{\epsilon}% ^{2}T\sqrt{\Delta_{T}})$
		$\displaystyle=2{\epsilon}T-8{\epsilon}^{2}T\sqrt{\Delta_{T}}$
		$\displaystyle=2{\epsilon}T(1-4{\epsilon}\sqrt{\Delta_{T}})$
		$\displaystyle\geq{\epsilon}T$
		$\displaystyle=\min\left\{\frac{1}{8\sqrt{\Delta_{T}}},\frac{V_{T}\Delta_{T}}{T% }\right\}T.$

Choosing $\Delta_{T}=(T/V_{T})^{2/3}$ , we then have

\displaystyle\mathbb{E}_{{\text{Alg}}}[{\text{D-Reg}(T)}]\geq CV_{T}^{1/3}T^{2% /3},

which completes the proof. ∎
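The parameter choices at the end of the proof can be checked numerically. The sketch below is only an arithmetic illustration: the function name `lower_bound_parameters` and the sample values of $T$ and $V_{T}$ are assumptions made for the example. It plugs $\Delta_{T}=(T/V_{T})^{2/3}$ and $\epsilon=\min\{1/(8\sqrt{\Delta_{T}}),V_{T}\Delta_{T}/T\}$ into the expression $2\epsilon T(1-4\epsilon\sqrt{\Delta_{T}})$.

```python
import math


def lower_bound_parameters(T, V_T):
    """Return (Delta_T, epsilon, regret lower bound) for the hard instance."""
    delta = (T / V_T) ** (2.0 / 3.0)                       # batch length
    eps = min(1.0 / (8.0 * math.sqrt(delta)), V_T * delta / T)
    # Expression from the proof: 2 eps T (1 - 4 eps sqrt(Delta_T)) >= eps T,
    # because eps <= 1 / (8 sqrt(Delta_T)).
    regret = 2.0 * eps * T * (1.0 - 4.0 * eps * math.sqrt(delta))
    return delta, eps, regret


if __name__ == "__main__":
    T, V_T = 10 ** 6, 8
    delta, eps, regret = lower_bound_parameters(T, V_T)
    print(delta, eps, regret, V_T ** (1 / 3) * T ** (2 / 3) / 8)
```

For these sample values the first term of the minimum is active, and the bound equals $V_{T}^{1/3}T^{2/3}/8$, so in this regime the constant in the final display can be taken as $C=1/8$.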

We prove Lemma D.1 in the following.

Proof of Lemma D.1.

We have that

$\displaystyle\mathbb{E}^{j}_{\delta_{j}}[f(R)]-\mathbb{E}_{0}[f(R)]$	$\displaystyle=\sum_{R}f(R)\left(\mathbb{P}^{j}_{\delta_{j}}(R)-\mathbb{P}_{0}(% R)\right)$
	$\displaystyle\leq\sum_{R:\mathbb{P}^{j}_{\delta_{j}}(R)\geq\mathbb{P}_{0}(R)}f% (R)\left(\mathbb{P}^{j}_{\delta_{j}}(R)-\mathbb{P}_{0}(R)\right)$
	$\displaystyle\leq M\sum_{R:\mathbb{P}^{j}_{\delta_{j}}(R)\geq\mathbb{P}_{0}(R)% }\left(\mathbb{P}^{j}_{\delta_{j}}(R)-\mathbb{P}_{0}(R)\right)$
	$\displaystyle=\frac{M}{2}\|\mathbb{P}^{j}_{\delta_{j}}-\mathbb{P}_{0}\|_{\rm TV}$
	$\displaystyle\leq\frac{M}{2}\sqrt{2{\sf KL}(\mathbb{P}_{0}\ \|\ \mathbb{P}^{j}_{\delta_{j}})},$	(5)

where the last inequality follows from Pinsker’s inequality. Let $R_{t}\in\mathbb{R}^{A}$ be a random vector denoting the payoff of each action at time $t$, and let $R^{t}\in\mathbb{R}^{t\times A}$ denote the payoff matrix received up to time $t$: $R^{t}=[R_{1},\ldots,R_{t}]^{\top}$. By the chain rule for relative entropy, we have

\displaystyle{\sf KL}(\mathbb{P}_{0}\ \|\ \mathbb{P}^{j}_{\delta_{j}})=\sum^{|% \mathcal{T}_{j}|}_{t=1}\mathbb{E}_{R^{t-1}}\left[{\sf KL}\left(\mathbb{P}_{0}(% R_{t}\mid R^{t-1})\ \|\ \mathbb{P}^{j}_{\delta_{j}}(R_{t}\mid R^{t-1})\right)% \right].

(6)

Note that

	$\displaystyle\mathbb{P}_{0}(R_{t}=[-2,0,-2,\ldots,-2]\mid R^{t-1})=\mathbb{P}_% {0}(R_{t}=[0,-2,-2,\ldots,-2]\mid R^{t-1})=1/4$
	$\displaystyle\mathbb{P}_{0}(R_{t}=[1,1,-2,\ldots,-2]\mid R^{t-1})=1/2.$

In the case $\delta_{j}=1$ , we have

	$\displaystyle\mathbb{P}^{j}_{\delta_{j}}(R_{t}=[-2,0,-2,\ldots,-2]\mid R^{t-1}% )=(1/2+{\epsilon})^{2}$
	$\displaystyle\mathbb{P}^{j}_{\delta_{j}}(R_{t}=[0,-2,-2,\ldots,-2]\mid R^{t-1}% )=(1/2-{\epsilon})^{2}$
	$\displaystyle\mathbb{P}^{j}_{\delta_{j}}(R_{t}=[1,1,-2,\ldots,-2]\mid R^{t-1})% =2(1/2+{\epsilon})(1/2-{\epsilon}).$

In the case $\delta_{j}=2$ , we have

	$\displaystyle\mathbb{P}^{j}_{\delta_{j}}(R_{t}=[-2,0,-2,\ldots,-2]\mid R^{t-1}% )=(1/2-{\epsilon})^{2}$
	$\displaystyle\mathbb{P}^{j}_{\delta_{j}}(R_{t}=[0,-2,-2,\ldots,-2]\mid R^{t-1}% )=(1/2+{\epsilon})^{2}$
	$\displaystyle\mathbb{P}^{j}_{\delta_{j}}(R_{t}=[1,1,-2,\ldots,-2]\mid R^{t-1})% =2(1/2+{\epsilon})(1/2-{\epsilon}).$

Thus, we have

	$\displaystyle{\sf KL}\left(\mathbb{P}_{0}(R_{t}\mid R^{t-1})\ \|\ \mathbb{P}^{j}_{\delta_{j}}(R_{t}\mid R^{t-1})\right)$	(7)
$\displaystyle=$	$\displaystyle\frac{1}{4}\ln\frac{1/4}{(1/2+{\epsilon})^{2}}+\frac{1}{4}\ln% \frac{1/4}{(1/2-{\epsilon})^{2}}+\frac{1}{2}\ln\frac{1/2}{2(1/2+{\epsilon})(1/% 2-{\epsilon})}$
$\displaystyle=$	$\displaystyle-\ln\left(1-4{\epsilon}^{2}\right).$	(8)

Combining (5), (6) and (8), we have

\displaystyle\mathbb{E}^{j}_{\delta_{j}}[f(R)]-\mathbb{E}_{0}[f(R)]\leq\frac{M% }{2}\sqrt{-2|\mathcal{T}_{j}|\ln\left(1-4{\epsilon}^{2}\right)}.

If we further have ${\epsilon}\leq 1/4$ , it then holds that $-\ln\left(1-4{\epsilon}^{2}\right)\leq 16\ln(4/3){\epsilon}^{2}$ and consequently

\displaystyle\mathbb{E}^{j}_{\delta_{j}}[f(R)]-\mathbb{E}_{0}[f(R)]\leq\frac{M% }{2}\sqrt{-2|\mathcal{T}_{j}|\ln\left(1-4{\epsilon}^{2}\right)}\leq 2M{% \epsilon}\sqrt{|\mathcal{T}_{j}|}\leq 2M{\epsilon}\sqrt{\Delta_{T}}.

∎
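The per-step KL computation and the elementary inequality used in the last step can be verified numerically. The following Python sketch is a sanity check rather than part of the proof; the function name `kl_step` and the test values of $\epsilon$ are assumptions made for the example.

```python
import math


def kl_step(eps):
    """KL(P_0(R_t | .) || P_delta(R_t | .)) over the three payoff vectors.

    The three outcomes are: both opponents play 1, both play 0, they split.
    Under P_0 they have probabilities 1/4, 1/4, 1/2; under Case 1 they have
    (1/2+eps)^2, (1/2-eps)^2, 2(1/2+eps)(1/2-eps). Case 2 swaps the first
    two entries, which leaves the KL divergence unchanged.
    """
    p0 = [0.25, 0.25, 0.5]
    p1 = [(0.5 + eps) ** 2, (0.5 - eps) ** 2, 2 * (0.5 + eps) * (0.5 - eps)]
    return sum(p * math.log(p / q) for p, q in zip(p0, p1))


if __name__ == "__main__":
    for eps in (0.01, 0.1, 0.25):
        closed_form = -math.log(1 - 4 * eps ** 2)
        upper = 16 * math.log(4 / 3) * eps ** 2
        print(eps, kl_step(eps), closed_form, upper)
```

The final constant also follows from this check: $\frac{1}{2}\sqrt{2\cdot 16\ln(4/3)}\approx 1.52\leq 2$, which is where the factor $2M\epsilon\sqrt{|\mathcal{T}_{j}|}$ comes from.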

D.2 Proof of Theorem 5.6

Proof of Theorem 5.6.

We consider a game that evolves stochastically, with $n$ players, action space $\mathcal{A}=\{0,1,\ldots,A-1\}$ , and the same payoff function $U^{(n)}_{1}$ as outlined in Theorem 5.5. We segment the decision horizon $T$ into $T/\Delta_{T}$ batches $\{\mathcal{T}_{j}\}$ , with each batch comprising $\Delta_{T}$ episodes. We consider two distinct scenarios:

• Case 1: All the other players play 0;

• Case 2: All the other players play 1.

In Case 1, we have $u^{t}(0)=0$ and $u^{t}(a)=-2$ for all $a\neq 0$ . In Case 2, we have $u^{t}(1)=0$ and $u^{t}(a)=-2$ for all $a\neq 1$ . At the beginning of each batch, one of these scenarios is randomly selected (with equal probability) and remains constant throughout that batch.

Let $m=T/\Delta_{T}$ denote the total number of batches. We fix an algorithm. Let $\delta_{j}\in\{1,2\}$ indicate whether batch $j$ belongs to Case 1 or Case 2. We denote by $\mathbb{P}^{j}_{\delta_{j}}$ the probability distribution conditioned on batch $j$ belonging to Case $\delta_{j}$, and by $\mathbb{E}^{j}_{\delta_{j}}[\cdot]$ the corresponding expectation. It then holds that

	$\displaystyle\mathbb{E}_{{\text{Alg}}}\left[\sum^{m}_{j=1}\sum_{t\in\mathcal{T% }_{j}}u^{t}(x^{t})\right]$	$\displaystyle=\sum^{m}_{j=1}\mathbb{E}_{{\text{Alg}}}\left[\sum_{t\in\mathcal{% T}_{j}}u^{t}(x^{t})\right]$
		$\displaystyle=\sum^{m}_{j=1}\mathbb{E}_{{\text{Alg}}}\left[\frac{1}{2}\mathbb{% E}^{j}_{1}\left[\sum_{t\in\mathcal{T}_{j}}u^{t}(x^{t})\right]+\frac{1}{2}% \mathbb{E}^{j}_{2}\left[\sum_{t\in\mathcal{T}_{j}}u^{t}(x^{t})\right]\right]$
		$\displaystyle\leq\sum^{m}_{j=1}\mathbb{E}_{{\text{Alg}}}\left[\frac{1}{2}% \mathbb{E}^{j}_{1}\left[u^{t^{j,1}}(x^{t^{j,1}})\right]+\frac{1}{2}\mathbb{E}^% {j}_{2}\left[u^{t^{j,1}}(x^{t^{j,1}})\right]\right],$

where $t^{j,1}$ represents the first episode of batch $j$ and the inequality follows from the fact that $u^{t}\leq 0$ . Note that

	$\displaystyle\frac{1}{2}\mathbb{E}^{j}_{1}\left[u^{t^{j,1}}(x^{t^{j,1}})\right% ]+\frac{1}{2}\mathbb{E}^{j}_{2}\left[u^{t^{j,1}}(x^{t^{j,1}})\right]$	$\displaystyle=-\left(\mathbb{P}^{j}_{1}(x^{t^{j,1}}\neq 0)+\mathbb{P}^{j}_{2}(% x^{t^{j,1}}\neq 1)\right)$
		$\displaystyle=-\left(\mathbb{P}(x^{t^{j,1}}\neq 0)+\mathbb{P}(x^{t^{j,1}}\neq 1% )\right)\leq-1,$

where the second equality follows from the fact that $x^{t^{j,1}}$ is independent of $\delta_{j}$. Thus, we have

\displaystyle\mathbb{E}_{{\text{Alg}}}\left[\sum^{m}_{j=1}\sum_{t\in\mathcal{T% }_{j}}u^{t}(x^{t})\right]\leq-m=-T/\Delta_{T}.

Choosing $\Delta_{T}=2T/V_{T}$ , we have

\displaystyle\mathbb{E}_{{\text{Alg}}}\left[\sum^{m}_{j=1}\sum_{t\in\mathcal{T% }_{j}}u^{t}(x^{t})\right]\leq-V_{T}/2.

∎
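The key step of the argument, that no first-episode distribution can do better than $-1$ on average over the two cases, can be illustrated with a short Python sketch. The function names `first_episode_value` and `total_payoff_upper_bound`, and the sample distributions, are assumptions made for the example.

```python
def first_episode_value(p):
    """Average over the two cases of the expected first-episode payoff.

    p[a] is the probability that the learner plays action a in the first
    episode of a batch; it cannot depend on the case drawn for that batch.
    A wrong action costs -2 and the right one yields 0.
    """
    miss_case1 = 1.0 - p[0]  # P(x != 0): wrong when all others play 0
    miss_case2 = 1.0 - p[1]  # P(x != 1): wrong when all others play 1
    return 0.5 * (-2.0 * miss_case1) + 0.5 * (-2.0 * miss_case2)


def total_payoff_upper_bound(T, V_T, p):
    """Upper bound on the total payoff when every batch starts with p."""
    delta = 2.0 * T / V_T        # batch length chosen in the proof
    m = T / delta                # number of batches, equal to V_T / 2
    return m * first_episode_value(p)


if __name__ == "__main__":
    print(first_episode_value([0.5, 0.5, 0.0]))
    print(first_episode_value([0.2, 0.3, 0.5]))
    print(total_payoff_upper_bound(1000, 10, [0.5, 0.5, 0.0]))
```

Putting any mass on dummy actions only lowers the value, and splitting mass between actions 0 and 1 in any proportion gives exactly $-1$ per batch.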

Appendix E Experiments Details

E.1 Algorithms

We refer readers to Algorithms 2–5 for the detailed implementation of the algorithms used in the experiments. For MV, we choose $\eta=1$. For SDG, we choose $\eta=2$.

Algorithm 2 Self-Play

0: Number of iterations $T$, action space $\mathcal{A}$, learning rate $\eta_{t}=\eta\sqrt{\frac{\log{|\mathcal{A}|}}{t}}$, number of players $n$, and initialize policy $\pi_{0}$

1: for $t=1$ to $T$ do

2: Sample actions $a^{t-1}_{i}\sim\pi_{t-1}$ for $i=2,\ldots,n$. Denote $a^{t-1}_{-1}:=(a^{t-1}_{2},\ldots,a^{t-1}_{n})$

3: Update $\pi_{t}(a)\propto\pi_{t-1}(a){\sf exp}{\{\eta_{t}U_{1}(a,a^{t-1}_{-1})\}},\forall a\in\mathcal{A}.$

4: end for
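A minimal Python sketch of the loop in Algorithm 2 follows. It is an illustration under stated assumptions rather than the code used in the experiments: the uniform initial policy, the fixed seed, the function names, and the use of the $3$-player majority vote payoff (with the values 0, 1, $-2$ that appear in the proof of Lemma D.1) as the example game are all choices made here.

```python
import math
import random


def majority_payoff(a, others):
    """Player 1's payoff in the 3-player majority vote game (actions 0/1)."""
    b, c = others
    if b == c:
        return 0.0 if a == b else -2.0
    return 1.0


def hedge_update(policy, payoffs, eta_t):
    """pi_t(a) proportional to pi_{t-1}(a) * exp(eta_t * payoff(a))."""
    weights = [p * math.exp(eta_t * u) for p, u in zip(policy, payoffs)]
    total = sum(weights)
    return [w / total for w in weights]


def self_play(T, num_actions, n, payoff, eta=1.0, seed=0):
    rng = random.Random(seed)
    policy = [1.0 / num_actions] * num_actions  # pi_0 uniform (assumption)
    for t in range(1, T + 1):
        # Step 2: opponents 2..n sample from the current policy pi_{t-1}.
        others = tuple(rng.choices(range(num_actions), weights=policy, k=n - 1))
        # Step 3: multiplicative-weights update with eta_t = eta*sqrt(log|A|/t).
        eta_t = eta * math.sqrt(math.log(num_actions) / t)
        payoffs = [payoff(a, others) for a in range(num_actions)]
        policy = hedge_update(policy, payoffs, eta_t)
    return policy


if __name__ == "__main__":
    print(self_play(T=200, num_actions=2, n=3, payoff=majority_payoff))
```

Replacing the sampling distribution in step 2 by a fixed $\pi_{\text{human}}$ turns the same loop into the best-response procedure of Algorithm 4.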

Algorithm 3 Self-Play with Regularization

0: Number of iterations $T$, action space $\mathcal{A}$, learning rate $\eta_{t}=\eta\sqrt{\frac{\log{|\mathcal{A}|}}{t}}$, number of players $n$, initialize policy $\pi_{0}$, human policy $\pi_{{\text{human}}}$ and regularization parameter $\lambda$

1: for $t=1$ to $T$ do

2: Sample actions $a^{t-1}_{i}\sim\pi_{t-1}$ for $i=2,\ldots,n$. Denote $a^{t-1}_{-1}:=(a^{t-1}_{2},\ldots,a^{t-1}_{n})$

3: Update $\pi_{t}(a)\propto{\sf exp}{\left\{\frac{\log\pi_{0}(a)+\sum_{\tau<t}\eta_{\tau}U_{1}(a,a^{\tau}_{-1})+\lambda\sum_{\tau<t}\eta_{\tau}\log\pi_{{\text{human}}}(a)}{1+\lambda\sum_{\tau<t}\eta_{\tau}}\right\}},\forall a\in\mathcal{A}.$

4: end for
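The closed-form update in step 3 of Algorithm 3 can be sketched as a single Python function. This is an illustrative sketch, not the authors' code: the name `regularized_policy`, the argument layout (accumulated quantities are passed in rather than recomputed), and the max-subtraction for numerical stability are assumptions made here, and all probabilities are assumed strictly positive so that the logarithms are finite.

```python
import math


def regularized_policy(pi0, cum_payoff, pi_human, lam, eta_sum):
    """Evaluate pi_t from step 3 of Algorithm 3.

    cum_payoff[a] stands for sum_{tau<t} eta_tau * U_1(a, a^tau_{-1}) and
    eta_sum for sum_{tau<t} eta_tau. Entries of pi0 and pi_human must be > 0.
    """
    denom = 1.0 + lam * eta_sum
    logits = [
        (math.log(p0) + g + lam * eta_sum * math.log(ph)) / denom
        for p0, g, ph in zip(pi0, cum_payoff, pi_human)
    ]
    top = max(logits)  # subtract the max before exponentiating (stability)
    weights = [math.exp(z - top) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    pi0 = [0.5, 0.5]
    human = [0.2, 0.8]
    for lam in (0.0, 1.0, 100.0):
        print(lam, regularized_policy(pi0, [3.0, 0.0], human, lam, eta_sum=1.0))
```

Two limiting cases help read the formula: with $\lambda=0$ the update reduces to the unregularized exponential-weights policy, and as $\lambda$ grows the policy is pulled towards $\pi_{\text{human}}$.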

Algorithm 4 Best Response to $\pi_{{\text{human}}}$

0: Number of iterations $T$, action space $\mathcal{A}$, learning rate $\eta_{t}=\eta\sqrt{\frac{\log{|\mathcal{A}|}}{t}}$, number of players $n$, and initialize policy $\pi_{0}$

1: for $t=1$ to $T$ do

2: Sample actions $a^{t-1}_{i}\sim\pi_{{\text{human}}}$ for $i=2,\ldots,n$. Denote $a^{t-1}_{-1}:=(a^{t-1}_{2},\ldots,a^{t-1}_{n})$

3: Update $\pi_{t}(a)\propto\pi_{t-1}(a){\sf exp}{\{\eta_{t}U_{1}(a,a^{t-1}_{-1})\}},\forall a\in\mathcal{A}.$

4: end for

Algorithm 5 Exploiter for $\pi$

0: Number of iterations $T$, action space $\mathcal{A}$, learning rate $\eta_{t}=\eta\sqrt{\frac{\log{|\mathcal{A}|}}{t}}$, number of players $n$, and initialize policy $\pi_{0}$

1: for $t=1$ to $T$ do

2: Sample actions $a^{t-1}_{1}\sim\pi$

3: Sample actions $a^{t-1}_{i}\sim\pi_{t-1}$ for $i=2,\ldots,n$. Denote $a^{t-1}_{-1}:=(a^{t-1}_{2},\ldots,a^{t-1}_{n})$

4: Update $\pi_{t}(a)\propto\pi_{t-1}(a){\sf exp}{\left\{-\frac{\eta_{t}}{n-1}\sum^{n}_{i=2}U_{1}\left(a^{t-1}_{1},a^{t-1}_{-1}[:i-1],a,a^{t-1}_{-1}[i+1:]\right)\right\}},\forall a\in\mathcal{A}.$

5: end for
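The update in step 4 of Algorithm 5 averages Player 1's payoff over all single-opponent deviations to action $a$ and then moves the exploiter policy against it. The Python sketch below illustrates one such update under stated assumptions: the function names are hypothetical, the slicing is written with 0-based tuple indices in place of the player indices $i=2,\ldots,n$, and the $3$-player majority vote payoff (values 0, 1, $-2$, as in the proof of Lemma D.1) serves as the example game.

```python
import math


def majority_payoff(a1, others):
    """Player 1's payoff in the 3-player majority vote game (actions 0/1)."""
    b, c = others
    if b == c:
        return 0.0 if a1 == b else -2.0
    return 1.0


def deviation_payoff(a, a1, others, payoff):
    """Average Player 1 payoff when each opponent in turn switches to a."""
    total = 0.0
    for k in range(len(others)):
        deviated = others[:k] + (a,) + others[k + 1:]
        total += payoff(a1, deviated)
    return total / len(others)


def exploiter_update(policy, a1, others, payoff, eta_t):
    """pi_t(a) proportional to pi_{t-1}(a) * exp(-eta_t * deviation_payoff)."""
    weights = [
        p * math.exp(-eta_t * deviation_payoff(a, a1, others, payoff))
        for a, p in enumerate(policy)
    ]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Player 1 plays 1; the two exploiters currently play (1, 0).
    print(exploiter_update([0.5, 0.5], 1, (1, 0), majority_payoff, eta_t=1.0))
```

In this example, switching an opponent to action 0 lowers Player 1's average payoff, so the updated exploiter policy shifts weight towards action 0.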

E.2 Computation Resources

The experiments are conducted on a server with 256 CPUs. Each experiment can be completed in a few minutes.
