On the Complexity of Learning to Cooperate with Populations of Socially Rational Agents

Robert Loftin
Department of Computer Science
University of Sheffield
Sheffield, S10 2TN, UK
[email protected]
Saptarashmi Bandyopadhyay
Department of Computer Science
University of Maryland
College Park, MD 20742, USA
[email protected]
Mustafa Mert Çelikok
Department of Intelligent Systems
Delft University of Technology
Delft, 2600 AA, The Netherlands
[email protected]

Abstract

Artificially intelligent agents deployed in the real world will require the ability to reliably cooperate with humans (as well as other, heterogeneous AI agents). To provide formal guarantees of successful cooperation, we must make some assumptions about how partner agents could plausibly behave. Any realistic set of assumptions must account for the fact that other agents may be just as adaptable as our agent is. In this work, we consider the problem of cooperating with a population of agents in a finitely-repeated, two-player general-sum matrix game with private utilities. Two natural assumptions in such settings are that: 1) all agents in the population are individually rational learners, and 2) when any two members of the population are paired together, with high probability they will achieve at least the same utility as they would under some Pareto-efficient equilibrium strategy. Our results first show that these assumptions alone are insufficient to ensure zero-shot cooperation with members of the target population. We therefore consider the problem of learning a strategy for cooperating with such a population using prior observations of its members interacting with one another. We provide upper and lower bounds on the number of samples needed to learn an effective cooperation strategy. Most importantly, we show that these bounds can be much stronger than those arising from a “naive” reduction of the problem to one of imitation learning.

1 Introduction

In this work, we address the problem of learning to cooperate with a socially intelligent population of agents from observations of interactions between members of this population. We study cooperation in finitely-repeated, two-player, general-sum matrix games with private payoffs. We say that a population of adaptive agents is socially intelligent if its members are (1) individually Hannan-consistent and (2) compatible in the sense that any pair of agents will perform nearly as well as some Pareto-optimal Nash equilibrium of the matrix game. We argue that this model of cooperation is more realistic than those that assume identical payoffs or public utilities. In real-world applications it is unlikely that independent agents will have identical utilities, or that they will provide complete information about their preferences or future behaviour to others. In the case of AI–AI cooperation, agents developed by different companies will not have access to each other’s source code, while in the case of human–AI cooperation, having the human fully describe their preferences or behaviour in advance may be infeasible. Therefore, the question we address in this work is: Can we learn to cooperate with a socially intelligent population of agents by observing its members cooperate with each other? We answer this question by providing upper and lower bounds on the sample complexity of learning good cooperation strategies.

If we make no assumptions about the target population, we can do little more than attempt to mimic observed behavior as closely as possible, reducing the problem to one of imitation learning. Unfortunately, the strategies of adaptive agents may depend on the full history of interaction, and so the sample complexity of imitation learning will grow exponentially in the length of the repeated game. Our main contribution is an upper-bound showing that, for partners drawn from a socially intelligent (consistent and compatible) population, we can learn to cooperate with far fewer samples than would be required by a pure imitation learning approach.

This result utilizes a class of what we refer to as imitate-then-commit strategies, which leverage the fact that the population is socially intelligent to achieve cooperation without perfect imitation. The key idea is that our agent only needs to learn to imitate a member of the target population long enough for the average strategy to approximate a Pareto-efficient solution. Once such a strategy is identified, our agent can switch to a coercive strategy such that any Hannan-consistent partner will either continue to adhere to the current joint strategy, or else switch to a superior strategy, with either case corresponding to “successful” cooperation.

In Section 2 we formalize our repeated game setting, and provide background on external regret and Hannan-consistency. We also propose a definition of cooperative compatibility (Definition 2.2) that is closely related to the notion of compatibility used in [1]. In Section 2.3, we provide our novel definition of social intelligence, and describe a realistic class of agents that satisfy it. In Section 3 we formalize our learning problem as that of trying to minimize altruistic regret, which we argue is the most natural measure of successful cooperation in this setting. We also give lower bounds on its sample complexity under different sets of assumptions. Finally, in Section 4 we present an upper bound on the number of samples needed to learn strategies that achieve small altruistic regret.

2 Preliminaries

Repeated bi-matrix games with private types.

Let $i\in\{1,2\}$ denote the agent index. We assume both agents have $N$ pure strategies (henceforth "actions"). Let $\Theta$ denote the finite type space, where $\theta_{1},\theta_{2}\in\Theta$ denote the private types of the two agents, and $\boldsymbol{\theta}=(\theta_{1},\theta_{2})$ denotes the joint type. We denote agent $i$ ’s payoff matrix as $G(\theta_{i})\in\Re^{N\times N}$ , and let $G(\boldsymbol{\theta})=[G(\theta_{1}),G(\theta_{2})^{\top}]$ denote the bi-matrix game parameterized by $\boldsymbol{\theta}$ (with agent 1 as the row player). In a single episode, the agents play $G(\boldsymbol{\theta})$ for a fixed number of stages $0<T<\infty$ . We let $a^{1}_{t}$ and $a^{2}_{t}$ denote the actions chosen by agents 1 and 2 in stage $0<t\leq T$ . For mixed strategies $\sigma,\sigma^{\prime}\in\Delta(N)$ , we let $G(\sigma,\sigma^{\prime};\theta)=\sigma^{\top}G(\theta)\sigma^{\prime}$ . We overload $a^{1}_{t}$ and $a^{2}_{t}$ to also denote the mixed strategies that assign all probability mass to actions $a^{1}_{t}$ and $a^{2}_{t}$ , such that $G(a^{1}_{t},a^{2}_{t};\theta_{1})$ and $G(a^{1}_{t},a^{2}_{t};\theta_{2})$ are agent $1$ and $2$ ’s respective payoffs at stage $t$ . We also assume that for all $\theta\in\Theta$ , $G_{ij}(\theta)\in[0,1],\forall i,j\in[N]$ .

Let $\mathcal{H}_{t}=(N\times N)^{t}$ be the set of histories of length $t$ (with $\mathcal{H}_{0}=\{\emptyset\}$ ), and let $\mathcal{H}_{\leq t}=\bigcup^{t}_{s=0}\mathcal{H}_{s}$ be the set of all histories of length at most $t$ . The strategy space $\Pi$ for an agent is then the space of mappings $\pi:\Theta\times\mathcal{H}_{\leq T-1}\mapsto\Delta(N)$ , where $\Delta(N)$ is the set of probability distributions over the action set $[N]$ . As a functional, a strategy $\pi$ maps each type $\theta$ to a behavioral strategy [2, Chapter 5.2.2] that maps histories of play to action distributions, such that $a^{i}_{t}\sim\pi_{i}(\theta_{i},h_{t-1})$ . We denote agent $i$ ’s expected total payoff for following strategy $\pi$ against $\pi^{\prime}$ as

M_{i}(\pi,\pi^{\prime};\theta,\theta^{\prime})=\text{E}\left[\left.\sum^{T}_{t=1}G(a^{i}_{t},a^{-i}_{t};\theta_{i})\right|\pi_{i}=\pi,\pi_{-i}=\pi^{\prime},\theta_{i}=\theta,\theta_{-i}=\theta^{\prime}\right],

(1)

where the expectation is taken over the actions $a^{i}_{t}$ and $a^{-i}_{t}$ sampled from the agents’ strategies.
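As a minimal sketch of the quantities above, the following Python computes the bilinear stage payoff $\sigma^{\top}G(\theta)\sigma^{\prime}$ and a Monte Carlo estimate of the expected total payoff in Equation (1) for the row agent. The function names, the representation of strategies as callables from a history to a probability vector, and the choice to fix the types inside those callables are assumptions made for illustration only.

```python
import random


def bilinear_payoff(sigma, sigma_prime, G):
    """Return sigma^T G sigma_prime, with G given as a list of rows."""
    return sum(
        sigma[a] * G[a][b] * sigma_prime[b]
        for a in range(len(sigma))
        for b in range(len(sigma_prime))
    )


def estimate_total_payoff(pi_row, pi_col, G_row, T, episodes=2000, seed=0):
    """Monte Carlo estimate of the row agent's expected total payoff over T stages.

    pi_row and pi_col are hypothetical callables that map a history (a list of
    (row_action, col_action) pairs) to a probability vector over actions.
    """
    rng = random.Random(seed)
    n = len(G_row)
    total = 0.0
    for _ in range(episodes):
        history = []
        for _ in range(T):
            a = rng.choices(range(n), weights=pi_row(history))[0]
            b = rng.choices(range(n), weights=pi_col(history))[0]
            total += G_row[a][b]  # stage payoff G(a^1_t, a^2_t; theta_1)
            history.append((a, b))
    return total / episodes
```

With deterministic strategies the estimate is exact; for stochastic strategies its accuracy depends on the number of simulated episodes.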

2.1 Consistency

A natural criterion for rationality is that an agent should attempt to achieve a payoff nearly as large as the best response to its partner’s average strategy, which we refer to as consistency. To account for the non-stationary behavior of other agents, we specifically consider Hannan consistency [3], which in our finite-time setting simply requires that an agent have bounded external regret over $T$ stages. The external regret for agent $i$ is defined as

R^{\text{ext}}_{i}(h;\theta)=\max_{a^{i}\in[N]}\sum^{|h|}_{t=1}\left\{G(a^{i},a^{-i}_{t}(h);\theta_{i})-G(a^{i}_{t}(h),a^{-i}_{t}(h);\theta_{i})\right\}

(2)

where $a^{i}_{t}(h)$ denotes the action $i$ played at stage $t$ within the history $h\in\mathcal{H}_{\leq T}$ .
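The external regret in Equation (2) can be computed directly from a recorded history. The sketch below assumes a history is a list of $(a^{1}_{t},a^{2}_{t})$ pairs with zero-based action indices, and that `G_own[a][b]` is the agent's payoff when it plays `a` and its partner plays `b`; the function name and this data layout are illustrative choices, not part of the formal setup.

```python
def external_regret(history, G_own, agent=0):
    """External regret of one agent over a history of (a1, a2) action pairs.

    agent=0 selects agent 1 (first entry of each pair), agent=1 selects agent 2.
    """
    n = len(G_own)
    own = [pair[agent] for pair in history]
    other = [pair[1 - agent] for pair in history]
    # Payoff actually collected along the history.
    realized = sum(G_own[a][b] for a, b in zip(own, other))
    # Payoff of the best fixed action against the partner's realized actions.
    best_fixed = max(sum(G_own[a][b] for b in other) for a in range(n))
    return best_fixed - realized
```

For example, with the payoff matrix `[[1, 0], [0, 0.5]]`, an agent that plays action 0 against a partner who always plays action 1 for four stages accumulates regret 2.0, while playing action 1 throughout gives regret 0.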

Definition 2.1 (Consistency).

For $\delta,\epsilon,T>0$ , an agent $i\in\{1,2\}$ is $(\delta,\epsilon,T)$ -consistent if, for all types $\theta\in\Theta$ , and any partner strategy, we have that $\frac{1}{T}R^{\text{ext}}_{i}(h_{T};\theta)\leq\epsilon$ with probability at least $1-\delta$ .

We also define the expected external regret $\bar{R}^{\text{ext}}_{i}(h;\theta)$ by replacing the $a^{i}_{t}(h)$ (the action $i$ played at stage $t$ ) with their full strategy $\pi^{i}(\theta,h_{t})$ . $R^{\text{ext}}_{i}(h;\theta)$ and $\bar{R}^{\text{ext}}_{i}(h;\theta)$ are related by the inequality

R^{\text{ext}}_{i}(h_{t};\theta)\leq\bar{R}^{\text{ext}}_{i}(h_{t};\theta)+\sqrt{\frac{T}{2}\ln\frac{1}{\delta}},

(3)

which holds w.p. at least $1-\delta$ for all $t\leq T$ simultaneously (this follows directly from [4, Lemma 4.1]). We therefore only need to bound $\bar{R}^{\text{ext}}_{i}(h_{t};\theta)$ to provide high-probability regret bounds.

2.2 Cooperative compatibility

	$A$	$B$
$A$	$2,2$	$0,0$
$B$	$0,0$	$1,1$

(a) A fully-cooperative 2x2 matrix game.

	$C$	$D$
$C$	$2,2$	$0,3$
$D$	$3,0$	$1,1$

(b) The prisoner’s dilemma game.

Table 1:

Even in a fully cooperative game, the fact that both agents are consistent does not guarantee that they will achieve an optimal outcome. In the $2\times 2$ game in Table 1a for example, both $(A,A)$ and $(B,B)$ are Nash equilibria to which consistent agents could converge, but only $(A,A)$ is optimal. In general-sum games, consistency may preclude Pareto-optimal outcomes, as in the classic prisoner’s dilemma game (Table 1b), where the only outcome in which neither player incurs positive regret is $(D,D)$ , which is Pareto-dominated by $(C,C)$ . Therefore, similar to [1], we define successful cooperation in terms of the Pareto-optimal Nash equilibria (PONE) [5] of a game $G$ .
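The two examples in Table 1 can be checked mechanically. The sketch below enumerates pure-strategy Nash equilibria and then discards those that are strictly worse for both agents than some other pure equilibrium. Restricting attention to pure strategies is a simplifying assumption of this sketch (mixed equilibria are ignored), and the function names are hypothetical. Both payoff matrices are indexed own-action-first, matching $G(a^{i},a^{-i};\theta_{i})$.

```python
def pure_nash(G1, G2):
    """Pure Nash equilibria (a, b): a is agent 1's action, b is agent 2's.

    G1[a][b] is agent 1's payoff; G2[b][a] is agent 2's payoff (own action first).
    """
    n = len(G1)
    equilibria = []
    for a in range(n):
        for b in range(n):
            best_for_1 = all(G1[a][b] >= G1[x][b] for x in range(n))
            best_for_2 = all(G2[b][a] >= G2[y][a] for y in range(n))
            if best_for_1 and best_for_2:
                equilibria.append((a, b))
    return equilibria


def undominated_pure_nash(G1, G2):
    """Pure equilibria that no other pure equilibrium strictly improves for both agents."""
    ne = pure_nash(G1, G2)
    return [
        (a, b)
        for (a, b) in ne
        if not any(G1[c][d] > G1[a][b] and G2[d][c] > G2[b][a] for (c, d) in ne)
    ]
```

Applied to Table 1a (payoff matrix `[[2, 0], [0, 1]]` for both agents), the first function returns both $(A,A)$ and $(B,B)$ and the second keeps only $(A,A)$. Applied to Table 1b (payoff matrix `[[2, 0], [3, 1]]` for both agents), both return only $(D,D)$.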

Let $\mathcal{N}(G)\subseteq\Delta(N)\times\Delta(N)$ be the set of Nash equilibria (NE) of $G$ . For a fully-cooperative game, $\mathcal{N}(G)$ will contain all globally optimal strategy profiles for $G$ . It may, however, also contain joint strategies that are highly sub-optimal. Let $\mathcal{P}(G)\subseteq\mathcal{N}(G)$ denote the set of Pareto optimal Nash equilibria. In this work, we say that a strategy profile $\langle\sigma_{1},\sigma_{2}\rangle\in\mathcal{P}(G)$ if and only if $\langle\sigma_{1},\sigma_{2}\rangle\in\mathcal{N}(G)$ , and there does not exist $\langle\sigma^{\prime}_{1},\sigma^{\prime}_{2}\rangle\in\mathcal{N}(G)$ such that $G(\sigma^{\prime}_{1},\sigma^{\prime}_{2};\theta_{1})>G(\sigma_{1},\sigma_{2};\theta_{1})$ and $G(\sigma^{\prime}_{2},\sigma^{\prime}_{1};\theta_{2})>G(\sigma_{2},\sigma_{1};\theta_{2})$ . This means that $\langle\sigma_{1},\sigma_{2}\rangle$ is a PONE if it is a Nash equilibrium of $G$ , and it is not strongly Pareto-dominated by any other Nash equilibrium of $G$ . Intuitively, if two agents are individually consistent, and willing to cooperate with each other, their joint payoff profile should not be dominated by any PONE. We formalize this intuition as follows:

Definition 2.2 (Compatibility).

For $\delta,\epsilon,T>0$ , two agents $\pi^{1}$ and $\pi^{2}$ are $(\delta,\epsilon,T)$ -compatible if, when played together, for any joint type $\boldsymbol{\theta}\in\Theta\times\Theta$ , w.p. at least $1-\delta$ , $\exists\langle\sigma^{*}_{1},\sigma^{*}_{2}\rangle\in\mathcal{P}(G(\boldsymbol{\theta}))$ s.t.

\frac{1}{T}\sum^{T}_{t=1}\left\{G(\sigma^{*}_{i},\sigma^{*}_{-i};\theta_{i})-G(a^{i}_{t},a^{-i}_{t};\theta_{i})\right\}\leq\epsilon,

(4)

for both $i=1$ and $i=2$ .

A pair of agents is compatible if, when paired together, with high probability over their path of play $h_{T}$ there will exist some PONE that does not $\epsilon$ -dominate their realized payoffs. Note that this definition is the approximate, finite-horizon version of the one provided in [1].
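For a single realized history, the condition in Equation (4) reduces to comparing each agent's average realized payoff with its payoff under a candidate PONE. The following sketch checks that condition for one history; the probabilistic part of Definition 2.2 (the condition holding with probability at least $1-\delta$) is not covered. The list `pone_payoffs` of candidate PONE payoff pairs is assumed to be supplied by the caller, and all names are illustrative.

```python
def average_shortfall(history, G_own, target_payoff, agent):
    """Target payoff minus the agent's average realized payoff over the history."""
    realized = sum(G_own[pair[agent]][pair[1 - agent]] for pair in history)
    return target_payoff - realized / len(history)


def history_satisfies_eq4(history, G1, G2, pone_payoffs, eps):
    """True if some candidate PONE payoff pair (u1, u2) satisfies Equation (4)
    for both agents on this particular history."""
    return any(
        average_shortfall(history, G1, u1, 0) <= eps
        and average_shortfall(history, G2, u2, 1) <= eps
        for (u1, u2) in pone_payoffs
    )
```

For instance, in a coordination game with payoffs `[[1, 0], [0, 0.5]]` for both agents and a single PONE with payoffs $(1,1)$, a ten-stage history with nine plays of the first joint action and one of the second has an average shortfall of 0.05 for each agent, so it passes for $\epsilon=0.1$ and fails for $\epsilon=0.01$.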

2.3 Socially intelligent agents

We argue that it is natural to model an existing population of cooperating agents as a set of approximately compatible, but otherwise heterogeneous agents. We therefore introduce the more general idea of a socially intelligent class of agents that are compatible with any other member of their class:

Definition 2.3 (Social Intelligence).

A set $C$ of agents forms a socially intelligent class w.r.t. $\Theta$ if, for some $\delta,\epsilon,T>0$ , each agent $\pi\in C$ is $(\delta,\epsilon,T)$ -consistent for all $\theta\in\Theta$ , and any two agents $\pi,\pi^{\prime}\in C$ are $(\delta,\epsilon,T)$ -compatible over all joint types $\Theta$ . An individual agent $\pi$ is called socially intelligent if it forms a socially intelligent class $\{\pi\}$ with itself.

The Hannan consistency requirement ensures that any agent in the population always has bounded average regret, whereas approximate compatibility means that if both agents are from $C$ , with high probability there will exist some PONE that does not $\epsilon$ -dominate their path of play. Below we describe a socially intelligent class based on a pre-agreed coordination protocol.

Coordination protocols

For a type space $\Theta$ , we first define a function $s(\boldsymbol{\theta})\in\mathcal{P}(G(\boldsymbol{\theta}))$ that maps from each joint type $\boldsymbol{\theta}$ to a strategy profile in $\mathcal{P}(G(\boldsymbol{\theta}))$ . We can think of $s(\boldsymbol{\theta})$ as a common “convention” the agents in $C$ have settled upon. Since we assume private types, members of $C$ do not know each other’s type at the beginning of their interaction. If any type $\theta\in\Theta$ can be communicated to others in a sequence of $k<T$ actions, then agents in $C$ can agree on a coordination protocol similar to a handshake. Let the protocol be a map $\kappa(\theta)$ from types to a history-dependent policy. Then, at the beginning of each interaction, both agents will play $\kappa$ for $k$ steps in order to communicate their types. After coordinating with each other, the agents play $s((\theta_{i},\theta_{-i}))$ for the remaining $T-k$ steps. The agents must still ensure their partner does not deviate from $s((\theta_{i},\theta_{-i}))$ for safety against adversarial “imposters”. Since playing a PONE jointly will lead to low regret for both, if $i$ ’s regret exceeds a certain threshold, this would indicate that $-i$ is deviating from $s$ significantly. The threshold can be chosen with the aid of the following lemma.

Lemma 2.4.

For any $\delta,T>0$ , if both players follow strategy $s(\boldsymbol{\theta})$ at each stage, then with probability at least $1-\delta$ we have

\bar{R}^{\text{ext}}_{i}(h_{t};\theta_{i})\leq\sqrt{2T\ln\frac{2}{\delta}}\quad\text{and}\quad R^{\text{ext}}_{i}(h_{t};\theta_{i})\leq 2\sqrt{2T\ln\frac{4}{\delta}},

(5)

which follows from an application of the Azuma-Hoeffding inequality (shown in Appendix A.1). The remaining question is which safe strategy agent $i$ should fall back on if this rule is triggered. We base the fallback strategy on the multiplicative weights [6] update rule, defined as:

s^{i}_{\text{mw},k}(h_{t};\theta_{i})\propto s^{i}_{\text{mw},k}(h_{t-1};\theta_{i})\exp\left(-\eta G(k,a^{-i}_{t-1}(h);\theta_{i})\right)

(6)

for $k\in[N]$ , where $s^{i}_{\text{mw}}(h_{0};\theta_{i})$ is the uniform strategy. Define $\pi^{\text{mw},T}$ as the agent that plays $s^{i}_{\text{mw}}(h_{t};\theta_{i})$ with learning rate $\eta=\sqrt{8\ln(N)/T}$ . The expected external regret of $\pi^{\text{mw},T}$ is bounded as

\bar{R}^{\text{ext}}_{i}(h_{T};\theta_{i})\leq\sqrt{\frac{T}{2}\ln N}

(7)

surely [4, Theorem 2.2]. We then define the agent’s overall strategy $\pi^{T,\epsilon}$ as follows:

1. In the first $k$ steps, play $\kappa(\theta_{i})$ .
2. If $-i$ ’s behaviour in $h_{k}$ is not compatible with $\kappa(\theta)$ for any $\theta\in\Theta$ , switch to $\pi^{\text{mw},T}$ for all subsequent stages.
3. While $\bar{R}^{\text{ext}}_{i}(h_{t};\theta_{i})\leq k+\epsilon(T-k)-\sqrt{\frac{T-k}{2}\ln N}-1$ , play $s_{i}(\boldsymbol{\theta})$ .
4. Otherwise, switch to $\pi^{\text{mw},T}$ for all subsequent stages.
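The fallback learner and the switching rule in step 3 can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the sketch treats $G$ as a payoff to be maximized and therefore weights actions by $\exp(+\eta\cdot\text{cumulative payoff})$, which is equivalent to a negative exponent applied to the loss $1-G$; it uses the standard tuning $\eta=\sqrt{8\ln(N)/T}$ from [4, Theorem 2.2]; and the function names are hypothetical.

```python
import math


def mw_learning_rate(N, T):
    """Standard tuning of the exponential-weights learning rate for horizon T."""
    return math.sqrt(8.0 * math.log(N) / T)


def mw_strategy(partner_actions, G_own, eta):
    """Multiplicative-weights distribution after observing the partner's past actions.

    Assumption: G_own holds payoffs (gains), so higher cumulative payoff means
    higher weight. With no observations the distribution is uniform.
    """
    n = len(G_own)
    cumulative = [sum(G_own[k][b] for b in partner_actions) for k in range(n)]
    top = max(cumulative)  # subtract the maximum for numerical stability
    weights = [math.exp(eta * (c - top)) for c in cumulative]
    z = sum(weights)
    return [w / z for w in weights]


def keep_following_convention(expected_regret, eps, T, k, N):
    """Step 3: keep playing the convention while the regret stays under the threshold."""
    threshold = k + eps * (T - k) - math.sqrt((T - k) / 2.0 * math.log(N)) - 1.0
    return expected_regret <= threshold
```

Recomputing the weights from cumulative payoffs is equivalent to applying the per-stage update repeatedly, and avoids carrying unnormalized weights between stages.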

The theorem below shows that agents that follow the social authentication strategy above form a socially intelligent class among themselves. All proofs are deferred to Appendix A.2.

Theorem 2.5.

For any $\delta>0$ and $T>k$ , let $\epsilon_{0}\geq\sqrt{\frac{2}{(T-k)}\ln\frac{2}{\delta}}$ , and let $\epsilon_{1}=\epsilon_{0}+\sqrt{\frac{1}{2(T-k)}\ln N}+\frac{1}{(T-k)}$ . Then for $\epsilon=\epsilon_{1}+\sqrt{\frac{(T-k)}{2}\ln\frac{1}{\delta}}$ , the agent $\pi^{T,\epsilon_{1}}$ is $(\delta,\epsilon,T)$ -socially intelligent.

3 Learning to Cooperate

Going forward, we will assume that our agent (henceforth referred to as the “AI”) will take the role of agent $1$ , while the other agent (referred to as the “partner”) will be agent $2$ . Our goal is to choose a strategy for the AI that can cooperate with a partner drawn from some target population nearly as effectively as agents from this population cooperate with one another. For a parametric game $G$ , with type space $\Theta$ , we will let the target population be a set $C$ of strategies forming a $(\delta,\epsilon,T)$ -SI class w.r.t. $\Theta$ . Ideally, we would hope to choose an AI strategy $\pi$ that can cooperate with $C$ without any additional information about the strategies in $C$ . Looking at the coordination protocol example in Section 2.3, we can see that in many cases a population is likely to use arbitrary conventions to coordinate their behavior, and intuitively we would imagine cooperation to be impossible without prior knowledge of these conventions. (We make this intuition formal in Theorem 3.5.)

We therefore consider the problem of learning a cooperative AI strategy using prior observations of members of the target population interacting with one another. We define a social learning problem by a tuple $\{G,\Theta,C,\rho,\mu\}$ , where $C$ is the target population (SI w.r.t. $\Theta$ ), $\rho$ is a distribution over $C$ , and $\mu$ is a distribution over the joint type space $\Theta\times\Theta$ . We can think of $C$ as the set of possible strategies that any member of the target population might follow, while $\rho$ is the frequency of those strategies within the population. To choose an AI strategy, we leverage a dataset $\mathcal{D}=\{(\theta^{j}_{1},\theta^{j}_{2},h^{j}_{T})|j\in[n]\}$ covering $n$ episodes of length $T$ . In each episode $j$ , two agents $\pi^{1}_{j}$ and $\pi^{2}_{j}$ are sampled independently from $\rho$ , and played together under the joint type $\boldsymbol{\theta}_{j}\sim\mu$ . The AI observes the full history $h^{j}_{T}$ , along with the agents’ types $\theta^{j}_{1}$ and $\theta^{j}_{2}$ . We denote a specific learning algorithm as a data-conditioned strategy $\pi(\mathcal{D})$ .
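The data-generating process described above can be sketched in a few lines. The representation is assumed for illustration: agents are callables mapping a type and a history to a probability vector over $N$ actions, `rho` and `mu` are lists of probabilities aligned with `population` and `joint_types`, and all function names are hypothetical.

```python
import random


def play_episode(agent1, agent2, theta1, theta2, T, N, rng):
    """Play T stages and return the history as a tuple of (a1, a2) pairs."""
    history = []
    for _ in range(T):
        a1 = rng.choices(range(N), weights=agent1(theta1, history))[0]
        a2 = rng.choices(range(N), weights=agent2(theta2, history))[0]
        history.append((a1, a2))
    return tuple(history)


def sample_dataset(population, rho, joint_types, mu, T, N, n, seed=0):
    """Sample n episodes: two agents drawn independently from rho, joint type from mu."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(n):
        agent1, agent2 = rng.choices(population, weights=rho, k=2)
        theta1, theta2 = rng.choices(joint_types, weights=mu)[0]
        history = play_episode(agent1, agent2, theta1, theta2, T, N, rng)
        dataset.append((theta1, theta2, history))
    return dataset
```

Each dataset entry has the form $(\theta^{j}_{1},\theta^{j}_{2},h^{j}_{T})$, so the identities of the sampled strategies are not recorded, only their types and realized play.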

3.1 Altruistic Regret

We seek an AI strategy that minimizes the regret relative to some Pareto optimal solution to $G(\boldsymbol{\theta})$ . Rather than minimizing regret in terms of the AI’s own payoffs, however, we seek to minimize the partner’s regret relative to their (worst-case) PONE in $G(\boldsymbol{\theta})$ . We formalize this regret with the following definition:

Definition 3.1 (Altruistic Regret).

Let $(\sigma^{*}_{i},\sigma^{*}_{-i})\in\mathcal{P}(G_{-i}(\theta_{-i}))$ denote the PONE with the lowest payoff for agent $-i$ , where $i\in\{1,2\}$ . The altruistic regret of agent $i$ is defined as

R^{\text{alt}}_{i}(h_{T};\theta_{-i})=\sum^{T}_{t=1}G(\sigma^{*}_{i},\sigma^{*}_{-i};\theta_{-i})-G(a^{i}(h_{t}),a^{-i}(h_{t});\theta_{-i}).

(8)

In practical cooperation tasks, we would expect outcomes that have low regret for the partner will have low regret for the AI as well.
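Given the partner's payoff under their worst-case PONE, the altruistic regret of Equation (8) is a simple sum over the history. In the sketch below that payoff is passed in as `worst_pone_payoff` (computing it is outside the scope of the example), `G_partner[a][b]` is the partner's payoff when the partner plays `a` and the AI plays `b`, and the function name is hypothetical.

```python
def altruistic_regret(history, G_partner, worst_pone_payoff, partner=1):
    """Sum over stages of the partner's worst-case PONE payoff minus the
    partner's realized payoff. partner=1 means the partner is agent 2."""
    realized = sum(G_partner[pair[partner]][pair[1 - partner]] for pair in history)
    return len(history) * worst_pone_payoff - realized
```

For example, with partner payoffs `[[1, 0], [0, 0.5]]` and a worst-case PONE payoff of 1, four stages of the joint action $(1,1)$ give an altruistic regret of 2.0, while play that stays on $(0,0)$ gives 0.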

The cooperation objective for the AI agent can then be formalized as minimising the altruistic regret. Contrary to what the definition might suggest, the AI agent must know its own type as well. As seen in the coordination protocols example, if the AI fails to imitate a human of its type, or fails to communicate its type correctly, the partner might switch to a safe strategy.

The goal for the AI is to minimize its expected altruistic regret over partners sampled from $\rho$ and types sampled from $\mu$ . The following lemma shows that we can treat the problem of minimizing regret with respect to a heterogeneous population $C$ as that of minimizing regret w.r.t. a single stochastic strategy.

Lemma 3.2.

Let $C$ be a finite set of agents that are $(\delta,\epsilon,T)$ -socially intelligent w.r.t. type space $\Theta$ , and let $\rho$ be a distribution over $C$ . There exists a mixed strategy $\bar{\rho}$ that forms a $(\delta,\epsilon,T)$ -socially intelligent class, and which is equivalent to playing against partners sampled from $\rho$ in expectation.

Proof. In a perfect recall game, every behavioural strategy has an equivalent mixed strategy and vice-versa [7]. Thus $\rho$ can equivalently be defined as a distribution over mixed strategies so that $\rho\in\Delta(\Delta(N))$ . Then defining $\bar{\rho}(a)=\int_{\Delta(N)}\sigma(a)\,d\rho(\sigma)$ where $a\in[N]$ denotes a pure strategy (i.e. action) completes the proof.

In order to show the joint impact of consistency and compatibility on the learning problem, we discuss the cases where the population is either consistent or compatible, but not both.

3.2 Consistency without Compatibility

Assume that $C$ consists of agents that are consistent but not necessarily compatible. The most general class in this case is the class of all no-external-regret learners (no-regret henceforth). It is a well-established result that the long-run average of no-regret learning converges to the set of coarse correlated equilibria. The question is whether the AI agent can learn to do better than a coarse correlated equilibrium when paired with a member of $C$ , using only a dataset $\mathcal{D}$ that consists of histories of play for different CCEs.

Theorem 3.3.

There exists a consistent yet incompatible class of agents $C$ such that even with an infinite amount of data, the AI cannot learn strategies that minimise altruistic regret.

Proof. The proof follows from Theorem 3 of Monnot and Piliouras [8], which shows that given any coarse correlated equilibrium of a two-player normal-form game, there exists a pair of no-regret learners that would converge to it. Since $C$ can be any subset of no-regret learners, we cannot exclude those that converge to an inefficient CCE. If the class $C$ contains only agents that converge to Pareto-inefficient CCEs, we cannot hope to learn optimal strategies from any dataset. Given an observed CCE $z$ in the dataset, assume that the AI knows it is facing one of the two agents that generated $z$ , but does not know their type explicitly. Using a Stackelberg argument similar to Brown et al. [9], we prove in Appendix B.2 that the AI can compute and commit to a leader strategy such that the payoffs are never strongly Pareto-dominated by $z.$ However, even in this case, we cannot eliminate the possibility of it being weakly dominated.

Regardless of the dataset, in the online phase, the AI faces a new agent from $C$ each time and does not know their type. We may hope to learn a classifier to quickly infer our partner’s type online from their behaviour, assuming there exists a mapping from initial behaviour to types. However, since $C$ consists only of no-regret learners guaranteed to converge to a CCE in self-play, they have no reason to initially communicate their types to each other.

3.3 Compatibility without consistency

Assume that the members of $C$ are compatible but not consistent. We can construct such a class by using the coordination protocols example from Section 2.3. Now, when agents from $C$ successfully identify each other after the authentication phase, they proceed with playing the agreed-upon PONE. However, if at any moment their partner plays the wrong action, there is no constraint on what strategy they will switch to. This setting is equivalent to the case considered by Loftin and Oliehoek [10] in their impossibility result. The members of $C$ can employ grim-trigger strategies that forever punish the other agent, triggered by a mistake at any point. Even if we eliminate grim-trigger strategies, the impossibility result shows that there still exist strategies that the members of $C$ can play once triggered that make the other agent suffer regret arbitrarily close to $\frac{1}{2}$ with payoffs in $[0,1].$ Since a single mistake during the online interaction can lead to the partner playing strategies that yield linear regret, the outsider must learn to imitate at least one member of $C$ perfectly from the dataset. Therefore the offline problem in this setting reduces to imitation learning, in particular the no-interaction case from Rajaraman et al. [11].

For each agent, the authentication protocol $\kappa$ is equivalent to a history-dependent policy that they commit to playing in the first $k$ time-steps. The lower bound on the expected sub-optimality of imitation learning from Rajaraman et al. [11] is based on the fact that the imitator cannot do better than uniformly random in unseen states. In the case of $\kappa$ , states correspond to histories up to length $k.$ Since every $k$ -step history can uniquely embed a type, an unseen history means a high probability of making a mistake if paired with the corresponding type. Therefore, to avoid linear altruistic regret, the AI must observe at least $|\mathcal{H}_{k}|$ samples, where $\mathcal{H}_{k}$ is the set of all possible $k$ -step histories.

Theorem 3.4.

Let $M$ be the number of unique samples of $k$ -step histories in the dataset. There exists a class of agents $C$ with a $k$ -step social authentication protocol such that bounding the probability of failing to authenticate by $\delta$ requires $M\geq\frac{N^{3k}-\delta N^{3k}-N^{2k}}{N^{k}-1}$ samples. For growing $k$ , the sample complexity lower bound is therefore $M=\Omega(N^{2k}).$

Proof. Consider the coordination protocol example mentioned above. Let $h_{k}\in\mathcal{H}_{k}$ be missing from the dataset. When the AI is paired with the corresponding partner type, the probability of correctly authenticating is $\frac{1}{N^{k}},$ and thus authentication fails with probability $\frac{N^{k}-1}{N^{k}}.$ Assuming we face each type uniformly randomly, if we have $M$ unique samples, the probability of facing an unobserved history is $\frac{N^{2k}-M}{N^{2k}}$ since $|\mathcal{H}_{k}|=N^{2k}$ . Then the probability of failing is $\frac{N^{k}-1}{N^{k}}\times\frac{N^{2k}-M}{N^{2k}}=1-\frac{M}{N^{2k}}-\frac{1}{N^{k}}+\frac{M}{N^{3k}}.$ In order to bound this by $\delta,$ we need $M\geq\frac{N^{3k}-\delta N^{3k}-N^{2k}}{N^{k}-1}$ samples. Since $k$ steps need to embed each type uniquely, $k$ grows with the size of the type space. For large $k$ , the numerator is dominated by $(1-\delta)N^{3k}$ and the denominator by $N^{k}$ , thus we have $M=\Omega(N^{2k})$ as $k$ grows.
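The product form of the failure probability in the proof can be evaluated exactly with rational arithmetic. Rearranging $\frac{N^{k}-1}{N^{k}}\times\frac{N^{2k}-M}{N^{2k}}\leq\delta$ for $M$ gives $M\geq N^{2k}\left(1-\delta\frac{N^{k}}{N^{k}-1}\right)$, which the sketch below uses to find the smallest sufficient integer $M$. The function names are hypothetical, and $N\geq 2$, $k\geq 1$ are assumed.

```python
import math
from fractions import Fraction


def failure_probability(N, k, M):
    """Probability of facing an unseen k-step history and then guessing it wrong."""
    unseen = Fraction(N ** (2 * k) - M, N ** (2 * k))
    wrong_guess = Fraction(N ** k - 1, N ** k)
    return wrong_guess * unseen


def min_unique_histories(N, k, delta):
    """Smallest integer M with failure_probability(N, k, M) <= delta."""
    delta = Fraction(delta)
    bound = N ** (2 * k) * (1 - delta * Fraction(N ** k, N ** k - 1))
    return max(0, math.ceil(bound))
```

For $N=2$, $k=2$ and $\delta=1/10$, there are $N^{2k}=16$ possible histories and the sketch returns $M=14$: the failure probability is $3/32$ with 14 unique histories and $9/64$ with 13.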

An immediate consequence of Theorem 3.4 is that, in the case of compatibility without consistency, this sample complexity is what is required to bound the probability of suffering linear regret. This is because failing to authenticate can now lead to linear regret, since the partner can switch to arbitrary strategies.

3.4 Lower bound for socially intelligent populations

Theorem 3.5.

Let $M$ denote the number of histories with unique first $k$ steps in the dataset $\mathcal{D}$ generated by the members of a socially intelligent class $C$ . There exists a $C$ where $R^{\text{alt}}_{i}(h_{T};\theta_{-i})=T$ with probability $\frac{N^{k}-1}{N^{k}}\times\frac{N^{2k}-M}{N^{2k}}=1-\frac{M}{N^{2k}}-\frac{1}{N^{k}}+\frac{M}{N^{3k}}.$

Proof. Let $C$ be a socially intelligent class of agents following a coordination protocol akin to the one described in Section 2.3. The probability follows from the proof of Theorem 3.4 as the probability of failing to authenticate. If the authentication fails, the partner switches to an arbitrary Hannan-consistent strategy. As stated in Section 3.2, a consistent partner strategy may never communicate the partner’s type. Without knowing the partner’s type, the agent’s worst-case average altruistic regret can be $1$ , since it cannot compute its true regret without the partner’s type (see Definition 3.1). Let there be two possible partner types, $\theta_{-i}=\theta_{2}$ or $\theta_{-i}=\theta_{3}.$ If the agent $i$ mistakenly assumes $\theta_{-i}=\theta_{2}$ , its behaviour attempts to minimize $R^{\text{alt}}_{i}(h_{T};\theta_{2})=\sum^{T}_{t=1}G(\sigma^{*}_{i},\sigma^{*}_{-i};\theta_{2})-G(a^{i}(h_{t}),a^{-i}(h_{t});\theta_{2})$ . Meanwhile, the play of the partner will be a no-regret algorithm with respect to the external regret $R^{\text{ext}}_{-i}(h;\theta_{3}).$ Having no other constraints on the type space, there is nothing stopping us from constructing a $\Theta$ such that a strategy minimizing $R^{\text{ext}}_{-i}(h_{T};\theta_{3})$ ends up maximizing $R^{\text{alt}}_{i}(h_{T};\theta_{2}).$ Imagine the ideal case of $R^{\text{ext}}_{-i}(h_{T};\theta_{3})=0$ where $-i$ plays the fixed best action in hindsight $a^{*}$ throughout $h_{T}.$ Then the altruistic regret observed by $i$ is $R^{\text{alt}}_{i}(h_{T};\theta_{2})=\sum^{T}_{t=1}G(\sigma^{*}_{i},\sigma^{*}_{-i};\theta_{2})-G(a^{i}(h_{t}),a^{-i}=a^{*};\theta_{2}).$ Let $G(a^{i},a^{*};\theta_{2})=0$ for all $a^{i}.$ Then the altruistic regret is $\sum^{T}_{t=1}G(\sigma^{*}_{i},\sigma^{*}_{-i};\theta_{2})$ , which is $T$ in the worst case.

4 Upper bound for socially intelligent populations

A key idea behind this work is that against a socially intelligent target population, rather than trying to imitate a member of the population perfectly throughout the entire episode, the AI only needs to imitate them long enough to learn about its partner’s private type. Once it has this information, the AI can leverage the fact that the partner’s strategy is consistent against any strategy, and try to “coerce” the human partner into playing a strategy that minimizes the altruistic regret. We will refer to such strategies as imitate-then-commit (IC) strategies, which use the previous observations $\mathcal{D}$ to learn an imitation strategy to follow over the first $\tilde{T}<T$ steps of the interaction. In this section we provide an upper bound on the altruistic regret of a specific (IC) strategy, as a function of the number of episodes in $\mathcal{D}$ , subject to the following assumptions:

Assumption 4.1.

For $\delta,\epsilon>0$ , and some $\tilde{T}<T$ , we have that

1. $\rho$ is $(\delta,\epsilon,T)$ -consistent.
2. $\rho$ is $(\delta,\epsilon,\tilde{T})$ -compatible.

Imitation learning.

Under an imitate-then-commit strategy, the sample complexity is defined entirely by the number of episodes the AI needs to observe to learn a good $\tilde{T}$ -step imitation policy. Fortunately, imitation learning is a well-studied problem, and we can largely leverage existing complexity bounds. The one caveat is that in this setting we need bounds on the total variation distance between the distribution over the partial history $h_{\tilde{T}}$ under the population strategy $\rho$ , and that under the learned strategy. Given the dataset $\mathcal{D}$ , we define the imitation strategy $\hat{\pi}^{1}_{\tilde{T}}(\mathcal{D})$ such that $\hat{\pi}^{1}_{\tilde{T}}(h;\theta,\mathcal{D})$ is the empirical distribution over agent $1$ ’s actions for each history-type pair $(h,\theta)$ occurring in $\mathcal{D}$ , while $\hat{\pi}^{1}_{\tilde{T}}(h;\theta,\mathcal{D})$ is the uniform distribution over $N$ for $(h,\theta)\notin\mathcal{D}$ . We then define the marginal strategy $\hat{\pi}^{1}_{\tilde{T}}$ , which can be implemented by sampling a dataset $\mathcal{D}$ , and then following the imitation strategy defined by $\mathcal{D}$ for the next $\tilde{T}$ steps. We then have the following bound on the distribution of $h_{\tilde{T}}$ under the imitation strategy:

Lemma 4.2.

Let $p_{\tilde{T}}$ be the distribution over partial histories $h_{\tilde{T}}$ under the population strategy $\rho$ , and let $\hat{p}_{\tilde{T}}$ be their distribution under $\hat{\pi}^{1}_{\tilde{T}}$ . We have that

\|p_{\tilde{T}}-\hat{p}_{\tilde{T}}\|_{\text{TV}}\leq\min\left\{\tilde{T},\frac{N^{2(\tilde{T}+1)}|\Theta|\tilde{T}^{2}\log(K)}{K}\right\},

(9)

where $K=|\mathcal{D}|$.

This bound follows directly from that of [11] via Lemma 1 of [12] (see Appendix B.1 for full proof).
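The count-based imitation strategy described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the dataset encoding (a list of `(theta, episode)` pairs, with each episode a list of `(a1, a2)` joint actions), the function name `fit_imitation_strategy`, and the integer action labels `0..N-1` are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def fit_imitation_strategy(dataset, num_actions, imitation_steps):
    """Empirical imitation of agent 1 over the first `imitation_steps` steps.

    `dataset` is a list of (theta, episode) pairs (hypothetical encoding), where
    `theta` is agent 1's type and `episode` is a list of (a1, a2) joint actions.
    """
    counts = defaultdict(Counter)
    for theta, episode in dataset:
        history = ()
        for a1, a2 in episode[:imitation_steps]:
            # Record agent 1's action at this (history, type) pair.
            counts[(history, theta)][a1] += 1
            history = history + ((a1, a2),)

    def policy(history, theta):
        seen = counts.get((tuple(history), theta))
        if not seen:
            # Unseen (history, type) pair: fall back to the uniform distribution.
            return [1.0 / num_actions] * num_actions
        total = sum(seen.values())
        return [seen[a] / total for a in range(num_actions)]

    return policy
```

The number of distinct history-type pairs grows exponentially in $\tilde{T}$, which is what the $N^{2(\tilde{T}+1)}|\Theta|$ factor in Equation 9 reflects.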

Imitate-then-commit strategy.

For a history $h_{\tilde{T}}\in\mathcal{H}_{\tilde{T}}$, we let $\hat{z}(h_{\tilde{T}})\in\Delta(N\times N)$ denote the empirical joint strategy played up to and including step $\tilde{T}$. We show that, using $\hat{z}(h_{\tilde{T}})$, it is possible to construct a mixture $\nu$ over mixed strategies $x\in\Delta(N)$ such that, in expectation over $\nu$, the partner’s payoff under their best response to $x\sim\nu$ will be at least as large as their payoff under $\hat{z}(h_{\tilde{T}})$. The corresponding IC strategy will operate as follows:

1.

Sample $\mathcal{D}$ and compute the imitation strategy $\hat{\pi}^{1}_{\tilde{T}}(\mathcal{D})$ .
2.

Play $\hat{\pi}^{1}_{\tilde{T}}(\mathcal{D})$ for the first $\tilde{T}$ steps, and observe $h_{\tilde{T}}$ .
3.

Compute a suitable mixture $\nu$ from $\hat{z}(h_{\tilde{T}})$, and sample $x\sim\nu$.
4.

Sample actions from $x$ for the remaining $T-\tilde{T}$ steps.
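The four steps above can be sketched as a simple control loop. The sketch below is illustrative only and makes several assumptions not found in the text: step 1 is assumed to have been done by the caller (who passes in a fitted `imitation_policy`), the partner is modeled as a hypothetical callable `partner_policy` mapping a history to an action, and the mixture construction is passed in as `build_mixture`, which returns a list of `(probability, x)` pairs.

```python
import random


def imitate_then_commit(imitation_policy, partner_policy, build_mixture,
                        num_actions, imitation_steps, horizon, rng=None):
    """Play an imitate-then-commit episode and return (history, committed x)."""
    rng = rng or random.Random()
    actions = list(range(num_actions))
    history = []

    # Step 2: follow the learned imitation strategy for the first T~ steps.
    for _ in range(imitation_steps):
        a1 = rng.choices(actions, weights=imitation_policy(tuple(history)))[0]
        a2 = partner_policy(tuple(history))
        history.append((a1, a2))

    # Empirical joint strategy z_hat[i][j]: frequency of (AI plays i, partner plays j).
    z_hat = [[0.0] * num_actions for _ in actions]
    for a1, a2 in history:
        z_hat[a1][a2] += 1.0 / imitation_steps

    # Step 3: build the mixture nu from z_hat and sample one mixed strategy x.
    mixture = build_mixture(z_hat)
    x = rng.choices([s for _, s in mixture], weights=[p for p, _ in mixture])[0]

    # Step 4: commit to x for the remaining T - T~ steps.
    for _ in range(imitation_steps, horizon):
        a1 = rng.choices(actions, weights=x)[0]
        a2 = partner_policy(tuple(history))
        history.append((a1, a2))

    return history, x
```

The key design point is that the AI's behaviour after step $\tilde{T}$ no longer depends on the partner's actions, which is what lets a consistent partner settle on an approximate best response.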

We then have the following upper bound on the altruistic regret achievable with an imitate-then-commit strategy:

Theorem 4.3.

Given that Assumption 4.1 holds for $\rho$ , there exists a data-dependent strategy $\pi^{\text{IC}}(\mathcal{D})$ such that when played by the AI as agent $2$ , the altruistic regret satisfies

\text{E}\left[R^{\text{alt}}_{1}(h_{T},\theta_{2})\right]\leq 2\delta+\delta(K)+\left(2\frac{T-\tilde{T}}{T}+1\right)\epsilon,

(10)

where $K=|\mathcal{D}|$ and $\delta(K)$ is defined as

\delta(K)=\min\left\{\tilde{T},\frac{N^{2(\tilde{T}+1)}|\Theta|\tilde{T}^{2}\log(K)}{K}\right\}

(11)

and where the expectation is taken over $h_{T}$ , $\boldsymbol{\theta}$ , and $\mathcal{D}$ .
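To get a feel for how the bound in Equation 10 scales, it can be evaluated numerically. The helper below is a hypothetical convenience, not part of the paper; it assumes the logarithm in Equation 11 is the natural logarithm (the text does not state the base) and that $K\geq 1$.

```python
import math


def imitation_error(num_actions, num_types, imitation_steps, num_episodes):
    """delta(K) from Equation 11, with log taken as the natural log (assumption)."""
    rate = (num_actions ** (2 * (imitation_steps + 1)) * num_types
            * imitation_steps ** 2 * math.log(num_episodes) / num_episodes)
    return min(imitation_steps, rate)


def altruistic_regret_bound(delta, epsilon, num_actions, num_types,
                            imitation_steps, horizon, num_episodes):
    """Right-hand side of Equation 10."""
    commit_fraction = (horizon - imitation_steps) / horizon
    return (2 * delta
            + imitation_error(num_actions, num_types, imitation_steps, num_episodes)
            + (2 * commit_fraction + 1) * epsilon)
```

Because of the $N^{2(\tilde{T}+1)}$ factor, the data-dependent term only drops below the trivial value $\tilde{T}$ once $K$ is large relative to the number of length-$\tilde{T}$ histories, so keeping $\tilde{T}$ small matters far more than the total horizon $T$.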

Proof sketch:

By Lemma 4.2, we can learn an imitation strategy such that the corresponding distribution over $h_{\tilde{T}}$ and $\hat{z}(h_{\tilde{T}})$ is close to that under $\rho$ in self-play. As $\rho$ is compatible, both agents’ payoffs under $\hat{z}(h_{\tilde{T}})$ must be close to those under some PONE. Finally, we can construct a mixture $\nu$ for agent $1$ such that agent $2$ ’s payoffs under its (approximate) best-response are almost as large as those under $\hat{z}(h_{\tilde{T}})$ (see Appendix B.2).

5 Related Work

Our work is closely related to the previous targeted learning model [1, 13, 14], which defines similar compatibility and consistency criteria. The notion of targeted optimality [15] includes learning, with high probability and in a tractable number of steps, an approximately best response against a population of memory-bounded adaptive agents. The main difference with our work is that targeted learning only requires consistency against a specific target class of partners, which generally would not include the agent itself, or other adaptive agents. We require socially intelligent agents to be consistent against all possible partner strategies. We also require that cooperation and consistent learning occur over a fixed time horizon $T$, rather than asymptotically. These differences mean that a hypothetical “universally cooperative” agent might be able to leverage the consistency of its partner to achieve cooperation without a prearranged convention. Socially intelligent agents can be modeled as individually rational learners [16] that achieve Pareto-efficient joint behavior. Our research builds on this work by considering a learning setting where the agent, when paired with any member of the population, will with high probability achieve at least the same utility as under some Pareto-efficient equilibrium strategy.

The problem of training agents to cooperate with previously unseen partners is sometimes referred to as ad hoc teamwork [17, 18] or zero-shot coordination [19], especially in the context of multiagent reinforcement learning. Many approaches in reinforcement learning train cooperative policies that are robust to the possible strategies that a human or an AI agent can follow [20]. Many of these methods build a “population” of partner strategies and maximize the diversity of this population in order to train the AI’s policy against it [21, 22]. Other approaches assume that there is no prior coordination between the agents [19], and learn rational joint strategies while estimating the agents’ mutual uncertainty about one another’s strategies [23]. For ad hoc coordination among AI agents, the “other-play” algorithm [19] finds such a strategy as a solution to the corresponding label-free coordination problem [23]. Another possible approach is self-play [24], where agents optimize themselves by playing against past iterations of themselves in order to estimate the strategies of unseen partners. However, self-play can learn cooperative strategies that “over-fit” [25] to one another within the population of agents. A key goal of research on ad hoc teamwork and zero-shot coordination has been to avoid this type of overfitting [26]. Our problem domain is closely related to both ad hoc teamwork and zero-shot coordination, since we consider training an agent to cooperate with previously unseen partners, and assume no control over the partner. Although population-based training approaches to ad hoc teamwork are common, they focus on fully cooperative environments such as Dec-POMDPs, where the main issue is creating a diverse enough population to train with [27]. In contrast, we consider partners that are self-interested, and do not assume identical payoffs.

Finally, in the case of Hannan-consistent partners, our problem setting is closely related to strategizing against and learning to manipulate no-regret learners [28, 9]. This line of work studies whether an optimizer agent can achieve a better payoff than under a CCE against no-regret learners by learning to enforce a Stackelberg equilibrium on them. Its emphasis is on online learning and the optimizer’s payoff, while we focus on the offline setting and cooperation.

6 Conclusion

We provide formal guarantees for successful and reliable cooperation of AI agents with populations of socially intelligent rational agents. These guarantees rest on the assumptions that 1) agents in the population are individually rational, and 2) agents in the population, when cooperating with another agent from the same population, achieve at least the same utility that they would under some Pareto-efficient equilibrium strategy. We formalize the notions of consistency and cooperative compatibility for agents in two-player, general-sum, finitely-repeated bimatrix games with private types. Our theoretical guarantees apply to the offline cooperation setting, in which the agent must cooperate with unseen partners from the population by strategizing against, and manipulating, their no-regret policies; for this setting we formalize the idea of altruistic regret. We prove that these assumptions on their own are insufficient to ensure zero-shot cooperation with members of the socially intelligent target population. We provide upper bounds on the sample complexity needed to learn a successful cooperation strategy, along with lower bounds on when the multi-agent cooperation setting is needed with respect to the population’s trajectories, the state space, and the length of the learning episodes. The bounds in these settings, where the agent either actively queries the MDP without knowing the transition dynamics of the population or observes the population’s transition dynamics, are much stronger than the bounds that can be derived by naively reducing the cooperation problem to one of reinforcement learning. These complexity analyses and formally proven bounds can be helpful for sustainably modeling the alignment problem of AI agents.

References

Powers and Shoham [2004] Rob Powers and Yoav Shoham. New criteria and a new algorithm for learning in multi-agent systems. Advances in Neural Information Processing Systems, 17, 2004.
Shoham and Leyton-Brown [2008] Yoav Shoham and Kevin Leyton-Brown. Multiagent systems: Algorithmic, game-theoretic, and logical foundations. Cambridge University Press, 2008.
Hannan [1957] James Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3(2):97–140, 1957.
Cesa-Bianchi and Lugosi [2006] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
Mas-Colell et al. [1995] Andreu Mas-Colell, Michael Dennis Whinston, Jerry R Green, et al. Microeconomic theory, volume 1. Oxford University Press, New York, 1995.
Freund and Schapire [1999] Yoav Freund and Robert E Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79–103, 1999.
Aumann [1964] Robert J. Aumann. Mixed and Behavior Strategies in Infinite Extensive Games, pages 627–650. Princeton University Press, Princeton, 1964. ISBN 9781400882014. doi: 10.1515/9781400882014-029. URL https://doi.org/10.1515/9781400882014-029.
Monnot and Piliouras [2017] Barnabé Monnot and Georgios Piliouras. Limits and limitations of no-regret learning in games. The Knowledge Engineering Review, 32:e21, 2017.
Brown et al. [2024] William Brown, Jon Schneider, and Kiran Vodrahalli. Is learning in games good for the learners? Advances in Neural Information Processing Systems, 36, 2024.
Loftin and Oliehoek [2022] Robert Loftin and Frans A Oliehoek. On the impossibility of learning to cooperate with adaptive partner strategies in repeated games. In International Conference on Machine Learning, pages 14197–14209. PMLR, 2022.
Rajaraman et al. [2020] Nived Rajaraman, Lin Yang, Jiantao Jiao, and Kannan Ramchandran. Toward the fundamental limits of imitation learning. Advances in Neural Information Processing Systems, 33:2914–2924, 2020.
Ciosek [2022] Kamil Ciosek. Imitation learning by reinforcement learning. In International Conference on Learning Representations, 2022.
Powers and Shoham [2005] Rob Powers and Yoav Shoham. Learning against opponents with bounded memory. In The Nineteenth International Joint Conference on Artificial Intelligence, pages 817–822, 2005.
Chakraborty and Stone [2010a] Doran Chakraborty and Peter Stone. Convergence, targeted optimality and safety in multiagent learning. In Proceedings of the Twenty-seventh International Conference on Machine Learning (ICML 2010), June 2010a. URL http://www.cs.utexas.edu/users/ai-lab?chakraborty:icml10.
Chakraborty and Stone [2010b] Doran Chakraborty and Peter Stone. Convergence, targeted optimality, and safety in multiagent learning. In Proceedings of the 27th International Conference on Machine Learning, ICML’10, pages 191–198, Madison, WI, USA, 2010b. Omnipress. ISBN 9781605589077.
Loftin et al. [2023] Robert Loftin, Mustafa Mert Çelikok, and Frans A. Oliehoek. Towards a unifying model of rationality in multiagent systems, 2023.
Stone et al. [2010] Peter Stone, Gal Kaminka, Sarit Kraus, and Jeffrey Rosenschein. Ad hoc autonomous agent teams: Collaboration without pre-coordination. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 24, pages 1504–1509, 2010.
Mirsky et al. [2022] Reuth Mirsky, Ignacio Carlucho, Arrasy Rahman, Elliot Fosong, William Macke, Mohan Sridharan, Peter Stone, and Stefano V Albrecht. A survey of ad hoc teamwork research. In European Conference on Multi-Agent Systems, pages 275–293. Springer, 2022.
Hu et al. [2020] Hengyuan Hu, Adam Lerer, Alex Peysakhovich, and Jakob Foerster. “Other-play” for zero-shot coordination. In International Conference on Machine Learning, pages 4399–4410. PMLR, 2020.
Carroll et al. [2019] Micah Carroll, Rohin Shah, Mark K Ho, Tom Griffiths, Sanjit Seshia, Pieter Abbeel, and Anca Dragan. On the utility of learning about humans for human-AI coordination. Advances in Neural Information Processing Systems, 32:5174–5185, 2019.
Strouse et al. [2021a] DJ Strouse, Kevin McKee, Matt Botvinick, Edward Hughes, and Richard Everett. Collaborating with humans without human data. Advances in Neural Information Processing Systems, 34:14502–14515, 2021a.
Cui et al. [2023] Brandon Cui, Andrei Lupu, Samuel Sokota, Hengyuan Hu, David J Wu, and Jakob Nicolaus Foerster. Adversarial diversity in Hanabi. In The Eleventh International Conference on Learning Representations, 2023.
Treutlein et al. [2021] Johannes Treutlein, Michael Dennis, Caspar Oesterheld, and Jakob Foerster. A new formalism, method and open issues for zero-shot coordination. In International Conference on Machine Learning, pages 10413–10423. PMLR, 2021.
Zand et al. [2022] Jaleh Zand, Jack Parker-Holder, and Stephen J. Roberts. On-the-fly strategy adaptation for ad-hoc agent coordination. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, AAMAS ’22, page 1771–1773, Richland, SC, 2022. International Foundation for Autonomous Agents and Multiagent Systems. ISBN 9781450392136.
Strouse et al. [2021b] DJ Strouse, Kevin McKee, Matt Botvinick, Edward Hughes, and Richard Everett. Collaborating with humans without human data. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 14502–14515. Curran Associates, Inc., 2021b. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/797134c3e42371bb4979a462eb2f042a-Paper.pdf.
Cui et al. [2021] Brandon Cui, Hengyuan Hu, Luis Pineda, and Jakob Foerster. K-level reasoning for zero-shot coordination in hanabi. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 8215–8228. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/4547dff5fd7604f18c8ee32cf3da41d7-Paper.pdf.
Rahman et al. [2024] Muhammad Rahman, Jiaxun Cui, and Peter Stone. Minimum coverage sets for training robust ad hoc teamwork agents. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17523–17530, 2024.
Deng et al. [2019] Yuan Deng, Jon Schneider, and Balasubramanian Sivan. Strategizing against no-regret learners. Advances in Neural Information Processing Systems, 32, 2019.
Hoeffding [1963] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
von Stengel and Zamir [2004] Bernhard von Stengel and Shmuel Zamir. Leadership with commitment to mixed strategies. 2004.

Acknowledgments and Disclosure of Funding

This work has been supported by the Hybrid Intelligence Center, https://hybrid-intelligence-centre.nl, grant number 024.004.022.

Appendix A Proofs for Section 2

A.1 Proof of Lemma 2.4

Here the joint type $\boldsymbol{\theta}$ will be implicit. For $i\in\{1,2\}$ , we define $V^{i}_{t}$ as

V^{i}_{t}=G_{i}(s^{i}_{t},s^{-i}_{t})-G_{i}(s^{i}_{t},a^{-i}_{t})

(12)

We can see that $\text{E}[V^{i}_{t}|h_{t-1}]=0$. We then have that

\begin{align}
\bar{R}^{\text{ext}}_{t}&=\max_{a\in N}\sum^{t}_{r=1}\left\{G_{i}(a,s^{-i}_{r})-G_{i}(s^{i}_{r},a^{-i}_{r})\right\}\tag{13}\\
&=\max_{a\in N}\sum^{t}_{r=1}\left\{G_{i}(a,s^{-i}_{r})-G_{i}(s^{i}_{r},s^{-i}_{r})+G_{i}(s^{i}_{r},s^{-i}_{r})-G_{i}(s^{i}_{r},a^{-i}_{r})\right\}\tag{14}\\
&=\sum^{t}_{r=1}\left\{G_{i}(s^{i}_{r},s^{-i}_{r})-G_{i}(s^{i}_{r},a^{-i}_{r})\right\}=\sum^{t}_{r=1}V^{i}_{r}\tag{15}\\
&\leq\sqrt{\frac{2}{T}\ln\frac{1}{\delta}}\tag{16}
\end{align}

with probability $1-\delta$ for all $t\leq T$ simultaneously.

This follows from the fact that $|V^{i}_{t}|\in[0,1]$ and the “maximal” Azuma-Hoeffding inequality [29]. The second equality follows from the fact that $\langle s^{i}_{t},s^{-i}_{t}\rangle=s(\theta)$ is a Nash equilibrium. The first bound of Lemma 2.4 follows from a union bound over the probability for both players, while the second bound combines this with Equation 3. $\square$

A.2 Proof of Theorem 2.5

Theorem A.1 (2.6).

Proof. By the definition of $\epsilon_{1}$, $\pi^{T,\epsilon_{1}}$ will only deviate when playing with itself if at some point $k<t\leq T$ one player incurs an expected external regret of at least $\epsilon_{0}$, and by Lemma 2.4 that will occur with probability at most $\delta$. Therefore, $\pi^{T,\epsilon_{1}}$ is $(\delta,\epsilon_{0},T)$-compatible. We also have that the total expected external regret of the MW agent $\pi^{\text{mw},T}$ is at most $\sqrt{(T/2)\ln N}$. This means that if $\pi^{T,\epsilon_{1}}$ switches at stage $t$, then the maximum possible expected external regret incurred by $\pi^{T,\epsilon_{1}}$ will be less than $\bar{R}^{\text{ext}}_{i}(h_{t};\theta)+\sqrt{\frac{T}{2}\ln N}$. Since $\pi^{\text{mw},T}$ will always switch just before this point is reached, its total expected regret will be less than $\epsilon_{1}$ surely, and will be less than $\epsilon$ with probability $1-\delta$. As $\epsilon\geq\epsilon_{0}$, we have that $\pi^{T,\epsilon_{1}}$ is $(\delta,\epsilon,T)$-socially intelligent.
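The $\sqrt{(T/2)\ln N}$ regret bound used above for the multiplicative-weights agent can be checked numerically. The sketch below uses the standard Hedge update on losses in $[0,1]$ (a loss can be taken as one minus the payoff) with learning rate $\eta=\sqrt{8\ln N/T}$; that learning rate is an assumption chosen because it yields exactly this bound in the standard analysis (see [4]), and the function names are hypothetical.

```python
import math


def hedge_regret(losses):
    """Run Hedge on a T x N table of losses in [0, 1].

    Returns the total expected external regret: the learner's cumulative
    expected loss minus the cumulative loss of the best fixed action.
    """
    horizon, num_actions = len(losses), len(losses[0])
    eta = math.sqrt(8.0 * math.log(num_actions) / horizon)
    cumulative = [0.0] * num_actions
    expected_loss = 0.0
    for loss in losses:
        # Exponential weights; subtracting the minimum keeps the exponents stable.
        base = min(cumulative)
        weights = [math.exp(-eta * (c - base)) for c in cumulative]
        total = sum(weights)
        probs = [w / total for w in weights]
        expected_loss += sum(p * l for p, l in zip(probs, loss))
        cumulative = [c + l for c, l in zip(cumulative, loss)]
    return expected_loss - min(cumulative)


def regret_bound(horizon, num_actions):
    """The sqrt((T/2) ln N) bound on total expected external regret."""
    return math.sqrt(horizon / 2.0 * math.log(num_actions))
```

For any loss table with entries in $[0,1]$, `hedge_regret(losses)` should not exceed `regret_bound(T, N)`, regardless of how the partner generated the losses.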

Appendix B Proofs for Section 4

B.1 Proof of Lemma 4.2

We first apply Theorem 4.4 of [11], which states that, for episodic imitation learning over $H$ -step trajectories, for any expert policy $\pi^{*}$ we have

J(\pi^{*})-\text{E}_{\mathcal{D}}\left[J(\hat{\pi}^{1}_{\tilde{T}}(h;\theta,\mathcal{D}))\right]\leq\min\left\{H,\frac{|S|H^{2}\log(K)}{K}\right\},

(17)

where $S$ is the state space, with per-step rewards bounded in $[0,1]$. We can model the interaction with $\rho$ as a $\tilde{T}$-step episodic MDP/R whose states are history-type pairs, so that $S=\mathcal{H}_{\leq\tilde{T}}\times\Theta$. Plugging in $H=\tilde{T}$, $|S|<N^{2(\tilde{T}+1)}|\Theta|$, and $\pi^{*}=\rho$ gives us

J(\rho)-\text{E}_{\mathcal{D}}\left[J(\hat{\pi}^{1}_{\tilde{T}}(h;\theta,\mathcal{D}))\right]\leq\min\left\{\tilde{T},\frac{N^{2(\tilde{T}+1)}|\Theta|\tilde{T}^{2}\log(K)}{K}\right\}.

(18)

This bound holds simultaneously for all possible reward functions bounded in $[0,1]$ . If we restrict the reward function $r$ to be non-zero only for the terminal states $\mathcal{H}_{\tilde{T}}$ , we have

J(\pi^{*})-\text{E}_{\mathcal{D}}\left[J(\hat{\pi}^{1}_{\tilde{T}}(h;\theta,\mathcal{D}))\right]=\text{E}_{p_{\tilde{T}}}[r(h_{\tilde{T}})]-\text{E}_{\hat{p}_{\tilde{T}}}[r(h_{\tilde{T}})],

(19)

using the definition of the marginal strategy $\hat{\pi}^{1}_{\tilde{T}}$ . Finally, applying Lemma 1 of [12] gives us

\|p_{\tilde{T}}-\hat{p}_{\tilde{T}}\|_{\text{TV}}\leq\min\left\{\tilde{T},\frac{N^{2(\tilde{T}+1)}|\Theta|\tilde{T}^{2}\log(K)}{K}\right\},

(20)

the desired result. ∎

B.2 Proof of Theorem 4.3

First, let $\tau^{2}(\boldsymbol{\theta})$ , defined as

\tau^{2}(\boldsymbol{\theta})=\min_{\langle\sigma^{1},\sigma^{2}\rangle\in\mathcal{P}(G(\boldsymbol{\theta}))}G(\sigma^{2},\sigma^{1};\theta^{2}),

(21)

denote agent $2$’s lowest payoff over all PONE of the game parameterized by joint type $\boldsymbol{\theta}$. Let $\mathcal{C}$ denote the event that

\tau^{2}(\boldsymbol{\theta})-\frac{1}{\tilde{T}}\sum^{\tilde{T}}_{t=1}G(a^{2}_{t},a^{1}_{t};\theta_{2})\leq\epsilon

(22)

Because $\rho$ is $(\delta,\epsilon,\tilde{T})$-compatible, we have that $\text{Pr}_{\rho}\{\mathcal{C}\}\geq 1-\delta$. For $\delta(K)$ defined as

\delta(K)=\min\left\{\tilde{T},\frac{N^{2(\tilde{T}+1)}|\Theta|\tilde{T}^{2}\log(K)}{K}\right\},

(23)

Lemma 4.2 also gives us $\text{Pr}_{\hat{\pi}^{1},\rho}\{\mathcal{C}\}\geq 1-\delta-\delta(K)$. We therefore have that

\begin{align}
\text{E}_{\hat{\pi}^{1},\rho}\left[\sum^{T}_{t=1}G(a^{2}_{t},a^{1}_{t};\theta_{2})\right]&\geq\text{E}_{\hat{\pi}^{1},\rho}\left[\sum^{T}_{t=1}G(a^{2}_{t},a^{1}_{t};\theta_{2})\mid\mathcal{C}\right]-T(\delta+\delta(K))\tag{24}\\
&=\text{E}_{\hat{\pi}^{1},\rho}\left[\sum^{\tilde{T}}_{t=1}G(a^{2}_{t},a^{1}_{t};\theta_{2})\mid\mathcal{C}\right]+\text{E}_{\hat{\pi}^{1},\rho}\left[\sum^{T}_{t=\tilde{T}+1}G(a^{2}_{t},a^{1}_{t};\theta_{2})\mid\mathcal{C}\right]-T(\delta+\delta(K))\tag{25}\\
&\geq\text{E}_{\hat{\pi}^{1},\rho}\left[\sum^{T}_{t=\tilde{T}+1}G(a^{2}_{t},a^{1}_{t};\theta_{2})\mid\mathcal{C}\right]+\tilde{T}\left(\tau^{2}(\boldsymbol{\theta})-\epsilon\right)-T(\delta+\delta(K))\tag{26}
\end{align}

We therefore need to lower-bound the term

\text{E}_{\hat{\pi}^{1},\rho}\left[\sum^{T}_{t=\tilde{T}+1}G(a^{2}_{t},a^{1}_{t};\theta_{2})\mid\mathcal{C}\right]

(27)

This will be the expected payoff given the strategy $x\sim\nu$ the AI commits to for the remaining $T-\tilde{T}$ steps. The idea now is that we can construct a mixture $\nu$ over strategies that the AI can commit to for the remaining $T-\tilde{T}$ steps such that the partner’s payoff under their (approximate) best response will be nearly as good as that under $\hat{z}(h_{\tilde{T}})$.

Let $G(z;\theta^{2})=\sum_{i\in N}\sum_{j\in N}z_{ij}G(j,i;\theta^{2})$ be agent 2’s expected payoff under $z$. For any joint strategy $z$, we can construct $\nu$ such that if the AI commits to strategies sampled from $\nu$, the partner will have the same information about the AI’s likely actions as they would given their “recommended” action under $\hat{z}(h_{\tilde{T}})$. We build on the construction used by von Stengel and Zamir [30]. For any joint strategy $z$, we let $z_{j}=\sum_{i\in N}z_{ij}$ denote the marginal probability that the column player (agent 2) plays $j$ under $z$. For all $j\in N$ such that $z_{j}>0$, we define $x_{j}$ as the conditional distribution over the row player’s (agent 1’s) actions given that the column player plays $j$, such that $x_{j}(i)=\frac{z_{ij}}{z_{j}}$. We then define $\nu$ as the strategy that commits to each $x_{j}$ with probability $z_{j}$.
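As a small worked illustration of this construction, consider a hypothetical two-action joint strategy (the numbers are chosen only for the example, with rows indexing agent 1's action $i$ and columns indexing agent 2's action $j$):

```latex
z=\begin{pmatrix}0.4&0.1\\0.1&0.4\end{pmatrix},\qquad
z_{1}=0.4+0.1=0.5,\qquad z_{2}=0.1+0.4=0.5,
\\[4pt]
x_{1}=\left(\tfrac{0.4}{0.5},\tfrac{0.1}{0.5}\right)=(0.8,\,0.2),\qquad
x_{2}=\left(\tfrac{0.1}{0.5},\tfrac{0.4}{0.5}\right)=(0.2,\,0.8),
\\[4pt]
\nu(x_{1})=z_{1}=0.5,\qquad \nu(x_{2})=z_{2}=0.5 .
```

Here the AI commits to $x_{1}$ or $x_{2}$ with equal probability, and each committed strategy reveals to the partner exactly the conditional information that the corresponding recommendation $j$ would have carried under $z$.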

We can show that, when the partner plays a best-response to $x\sim\nu$ , their payoff will be no worse than under $z$ itself. We first construct a response function $r_{z}$ such that when agent 2 responds to $x\sim\nu$ with $r_{z}(x)$ , its expected payoff equals $G(z;\theta^{2})$ . Let $S=\{j\in N:z_{j}>0\}$ , and partition $S$ into $\mathcal{P}$ such that, for each $P\in\mathcal{P}$ , we have $x_{j}=x_{l}$ for all $j,l\in P$ . For each $P\in\mathcal{P}$ , we then define the strategy $y_{P}$ such that

y_{P}(j)=\frac{z_{j}}{\sum_{l\in P}z_{l}}

(28)

for each $j\in P$, with $y_{P}(j)=0$ for $j\notin P$. (Note that if $z$ corresponds to some uncorrelated strategy $\langle x,y\rangle$, then $P=N$ and $y_{P}=y$.) Finally, for $j\in S$, we define $P(j)$ as the element of the partition containing $j$, and define $r_{z}$ such that $r_{z}(x_{j})=y_{P(j)}$. We leave $r_{z}$ undefined for $x$ where $\nu(x)=0$. Now let $x_{P}$ be the common conditional strategy for all $j\in P$, and let $z_{P}=\sum_{j\in P}z_{j}$. We then have that

\begin{align}
\text{E}_{x\sim\nu}G(r_{z}(x),x;\theta^{2})&=\sum_{j\in S}z_{j}\left[x^{\top}_{j}G(\theta^{2})^{\top}r_{z}(x_{j})\right]\tag{29}\\
&=\sum_{P\in\mathcal{P}}z_{P}\left[x^{\top}_{P}G(\theta^{2})^{\top}y_{P}\right]\tag{30}\\
&=\sum_{P\in\mathcal{P}}z_{P}\left(\sum_{i\in N}\sum_{j\in N}x_{P}(i)y_{P}(j)G(\theta^{2})^{\top}_{ij}\right)\tag{31}\\
&=\sum_{P\in\mathcal{P}}z_{P}\left(\sum_{i\in N}\sum_{j\in N}\text{Pr}_{z}\{i\mid P\}\text{Pr}_{z}\{j\mid P\}G(\theta^{2})^{\top}_{ij}\right)\tag{32}\\
&=\sum_{P\in\mathcal{P}}z_{P}\left(\sum_{i\in N}\sum_{j\in N}\text{Pr}_{z}\{i,j\mid P\}G(\theta^{2})^{\top}_{ij}\right)\tag{33}\\
&=\sum_{i\in N}\sum_{j\in N}z_{ij}G(\theta^{2})^{\top}_{ij}=G(z;\theta^{2})\tag{34}
\end{align}

where we have used the fact that $i$ and $j$ are independent given that $j\in P$. Next, for any best-response function $r^{*}$, we have

\begin{split}G(z;\theta^{2})&=\text{E}_{x\sim\nu}G(r_{z}(x),x;\theta^{2})\\ &=\text{E}_{x\sim\nu}[x^{\top}G(\theta^{2})^{\top}r_{z}(x)]\\ &\leq\text{E}_{x\sim\nu}[\max_{y\in\Delta(N)}x^{\top}G(\theta^{2})^{\top}y]\\ &=\text{E}_{x\sim\nu}[x^{\top}G(\theta^{2})^{\top}r^{*}(x)]\\ &=\text{E}_{x\sim\nu}G(r^{*}(x),x;\theta^{2})\end{split}

(35)

Therefore, so long as the partner plays a best response to the AI’s chosen strategy, they will achieve at least the same payoff (in expectation) as they would under the strategy $z$ from which $\nu$ was computed. Note, however, that $\rho$ will be (approximately) consistent over the full $T$ steps, not just the last $T-\tilde{T}$. Define $\alpha=\frac{\tilde{T}}{T}$ and $\beta=\frac{T-\tilde{T}}{T}$, and let $z^{1}$ be agent 1’s marginal strategy under $z$. With probability $1-\delta$, $\rho$ will play an $\epsilon$-best-response to the mixture $\alpha\hat{z}(h_{\tilde{T}})^{1}+\beta x$, with $x\sim\nu$.
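The inequality in Equation 35 is easy to check numerically on small games. The following sketch is an illustration under assumed conventions, not the authors' code: `z[i][j]` is the probability that agent 1 plays `i` and agent 2 plays `j`, `payoff2[j][i]` is agent 2's payoff $G(j,i;\theta^{2})$, and all function names are hypothetical.

```python
def commitment_mixture(z):
    """Return nu as a list of (z_j, x_j) pairs for every column j with z_j > 0."""
    num_rows, num_cols = len(z), len(z[0])
    mixture = []
    for j in range(num_cols):
        z_j = sum(z[i][j] for i in range(num_rows))
        if z_j > 0:
            # x_j: agent 1's conditional action distribution given agent 2 plays j.
            mixture.append((z_j, [z[i][j] / z_j for i in range(num_rows)]))
    return mixture


def payoff_under_z(z, payoff2):
    """Agent 2's expected payoff G(z; theta^2) under the joint strategy z."""
    return sum(z[i][j] * payoff2[j][i]
               for i in range(len(z)) for j in range(len(z[0])))


def best_response_payoff(z, payoff2):
    """Agent 2's expected payoff when best-responding to x sampled from nu."""
    total = 0.0
    for weight, x in commitment_mixture(z):
        # A best response can be taken to be a pure action.
        total += weight * max(sum(x[i] * row[i] for i in range(len(x)))
                              for row in payoff2)
    return total
```

For any joint strategy `z`, `best_response_payoff(z, payoff2)` should be at least `payoff_under_z(z, payoff2)`, with equality when every recommended action under `z` is already a best response to its conditional distribution $x_{j}$.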

Let $\mathcal{C}^{\prime}$ be the event that $\rho$ is $\epsilon$ -consistent over $T$ steps. We then have that

\begin{align}
\text{E}_{\hat{\pi}^{1},\rho}\left[\sum^{T}_{t=\tilde{T}+1}G(a^{2}_{t},a^{1}_{t};\theta_{2})\mid\mathcal{C}\right]&\geq\text{E}_{\hat{\pi}^{1},\rho}\left[\sum^{T}_{t=\tilde{T}+1}G(a^{2}_{t},a^{1}_{t};\theta_{2})\mid\mathcal{C},\mathcal{C}^{\prime}\right]-T\delta\tag{36}\\
&\geq(T-\tilde{T})\left(\tau^{2}(\boldsymbol{\theta})-2\epsilon\right)-T\delta\tag{37}
\end{align}

Finally, dividing by $T$ and subtracting from $\tau^{2}(\boldsymbol{\theta})$ , we get

\text{E}\left[R^{\text{alt}}_{1}(h_{T},\theta_{2})\right]\leq 2\delta+\delta(K)+\left(2\frac{T-\tilde{T}}{T}+1\right)\epsilon

(38)

the desired result.

∎