On the Complexity of Learning to Cooperate with Populations of Socially Rational Agents

Robert Loftin
Department of Computer Science
University of Sheffield
Sheffield, S10 2TN, UK
[email protected]
Saptarashmi Bandyopadhyay
Department of Computer Science
University of Maryland
College Park, MD 20742, USA
[email protected]
Mustafa Mert Çelikok
Department of Intelligent Systems
Delft University of Technology
Delft, 2600 AA, The Netherlands
[email protected]
Abstract

Artificially intelligent agents deployed in the real world will require the ability to reliably cooperate with humans (as well as other, heterogeneous AI agents). To provide formal guarantees of successful cooperation, we must make some assumptions about how partner agents could plausibly behave. Any realistic set of assumptions must account for the fact that other agents may be just as adaptable as our agent is. In this work, we consider the problem of cooperating with a population of agents in a finitely-repeated, two-player, general-sum matrix game with private utilities. Two natural assumptions in such settings are that: 1) all agents in the population are individually rational learners, and 2) when any two members of the population are paired together, with high probability they will achieve at least the same utility as they would under some Pareto-efficient equilibrium strategy. Our results first show that these assumptions alone are insufficient to ensure zero-shot cooperation with members of the target population. We therefore consider the problem of learning a strategy for cooperating with such a population using prior observations of its members interacting with one another. We provide upper and lower bounds on the number of samples needed to learn an effective cooperation strategy. Most importantly, we show that these bounds can be much stronger than those arising from a “naive” reduction of the problem to one of imitation learning.

1 Introduction

In this work, we address the problem of learning to cooperate with a socially intelligent population of agents from observations of interactions between members of this population. We study cooperation in finitely-repeated, two-player, general-sum matrix games with private payoffs. We say that a population of adaptive agents is socially intelligent if its members are (1) individually Hannan-consistent and (2) compatible in the sense that any pair of agents will perform nearly as well as some Pareto-optimal Nash equilibrium of the matrix game. We argue that this model of cooperation is more realistic than those that assume identical payoffs or public utilities. In real-world applications it is unlikely that independent agents will have identical utilities, or that they will provide complete information about their preferences or future behaviour to others. In the case of AI–AI cooperation, agents developed by different companies will not have access to each other’s source code, while in the case of human–AI cooperation, having the human fully describe their preferences or behaviour in advance may be infeasible. Therefore, the question we address in this work is: Can we learn to cooperate with a socially intelligent population of agents by observing its members cooperate with each other? We answer this question by providing upper and lower bounds on the sample complexity of learning good cooperation strategies.

If we make no assumptions about the target population, we can do little more than attempt to mimic observed behavior as closely as possible, reducing the problem to one of imitation learning. Unfortunately, the strategies of adaptive agents may depend on the full history of interaction, and so the sample complexity of imitation learning will grow exponentially in the length of the repeated game. Our main contribution is an upper-bound showing that, for partners drawn from a socially intelligent (consistent and compatible) population, we can learn to cooperate with far fewer samples than would be required by a pure imitation learning approach.

This result utilizes a class of what we refer to as imitate-then-commit strategies, which leverage the fact that the population is socially intelligent to achieve cooperation without perfect imitation. The key idea is that our agent only needs to learn to imitate a member of the target population long enough for the average strategy to approximate a Pareto-efficient solution. Once such a strategy is identified, our agent can switch to a coercive strategy such that any Hannan-consistent partner will either continue to adhere to the current joint strategy, or else switch to a superior strategy, with either case corresponding to “successful” cooperation.

In Section 2 we formalize our repeated game setting and provide background on external regret and Hannan-consistency. We also propose a definition of cooperative compatibility (Definition 2.2) that is closely related to the notion of compatibility used in [1]. In Section 2.3, we provide our novel definition of social intelligence, and describe a realistic class of agents that satisfy it. In Section 3 we formalize our learning problem as that of trying to minimize altruistic regret, which we argue is the most natural measure of successful cooperation in this setting. We also give lower bounds on its sample complexity under different sets of assumptions. Finally, in Section 4 we present an upper-bound on the number of samples needed to learn strategies that achieve small altruistic regret.

2 Preliminaries

Repeated bi-matrix games with private types.

Let $i\in\{1,2\}$ denote the agent index. We assume both agents have $N$ pure strategies (henceforth "actions"). Let $\Theta$ denote the finite type space, where $\theta_1,\theta_2\in\Theta$ denote the private types of the two agents, and $\boldsymbol{\theta}=(\theta_1,\theta_2)$ denotes the joint type. We denote agent $i$’s payoff matrix as $G(\theta_i)\in\Re^{N\times N}$, and let $G(\boldsymbol{\theta})=[G(\theta_1),G(\theta_2)^{\top}]$ denote the bi-matrix game parameterized by $\boldsymbol{\theta}$ (with agent 1 as the row player). In a single episode, the agents play $G(\boldsymbol{\theta})$ for a fixed number of stages $0<T<\infty$. We let $a^1_t$ and $a^2_t$ denote the actions chosen by agents 1 and 2 in stage $0<t\leq T$. For mixed strategies $\sigma,\sigma'\in\Delta(N)$, we let $G(\sigma,\sigma';\theta)=\sigma^{\top}G(\theta)\sigma'$. We overload $a^1_t$ and $a^2_t$ to also denote the mixed strategies that assign all probability mass to actions $a^1_t$ and $a^2_t$, such that $G(a^1_t,a^2_t;\theta_1)$ and $G(a^1_t,a^2_t;\theta_2)$ are agent $1$ and $2$’s respective payoffs at stage $t$. We also assume that for all $\theta\in\Theta$, $G_{ij}(\theta)\in[0,1],\ \forall i,j\in[N]$.

Let $\mathcal{H}_t=(N\times N)^t$ be the set of histories of length $t$ (with $\mathcal{H}_0=\{\emptyset\}$), and let $\mathcal{H}_{\leq t}=\bigcup^{t}_{s=0}\mathcal{H}_s$ be the set of all histories of length at most $t$. The strategy space $\Pi$ for an agent is then the space of mappings $\pi:\Theta\times\mathcal{H}_{\leq T-1}\mapsto\Delta(N)$, where $\Delta(N)$ is the set of probability distributions over the action set $[N]$. As a functional, a strategy $\pi$ maps each type $\theta$ to a behavioral strategy [2, Chapter 5.2.2] that maps histories of play to action distributions, such that $a^i_t\sim\pi_i(\theta_i,h_{t-1})$. We denote agent $i$’s expected total payoff for following strategy $\pi$ against $\pi'$ as

$$M_i(\pi,\pi';\theta,\theta')=\text{E}\left[\left.\sum^{T}_{t=1}G(a^i_t,a^{-i}_t;\theta_i)\,\right|\,\pi_i=\pi,\ \pi_{-i}=\pi',\ \theta_i=\theta,\ \theta_{-i}=\theta'\right], \tag{1}$$

where the expectation is taken over the actions $a^i_t$ and $a^{-i}_t$ sampled from the agents’ strategies.
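As an illustrative sketch (not part of the formal development), the expected total payoff in Eq. (1) can be estimated by simulating episodes. The names `rollout` and `estimate_payoffs`, the strategy signature `(theta, history) -> distribution`, and the convention that each payoff matrix is indexed by the agent's own action first are assumptions made for this example only.

```python
import random

def rollout(pi_1, pi_2, G1, G2, theta_1, theta_2, T, rng):
    """Play one T-stage episode and return both agents' total payoffs."""
    history = []                      # realized (a1, a2) pairs so far
    total_1 = total_2 = 0.0
    for _ in range(T):
        p1 = pi_1(theta_1, tuple(history))   # agent 1's distribution over [N]
        p2 = pi_2(theta_2, tuple(history))   # agent 2's distribution over [N]
        a1 = rng.choices(range(len(p1)), weights=p1)[0]
        a2 = rng.choices(range(len(p2)), weights=p2)[0]
        total_1 += G1[a1][a2]         # G(a^1_t, a^2_t; theta_1)
        total_2 += G2[a2][a1]         # G(a^2_t, a^1_t; theta_2), own action first
        history.append((a1, a2))
    return total_1, total_2

def estimate_payoffs(pi_1, pi_2, G1, G2, theta_1, theta_2, T,
                     episodes=1000, seed=0):
    """Monte Carlo estimate of (M_1, M_2) from Eq. (1)."""
    rng = random.Random(seed)
    sum_1 = sum_2 = 0.0
    for _ in range(episodes):
        r1, r2 = rollout(pi_1, pi_2, G1, G2, theta_1, theta_2, T, rng)
        sum_1 += r1
        sum_2 += r2
    return sum_1 / episodes, sum_2 / episodes
```

Here `G1` and `G2` stand for the already type-resolved matrices $G(\theta_1)$ and $G(\theta_2)$; the estimate is exact only for deterministic strategies and otherwise carries sampling error.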

2.1 Consistency

A natural criterion for rationality is that an agent should attempt to achieve a payoff nearly as large as the best response to its partner’s average strategy, which we refer to as consistency. To account for the non-stationary behavior of other agents, we specifically consider Hannan consistency [3], which in our finite-time setting simply requires that an agent have bounded external regret over $T$ stages. The external regret for agent $i$ is defined as

$$R^{\text{ext}}_i(h;\theta)=\max_{a^i\in[N]}\sum^{|h|}_{t=1}\left\{G(a^i,a^{-i}_t(h);\theta_i)-G(a^i_t(h),a^{-i}_t(h);\theta_i)\right\} \tag{2}$$

where $a^i_t(h)$ denotes the action $i$ played at stage $t$ within the history $h\in\mathcal{H}_{\leq T}$.
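The following is a minimal sketch of how the external regret in Eq. (2) could be computed from a realized history. The function name `external_regret`, the 0/1 agent indexing, and the representation of a history as a list of action pairs are assumptions introduced for illustration.

```python
def external_regret(history, G, agent):
    """External regret of `agent` (0 or 1) on `history`, as in Eq. (2).

    `history` is a sequence of (a1, a2) action pairs, and G[own][other] is the
    agent's payoff for playing `own` against the partner's action `other`.
    """
    own = [h[agent] for h in history]
    other = [h[1 - agent] for h in history]
    # Payoff the agent actually realized along the history.
    realized = sum(G[a][b] for a, b in zip(own, other))
    # Payoff of the best fixed action against the partner's realized actions.
    best_fixed = max(sum(G[a][b] for b in other) for a in range(len(G)))
    return best_fixed - realized
```

For instance, with actions indexed as 0 and 1 and the payoffs `[[2, 0], [3, 1]]`, a history of three `(0, 0)` stages gives regret 3 for either agent, whereas three `(1, 1)` stages give regret 0.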

Definition 2.1 (Consistency).

For $\delta,\epsilon,T>0$, an agent $i\in\{1,2\}$ is $(\delta,\epsilon,T)$-consistent if, for all types $\theta\in\Theta$, and any partner strategy, we have that $\frac{1}{T}R^{\text{ext}}_i(h_T;\theta)\leq\epsilon$ with probability at least $1-\delta$.

We also define the expected external regret $\bar{R}^{\text{ext}}_i(h;\theta)$ by replacing the $a^i_t(h)$ (the action $i$ played at stage $t$) with their full strategy $\pi^i(\theta,h_t)$. $R^{\text{ext}}_i(h;\theta)$ and $\bar{R}^{\text{ext}}_i(h;\theta)$ are related by the inequality

$$R^{\text{ext}}_i(h_t;\theta)\leq\bar{R}^{\text{ext}}_i(h_t;\theta)+\sqrt{\frac{T}{2}\ln\frac{1}{\delta}}, \tag{3}$$

which holds w.p. at least $1-\delta$ for all $t\leq T$ simultaneously (this follows directly from [4, Lemma 4.1]). We therefore only need to bound $\bar{R}^{\text{ext}}_i(h_t;\theta)$ to provide high-probability regret bounds.
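To give a sense of scale for the additive term in Eq. (3), the short sketch below evaluates it numerically. The helper names `regret_deviation` and `realized_regret_bound` are hypothetical and serve only this illustration.

```python
import math

def regret_deviation(T, delta):
    """Additive term in Eq. (3): sqrt((T / 2) * ln(1 / delta))."""
    return math.sqrt(0.5 * T * math.log(1.0 / delta))

def realized_regret_bound(expected_regret, T, delta):
    """High-probability bound on realized regret given a bound on expected regret."""
    return expected_regret + regret_deviation(T, delta)
```

For $T=1000$ and $\delta=0.05$ the term is roughly $38.7$, or about $0.039$ per stage; since it grows as $\sqrt{T}$, its per-stage contribution vanishes as the horizon grows.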

2.2 Cooperative compatibility

|     | $A$   | $B$   |
|-----|-------|-------|
| $A$ | $2,2$ | $0,0$ |
| $B$ | $0,0$ | $1,1$ |

(a) A fully-cooperative 2x2 matrix game.

|     | $C$   | $D$   |
|-----|-------|-------|
| $C$ | $2,2$ | $0,3$ |
| $D$ | $3,0$ | $1,1$ |

(b) The prisoner’s dilemma game.

Table 1

Even in a fully cooperative game, the fact that both agents are consistent does not guarantee that they will achieve an optimal outcome. In the $2\times 2$ game in Table 1a for example, both $(A,A)$ and $(B,B)$ are Nash equilibria to which consistent agents could converge, but only $(A,A)$ is optimal. In general-sum games, consistency may preclude Pareto-optimal outcomes, as in the classic prisoner’s dilemma game (Table 1b), where the only outcome in which neither player incurs positive regret is $(D,D)$, which is Pareto-dominated by $(C,C)$. Therefore, similar to [1], we define successful cooperation in terms of the Pareto-optimal Nash equilibria (PONE) [5] of a game $G$.
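The claims about the two games in Table 1 can be checked with a small sketch. It is restricted to pure-strategy profiles, which is an assumption of this example (mixed equilibria are not enumerated), and the function names `pure_nash` and `not_strictly_dominated` are hypothetical. Actions are indexed as 0 and 1, and each matrix is indexed by the agent's own action first.

```python
from itertools import product

def pure_nash(G1, G2):
    """Pure-strategy Nash equilibria; G1[a1][a2] and G2[a2][a1] are own payoffs."""
    n = len(G1)
    equilibria = []
    for a1, a2 in product(range(n), repeat=2):
        best_1 = all(G1[a1][a2] >= G1[d][a2] for d in range(n))
        best_2 = all(G2[a2][a1] >= G2[d][a1] for d in range(n))
        if best_1 and best_2:
            equilibria.append((a1, a2))
    return equilibria

def not_strictly_dominated(profiles, G1, G2):
    """Profiles that no other listed profile beats strictly for both agents."""
    def payoff(p):
        return G1[p[0]][p[1]], G2[p[1]][p[0]]
    kept = []
    for p in profiles:
        u1, u2 = payoff(p)
        if not any(payoff(q)[0] > u1 and payoff(q)[1] > u2 for q in profiles):
            kept.append(p)
    return kept

COORD = [[2, 0], [0, 1]]   # Table 1a (A=0, B=1), identical for both agents
PD = [[2, 0], [3, 1]]      # Table 1b (C=0, D=1), symmetric game

coord_ne = pure_nash(COORD, COORD)                 # (A,A) and (B,B)
coord_pone = not_strictly_dominated(coord_ne, COORD, COORD)
pd_ne = pure_nash(PD, PD)                          # only (D,D)
all_pd = list(product(range(2), repeat=2))
pd_undominated = not_strictly_dominated(all_pd, PD, PD)
```

In the coordination game only $(A,A)$ survives the dominance filter among the pure equilibria, while in the prisoner's dilemma the unique pure equilibrium $(D,D)$ is dominated by $(C,C)$ once non-equilibrium outcomes are also considered.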

Let $\mathcal{N}(G)\subseteq\Delta(N)\times\Delta(N)$ be the set of Nash equilibria (NE) of $G$. For a fully-cooperative game, $\mathcal{N}(G)$ will contain all globally optimal strategy profiles for $G$. It may, however, also contain joint strategies that are highly sub-optimal. Let $\mathcal{P}(G)\subseteq\mathcal{N}(G)$ denote the set of Pareto optimal Nash equilibria. In this work, we say that a strategy profile $\langle\sigma_1,\sigma_2\rangle\in\mathcal{P}(G)$ if and only if $\langle\sigma_1,\sigma_2\rangle\in\mathcal{N}(G)$, and there does not exist $\langle\sigma'_1,\sigma'_2\rangle\in\mathcal{N}(G)$ such that $G(\sigma'_1,\sigma'_2;\theta_1)>G(\sigma_1,\sigma_2;\theta_1)$ and $G(\sigma'_2,\sigma'_1;\theta_2)>G(\sigma_2,\sigma_1;\theta_2)$. This means that $\langle\sigma_1,\sigma_2\rangle$ is a PONE if it is a Nash equilibrium of $G$, and it is not strongly Pareto-dominated by any other Nash equilibrium of $G$. Intuitively, if two agents are individually consistent, and willing to cooperate with each other, their joint payoff profile should not be dominated by any PONE. We formalize this intuition as follows:

Definition 2.2 (Compatibility).

For $\delta,\epsilon,T>0$, two agents $\pi^1$ and $\pi^2$ are $(\delta,\epsilon,T)$-compatible if, when played together, for any joint type $\boldsymbol{\theta}\in\Theta\times\Theta$, w.p. at least $1-\delta$, $\exists\langle\sigma^*_1,\sigma^*_2\rangle\in\mathcal{P}(G(\boldsymbol{\theta}))$ s.t.

$$\frac{1}{T}\sum^{T}_{t=1}G(\sigma^*_i,\sigma^*_{-i};\theta_i)-G(a^i_t,a^{-i}_t;\theta_i)\leq\epsilon, \tag{4}$$

for both $i=1$ and $i=2$.

A pair of agents is compatible if, when paired together, with high probability over their path of play $h_T$ there will exist some PONE that does not $\epsilon$-dominate their realized payoffs. Note that this definition is the approximate and finite-horizon version of the one provided in [1].
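As a hedged illustration of Definition 2.2, the sketch below evaluates the left-hand side of Eq. (4) for a single realized path of play and a single candidate PONE supplied by the caller. It does not search over $\mathcal{P}(G(\boldsymbol{\theta}))$ and does not estimate the probability $1-\delta$; the function names and the own-action-first matrix indexing are assumptions of this example.

```python
def mixed_payoff(G, sigma_own, sigma_other):
    """Bilinear payoff sigma_own^T G sigma_other."""
    n = len(G)
    return sum(sigma_own[a] * G[a][b] * sigma_other[b]
               for a in range(n) for b in range(n))

def compatibility_gap(history, G, agent, sigma_own, sigma_other):
    """Left-hand side of Eq. (4) for one agent (0 or 1) on one path of play."""
    T = len(history)
    target = mixed_payoff(G, sigma_own, sigma_other)
    realized = sum(G[h[agent]][h[1 - agent]] for h in history) / T
    return target - realized

def path_is_compatible(history, G1, G2, pone, eps):
    """True if the candidate PONE does not eps-dominate the realized payoffs."""
    sigma_1, sigma_2 = pone
    return (compatibility_gap(history, G1, 0, sigma_1, sigma_2) <= eps
            and compatibility_gap(history, G2, 1, sigma_2, sigma_1) <= eps)
```

With the payoffs of Table 1a and the candidate PONE $(A,A)$, a path that always plays $(A,A)$ has gap 0 for both agents, whereas a path that always plays $(B,B)$ has gap 1 and fails the check for any $\epsilon<1$.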

2.3 Socially intelligent agents

We argue that it is natural to model an existing population of cooperating agents as a set of approximately compatible, but otherwise heterogeneous agents. We therefore introduce the more general idea of a socially intelligent class of agents that are compatible with any other member of their class:

Definition 2.3 (Social Intelligence).

A set $C$ of agents forms a socially intelligent class w.r.t. $\Theta$ if, for some $\delta,\epsilon,T>0$, each agent $\pi\in C$ is $(\delta,\epsilon,T)$-consistent for all $\theta\in\Theta$, and any two agents $\pi,\pi'\in C$ are $(\delta,\epsilon,T)$-compatible over all joint types $\Theta$. An individual agent $\pi$ is called socially intelligent if it forms a socially intelligent class $\{\pi\}$ with itself.

The Hannan consistency requirement ensures that any agent in the population always has bounded average regret, whereas the approximate compatibility means if both agents are from $C$, with high probability there will exist some PONE that does not $\epsilon$-dominate their path of play. Below we describe a socially intelligent class based on a pre-agreed coordination protocol.

Coordination protocols

For a type space $\Theta$, we first define a function $s(\boldsymbol{\theta})\in\mathcal{P}(G(\boldsymbol{\theta}))$ that maps from each joint type $\boldsymbol{\theta}$ to a strategy profile in $\mathcal{P}(G(\boldsymbol{\theta}))$. We can think of $s(\boldsymbol{\theta})$ as a common “convention” the agents in $C$ have settled upon. Since we assume private types, members of $C$ do not know each other’s type at the beginning of their interaction. If any type $\theta\in\Theta$ can be communicated to others in a sequence of $k<T$ actions, then agents in $C$ can agree on a coordination protocol similar to a handshake. Let the protocol be a map $\kappa(\theta)$ from types to a history-dependent policy. Then, at the beginning of each interaction, both agents will play $\kappa$ for $k$ steps in order to communicate their types. After coordinating with each other, the agents play $s((\theta_i,\theta_{-i}))$ for the remaining $T-k$ steps. The agents must still ensure their partner does not deviate from $s((\theta_i,\theta_{-i}))$ for safety against adversarial “imposters”. Since playing a PONE jointly will lead to low regret for both, if $i$’s regret exceeds a certain threshold, this would indicate that $-i$ is deviating from $s$ significantly. The threshold can be chosen with the aid of the following lemma:

Lemma 2.4.

For any $\delta,T>0$, if both players follow strategy $s(\boldsymbol{\theta})$ at each stage, then with probability at least $1-\delta$ we have

$$\bar{R}^{\text{ext}}_i(h_t;\theta_i)\leq\sqrt{2T\ln\frac{2}{\delta}}\quad\text{and}\quad R^{\text{ext}}_i(h_t;\theta_i)\leq 2\sqrt{2T\ln\frac{4}{\delta}}, \tag{5}$$

which follows from an application of the Azuma-Hoeffding inequality (shown in Appendix A.1). The question is then which safe strategy agent $i$ should fall back on if the rule is triggered. We base the fallback strategy on the multiplicative weights [6] update rule, defined as:

$$s^i_{\text{mw},k}(h_t;\theta_i)\propto s^i_{\text{mw},k}(h_{t-1};\theta_i)\exp\left(-\eta G(k,a^{-i}_{t-1}(h);\theta_i)\right) \tag{6}$$

for $k\in[N]$, where $s^i_{\text{mw}}(h_0;\theta_i)$ is the uniform strategy. Define $\pi^{\text{mw},T}$ as the agent that plays $s^i_{\text{mw}}(h_t;\theta_i)$ with learning rate $\eta=\sqrt{8\ln(N)/T}$. The expected external regret of $\pi^{\text{mw},T}$ is bounded as

$$\bar{R}^{\text{ext}}_{i}(h_{T};\theta_{i})\leq\sqrt{\frac{T}{2}\ln N} \tag{7}$$

surely [4, Theorem 2.2]. We then define the agent’s overall strategy $\pi^{T,\epsilon}$ as follows:

  1. In the first $k$ steps, play $\kappa(\theta_{i})$.

  2. If $-i$’s behaviour in $h_{k}$ is not compatible with $\kappa(\theta)$ for any $\theta\in\Theta$, switch to $\pi^{\text{mw},T}$ for all subsequent stages.

  3. While $\bar{R}^{\text{ext}}_{i}(h_{t};\theta_{i})\leq k+\epsilon(T-k)-\sqrt{\frac{T-k}{2}\ln N}-1$, play $s_{i}(\boldsymbol{\theta})$.

  4. Otherwise, switch to $\pi^{\text{mw},T}$ for all subsequent stages.
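To make the fallback $\pi^{\text{mw},T}$ used in steps 2 and 4 concrete, the following minimal Python sketch runs a multiplicative-weights update on a table of per-stage values and compares its expected external regret with the bound in (7). It is an illustration under stated assumptions, not the authors' implementation: the name `hedge_regret`, the random table, and the horizon are hypothetical; the exponentiated quantity is assumed to be a loss in $[0,1]$; and the learning rate is assumed to take the standard tuning $\sqrt{8\ln(N)/T}$ of the exponentially weighted forecaster.

```python
import math
import random


def hedge_regret(losses, eta):
    """Run exponential weights on a T x N table of losses in [0, 1].

    Returns the expected external regret against the best fixed action.
    """
    n_actions = len(losses[0])
    log_w = [0.0] * n_actions          # uniform initial strategy
    cumulative = [0.0] * n_actions     # cumulative loss of each fixed action
    expected_loss = 0.0
    for row in losses:
        # Normalise in log-space for numerical stability.
        top = max(log_w)
        weights = [math.exp(v - top) for v in log_w]
        total = sum(weights)
        probs = [w / total for w in weights]
        expected_loss += sum(p * l for p, l in zip(probs, row))
        # Multiplicative update: weight_k <- weight_k * exp(-eta * loss_k).
        for k, loss in enumerate(row):
            log_w[k] -= eta * loss
            cumulative[k] += loss
    return expected_loss - min(cumulative)


# Hypothetical instance: T stages, N actions, arbitrary losses in [0, 1].
T, N = 500, 4
rng = random.Random(0)
losses = [[rng.random() for _ in range(N)] for _ in range(T)]
eta = math.sqrt(8 * math.log(N) / T)   # assumed standard tuning
regret = hedge_regret(losses, eta)
bound = math.sqrt(T / 2 * math.log(N))
print(regret <= bound)
```

Because the bound holds for every loss sequence, the comparison does not depend on the random seed chosen here.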

The theorem below shows that agents that follow the social authentication strategy above form a socially intelligent class among themselves. All proofs have been deferred to appendix A.2.

Theorem 2.5.

For any $\delta,T>k$, let $\epsilon_{0}\geq\sqrt{\frac{2}{(T-k)}\ln\frac{2}{\delta}}$, and let $\epsilon_{1}=\epsilon_{0}+\sqrt{\frac{1}{2(T-k)}\ln N}+\frac{1}{(T-k)}$. Then for $\epsilon=\epsilon_{1}+\sqrt{\frac{(T-k)}{2}\ln\frac{1}{\delta}}$, the strategy $\pi^{T,\epsilon_{1}}$ is $(\delta,\epsilon,T)$-socially intelligent.

3 Learning to Cooperate

Going forward, we will assume that our agent (henceforth referred to as the “AI”) will take the role of agent $1$, while the other agent (referred to as the “partner”) will be agent $2$. Our goal is to choose a strategy for the AI that can cooperate with a partner drawn from some target population nearly as effectively as agents from this population cooperate with one another. For a parametric game $G$ with type space $\Theta$, we will let the target population be a set $C$ of strategies forming a $(\delta,\epsilon,T)$-SI class w.r.t. $\Theta$. Ideally, we would hope to choose an AI strategy $\pi$ that can cooperate with $C$ without any additional information about the strategies in $C$. Looking at the coordination protocol example in Section 2.3, we can see that in many cases a population is likely to use arbitrary conventions to coordinate their behavior, and intuitively we would imagine cooperation to be impossible without prior knowledge of these conventions. (We make this intuition formal in Theorem 3.5.)

We therefore consider the problem of learning a cooperative AI strategy using prior observations of members of the target population interacting with one another. We define a social learning problem by a tuple $\{G,\Theta,C,\rho,\mu\}$, where $C$ is the target population (SI w.r.t. $\Theta$), $\rho$ is a distribution over $C$, and $\mu$ is a distribution over the joint type space $\Theta\times\Theta$. We can think of $C$ as the set of possible strategies that any member of the target population might follow, while $\rho$ is the frequency of those strategies within the population. To choose an AI strategy, we leverage a dataset $\mathcal{D}=\{(\theta^{j}_{1},\theta^{j}_{2},h^{j}_{T})\,|\,j\in[n]\}$ covering $n$ episodes of length $T$. In each episode $j$, two agents $\pi^{1}_{j}$ and $\pi^{2}_{j}$ are sampled independently from $\rho$, and played together under the joint type $\boldsymbol{\theta}_{j}\sim\mu$. The AI observes the full history $h^{j}_{T}$, along with the agents’ types $\theta^{j}_{1}$ and $\theta^{j}_{2}$. We denote a specific learning algorithm as a data-conditioned strategy $\pi(\mathcal{D})$.
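The data-generating process just described can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: agents are modelled as hypothetical callables mapping a history, their own type, and a random generator to an action; types and actions are integers; and `Episode`, `sample_dataset`, and `play_own_type` are names introduced here rather than notation from the paper.

```python
import random
from dataclasses import dataclass
from typing import Tuple

History = Tuple[Tuple[int, int], ...]   # joint actions (a^1, a^2), one per stage


@dataclass(frozen=True)
class Episode:
    theta_1: int        # type of the agent in role 1
    theta_2: int        # type of the agent in role 2
    history: History    # full T-step history h_T


def sample_dataset(population, rho, joint_types, mu, T, n, rng):
    """Draw n episodes: two agents i.i.d. from rho, a joint type from mu."""
    data = []
    for _ in range(n):
        # Independent draws (with replacement) from the population.
        pi_1, pi_2 = rng.choices(population, weights=rho, k=2)
        theta_1, theta_2 = rng.choices(joint_types, weights=mu, k=1)[0]
        history = ()
        for _ in range(T):
            # Both agents act simultaneously on the same history.
            a1 = pi_1(history, theta_1, rng)
            a2 = pi_2(history, theta_2, rng)
            history = history + ((a1, a2),)
        data.append(Episode(theta_1, theta_2, history))
    return data


def play_own_type(history, theta, rng):
    """Hypothetical agent that always plays its type as an action."""
    return theta


rng = random.Random(0)
dataset = sample_dataset(
    population=[play_own_type], rho=[1.0],
    joint_types=[(0, 0), (0, 1), (1, 0), (1, 1)], mu=[0.25] * 4,
    T=3, n=5, rng=rng,
)
```

Each element of `dataset` holds exactly what the AI is assumed to observe: both types and the full joint history.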

3.1 Altruistic Regret

We seek an AI strategy that minimizes the regret relative to some Pareto optimal solution to $G(\boldsymbol{\theta})$. Rather than minimizing regret in terms of the AI’s own payoffs, however, we seek to minimize the partner’s regret relative to their (worst-case) PONE in $G(\boldsymbol{\theta})$. We formalize this regret with the following definition:

Definition 3.1 (Altruistic Regret).

Let $(\sigma^{*}_{i},\sigma^{*}_{-i})\in\mathcal{P}(G_{-i}(\theta_{-i}))$ denote the PONE with the lowest payoff for agent $-i$, where $i\in\{1,2\}$. The altruistic regret of agent $i$ is defined as

$$R^{\text{alt}}_{i}(h_{T};\theta_{-i})=\sum^{T}_{t=1}G(\sigma^{*}_{i},\sigma^{*}_{-i};\theta_{-i})-G(a^{i}(h_{t}),a^{-i}(h_{t});\theta_{-i}). \tag{8}$$

In practical cooperation tasks, we would expect that outcomes with low regret for the partner will have low regret for the AI as well.
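As a small worked instance of Definition 3.1, the sketch below sums the per-stage gap between the partner's payoff at the reference PONE and the partner's realised payoff. It assumes the PONE payoff is supplied by the caller (computing the worst-case PONE is not attempted here), and the function name, the payoff table, and the example history are hypothetical.

```python
def altruistic_regret(history, partner_payoff, pone_value):
    """Altruistic regret of agent i over a history of joint actions.

    history        : sequence of (a_i, a_minus_i) pairs, one per stage
    partner_payoff : partner_payoff[a_i][a_minus_i] is the partner's payoff
                     G(a_i, a_minus_i; theta_{-i})
    pone_value     : the partner's payoff at the reference PONE,
                     assumed to be given rather than computed
    """
    return sum(pone_value - partner_payoff[a_i][a_j] for a_i, a_j in history)


# Hypothetical 2x2 payoff table for the partner, with payoffs in [0, 1].
partner_payoff = [[1.0, 0.0],
                  [0.0, 0.5]]
pone_value = 0.5                      # assumed worst-case PONE payoff
history = [(1, 1), (0, 1), (0, 1)]    # one coordinated stage, two miscoordinated
print(altruistic_regret(history, partner_payoff, pone_value))  # 1.0
```

The regret is charged in the partner's payoffs only; the AI's own payoffs do not appear anywhere in the computation.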

The cooperation objective for the AI agent can then be formalized as minimising the altruistic regret. Contrary to what the definition might suggest, the AI agent must know its own type as well. This is because, as seen in the coordination protocols example, if the AI fails to imitate a human of its type or fails to communicate its type correctly, the partner might switch to a safe strategy.

The goal for the AI is to minimize its expected altruistic regret over partners sampled from $\rho$ and types sampled from $\mu$. The following lemma shows that we can treat the problem of minimizing regret with respect to a heterogeneous population $C$ as that of minimizing regret w.r.t. a single stochastic strategy.

Lemma 3.2.

Let $C$ be a finite set of agents that are $(\delta,\epsilon,T)$-socially intelligent w.r.t. type space $\Theta$, and let $\rho$ be a distribution over $C$. There exists a mixed strategy $\bar{\rho}$ that forms a $(\delta,\epsilon,T)$-socially intelligent class, and which is equivalent to playing against partners sampled from $\rho$ in expectation.

Proof. In a perfect recall game, every behavioural strategy has an equivalent mixed strategy and vice-versa [7]. Thus $\rho$ can equivalently be defined as a distribution over mixed strategies so that $\rho\in\Delta(\Delta(N))$. Then defining $\bar{\rho}(a)=\int_{\Delta(N)}\sigma(a)\,d\rho(\sigma)$, where $a\in[N]$ denotes a pure strategy (i.e. action), completes the proof.
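For a finite population, the integral in the proof reduces to a weighted average of the members' mixed strategies. The sketch below illustrates this reduction; the function name and the two example strategies are hypothetical, and each strategy is represented simply as a list of action probabilities.

```python
def average_strategy(strategies, weights):
    """Mixture of finitely many mixed strategies.

    strategies : list of probability vectors over the same actions
    weights    : population frequency of each strategy (rho)
    """
    n_actions = len(strategies[0])
    total = sum(weights)
    return [
        sum(w * sigma[a] for sigma, w in zip(strategies, weights)) / total
        for a in range(n_actions)
    ]


# Hypothetical population: one deterministic and one uniform strategy.
strategies = [[1.0, 0.0], [0.5, 0.5]]
rho = [0.25, 0.75]
rho_bar = average_strategy(strategies, rho)
print(rho_bar)  # [0.625, 0.375]
```

Sampling a partner from `rho` and then an action from that partner's strategy yields the same action distribution as sampling directly from `rho_bar`.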

In order to show the joint impact of consistency and compatibility on the learning problem, we discuss the cases where the population is either consistent or compatible, but not both.

3.2 Consistency without Compatibility

Assume that $C$ consists of agents that are consistent but not necessarily compatible. The most general class in this case is the class of all no-external-regret learners (no-regret henceforth). It is a well-established result that the long-run average of no-regret learning converges to the set of coarse correlated equilibria (CCE). The question is whether the AI agent can learn to do better than a coarse correlated equilibrium when paired with a member of $C$, using only a dataset $\mathcal{D}$ that consists of histories of play for different CCEs.

Theorem 3.3.

There exists a consistent yet incompatible class of agents $C$ such that even with an infinite amount of data, the AI cannot learn strategies that minimise altruistic regret.

Proof. The proof follows from Theorem 3 of Monnot and Piliouras [8], which shows that given any coarse correlated equilibrium of a two-player normal-form game, there exists a pair of no-regret learners that would converge to it. Since $C$ can be any subset of no-regret learners, we cannot exclude those that converge to an inefficient CCE. If the class $C$ contains only agents that converge to Pareto-inefficient CCEs, we cannot hope to learn optimal strategies from any dataset. Given an observed CCE $z$ in the dataset, assume that the AI knows it is facing one of the two agents that generated $z$, but does not know their type explicitly. Using a Stackelberg argument similar to Brown et al. [9], we prove in Appendix B.2 that the AI can compute and commit to a leader strategy such that the payoffs are never strongly Pareto-dominated by $z$. However, even in this case, we cannot eliminate the possibility of it being weakly dominated.

Regardless of the dataset, in the online phase, the AI faces a new agent from $C$ each time and does not know their type. We may hope to learn a classifier to quickly infer our partner’s type online from their behaviour, assuming there exists a mapping from initial behaviour to types. However, since $C$ consists only of no-regret learners guaranteed to converge to a CCE in self-play, they have no reason to initially communicate their types to each other.

3.3 Compatibility without Consistency

Assume that the members of $C$ are compatible but not consistent. We can construct such a class by using the coordination protocols example from Section 2.3. Now, when agents from $C$ successfully identify each other after the authentication phase, they proceed with playing the agreed-upon PONE. However, if at any moment they play the wrong action, there is no constraint on what strategy they will switch to. This setting is equivalent to the case considered by Loftin and Oliehoek [10] in their impossibility result. The members of $C$ can employ grim-trigger strategies that forever punish the other agent, triggered by a mistake at any point. Even if we eliminate grim-trigger strategies, the impossibility result shows that there still exist strategies that the members of $C$ can play once triggered, and that make the other agent suffer regret arbitrarily close to $\frac{1}{2}$ with payoffs in $[0,1]$. Since a single mistake during the online interaction can lead to the partner playing strategies that yield linear regret, the outsider must learn to imitate at least one member of $C$ perfectly from the dataset. Therefore the offline problem in this setting reduces to imitation learning, in particular the no-interaction case from Rajaraman et al. [11].

For each agent, the authentication protocol $\kappa$ is equivalent to a history-dependent policy that they commit to playing in the first $k$ time-steps. The lower bound on the expected sub-optimality of imitation learning from Rajaraman et al. [11] is based on the fact that the imitator cannot do better than uniformly random in unseen states. In the case of $\kappa$, states correspond to histories up to length $k$. Since every $k$-step history can uniquely embed a type, an unseen history means a high probability of making a mistake if paired with the corresponding type. Therefore, to avoid linear altruistic regret, the AI must observe at least $|\mathcal{H}_{k}|$ samples, where $\mathcal{H}_{k}$ is the set of all possible $k$-step histories.

Theorem 3.4.

Let $M$ be the number of unique samples of $k$-step histories in the dataset. There exists a class of agents $C$ with a $k$-step social authentication protocol such that to bound the probability of failing to authenticate by $\delta$, we need $M\geq\frac{N^{3k}-\delta N^{3k}-N^{2k}}{N^{k}-1}$ samples. Then for growing $k$, the sample complexity lower bound is $M=\Omega(N^{2k})$.

Proof. Consider the coordination protocol example mentioned above. Let $h_{k}\in\mathcal{H}_{k}$ be missing from the dataset. When the AI is paired with the corresponding partner type, the probability of correctly authenticating is $\frac{1}{N^{k}}$, and thus authentication fails with probability $\frac{N^{k}-1}{N^{k}}$. Assuming we face each type uniformly at random, if we have $M$ unique samples, the probability of facing an unobserved history is $\frac{N^{2k}-M}{N^{2k}}$, since $|\mathcal{H}_{k}|=N^{2k}$. Then the probability of failing is $\frac{N^{k}-1}{N^{k}}\times\frac{N^{2k}-M}{N^{2k}}=1-\frac{M}{N^{2k}}-\frac{1}{N^{k}}+\frac{M}{N^{3k}}$. In order to bound this by $\delta$, we need $M\geq\frac{N^{3k}-\delta N^{3k}-N^{2k}}{N^{k}-1}$ samples. Since the $k$ steps need to embed each type uniquely, $k$ grows with the size of the type space. For large $k$, the numerator is dominated by $N^{3k}$ and the denominator by $N^{k}$, thus we have $M=\Omega(N^{2k})$ as $k$ grows.
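The counting argument can be checked numerically. The sketch below works directly from the product form of the failure probability, $\frac{N^{k}-1}{N^{k}}\times\frac{N^{2k}-M}{N^{2k}}$, and solves it for the smallest $M$ that keeps the failure probability at or below $\delta$. Exact rational arithmetic is used to avoid rounding at the threshold; the function names and the small values of $N$, $k$, and $\delta$ are hypothetical choices for illustration.

```python
import math
from fractions import Fraction


def failure_probability(N, k, M):
    """Probability of facing an unseen k-step history and guessing wrong."""
    wrong_guess = Fraction(N**k - 1, N**k)           # uniform guess over N^k options
    unseen = Fraction(N**(2 * k) - M, N**(2 * k))    # |H_k| = N^(2k) joint histories
    return wrong_guess * unseen


def min_unique_histories(N, k, delta):
    """Smallest M with failure_probability(N, k, M) <= delta (needs N >= 2, k >= 1)."""
    delta = Fraction(delta)
    total = N**(2 * k)
    # Rearranging the product form: M >= N^(2k) * (1 - delta * N^k / (N^k - 1)).
    threshold = total * (1 - delta * Fraction(N**k, N**k - 1))
    return max(0, math.ceil(threshold))


N, k, delta = 2, 2, Fraction(1, 10)
M_min = min_unique_histories(N, k, delta)
print(M_min, failure_probability(N, k, M_min))   # 14 3/32
```

With $N=2$ and $k=2$ there are $16$ joint histories, so nearly all of them must appear in the dataset before the failure probability drops below $0.1$.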

An immediate consequence of Theorem 3.4 is that, in the case of compatibility without consistency, this sample complexity is what is required to bound the probability of suffering linear regret. This is because failing to authenticate can now lead to linear regret, since the partner can switch to arbitrary strategies.

3.4 Lower bound for socially intelligent populations

Theorem 3.5.

Let $M$ denote the number of histories with unique first $k$ steps in a dataset $\mathcal{D}$ generated by the members of a socially intelligent class $C$. There exists a $C$ where $R^{\text{alt}}_{i}(h_{T};\theta_{-i})=T$ with probability $\frac{N^{k}-1}{N^{k}}\times\frac{N^{2k}-M}{N^{2k}}=1-\frac{M}{N^{2k}}-\frac{1}{N^{k}}+\frac{M}{N^{3k}}$.

Proof. Let $C$ be a socially intelligent class of agents following a coordination protocol akin to the one described in Section 2.3. The probability follows from the proof of Theorem 3.4 as the probability of failing to authenticate. If the authentication fails, the partner switches to an arbitrary Hannan-consistent strategy. As stated in Section 3.2, a consistent partner strategy may never communicate the partner’s type. Without knowing the partner’s type, the agent’s worst-case average altruistic regret can be $1$, since it cannot compute its true regret without the partner’s type (see Definition 3.1). Let there be two partner types, $\theta_{-i}=\theta_{2}$ or $\theta_{3}$. If agent $i$ mistakenly assumes $\theta_{-i}=\theta_{2}$, its behaviour attempts to minimize $R^{\text{alt}}_{i}(h_{T};\theta_{2})=\sum^{T}_{t=1}G(\sigma^{*}_{i},\sigma^{*}_{-i};\theta_{2})-G(a^{i}(h_{t}),a^{-i}(h_{t});\theta_{2})$. Meanwhile, the play of the partner will be a no-regret algorithm with respect to the external regret $R^{\text{ext}}_{-i}(h;\theta_{3})$. Having no other constraints on the type space, there is nothing stopping us from constructing a $\Theta$ such that a strategy minimizing $R^{\text{ext}}_{-i}(h_{T};\theta_{3})$ ends up maximizing $R^{\text{alt}}_{i}(h_{T};\theta_{2})$. Imagine the ideal case of $R^{\text{ext}}_{-i}(h_{T};\theta_{3})=0$ where $-i$ plays the fixed best action in hindsight $a^{*}$ throughout $h_{T}$. Then the altruistic regret observed by $i$ is $R^{\text{alt}}_{i}(h_{T};\theta_{2})=\sum^{T}_{t=1}G(\sigma^{*}_{i},\sigma^{*}_{-i};\theta_{2})-G(a^{i}(h_{t}),a^{-i}=a^{*};\theta_{2})$. Let $G(a^{i},a^{*};\theta_{2})=0$ for all $a^{i}$. Then the altruistic regret is $\sum^{T}_{t=1}G(\sigma^{*}_{i},\sigma^{*}_{-i};\theta_{2})$, which is $T$ in the worst case.

4 Upper bound for socially intelligent populations

A key idea behind this work is that against a socially intelligent target population, rather than trying to imitate a member of the population perfectly throughout the entire episode, the AI only needs to imitate them long enough to learn about its partner’s private type. Once it has this information, the AI can leverage the fact that the partner’s strategy is consistent against any strategy, and try to “coerce” the human partner into playing a strategy that minimizes the altruistic regret. We will refer to such strategies as imitate-then-commit (IC) strategies, which use the previous observations $\mathcal{D}$ to learn an imitation strategy to follow over the first $\tilde{T}<T$ steps of the interaction. In this section we provide an upper bound on the altruistic regret of a specific IC strategy, as a function of the number of episodes in $\mathcal{D}$, subject to the following assumptions:

Assumption 4.1.

For $\delta,\epsilon>0$, and some $\tilde{T}<T$, we have that

  1. $\rho$ is $(\delta,\epsilon,T)$-consistent.

  2. $\rho$ is $(\delta,\epsilon,\tilde{T})$-compatible.

Imitation learning.

Under an imitate-then-commit strategy, the sample complexity is determined entirely by the number of episodes the AI needs to observe to learn a good $\tilde{T}$-step imitation policy. Fortunately, imitation learning is a well-studied problem, and we can largely leverage existing complexity bounds. The one caveat is that in this setting we need bounds on the total variation distance between the distribution over the partial history $h_{\tilde{T}}$ under the population strategy $\rho$, and that under the learned strategy. Given the dataset $\mathcal{D}$, we define the imitation strategy $\hat{\pi}^{1}_{\tilde{T}}(\mathcal{D})$ such that $\hat{\pi}^{1}_{\tilde{T}}(h;\theta,\mathcal{D})$ is the empirical distribution over agent $1$’s actions for each history-type pair $(h,\theta)$ occurring in $\mathcal{D}$, while $\hat{\pi}^{1}_{\tilde{T}}(h;\theta,\mathcal{D})$ is the uniform distribution over $N$ for $(h,\theta)\notin\mathcal{D}$.
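The tabular imitation strategy defined above can be sketched as follows. This is a minimal illustration, assuming integer types and actions and episodes stored as `(theta_1, theta_2, history)` triples with `history` a sequence of joint actions `(a1, a2)`; the name `fit_imitation` and the toy dataset are hypothetical.

```python
from collections import Counter, defaultdict


def fit_imitation(dataset, n_actions, t_tilde):
    """Empirical T~-step imitation policy for the agent in role 1.

    Returns a function mapping (history prefix, own type) to a probability
    vector: empirical frequencies on seen pairs, uniform on unseen ones.
    """
    counts = defaultdict(Counter)
    for theta_1, _theta_2, history in dataset:
        for t in range(min(t_tilde, len(history))):
            prefix = tuple(history[:t])               # joint history before stage t
            counts[(prefix, theta_1)][history[t][0]] += 1

    def policy(prefix, theta):
        seen = counts.get((tuple(prefix), theta))
        if not seen:                                   # unseen (h, theta): uniform
            return [1.0 / n_actions] * n_actions
        total = sum(seen.values())
        return [seen[a] / total for a in range(n_actions)]

    return policy


# Hypothetical two-episode dataset with two actions.
toy_data = [
    (0, 1, [(0, 1), (1, 1)]),
    (0, 1, [(0, 1), (0, 1)]),
]
policy = fit_imitation(toy_data, n_actions=2, t_tilde=2)
print(policy((), 0))           # [1.0, 0.0]
print(policy(((0, 1),), 0))    # [0.5, 0.5]
```

The uniform fallback on unseen pairs is where the dependence on the number of possible histories enters: every pair missing from the table is played no better than at random.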
We then define the marginal strategy π^T~1subscriptsuperscript^𝜋1~𝑇\hat{\pi}^{1}_{\tilde{T}}over^ start_ARG italic_π end_ARG start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT over~ start_ARG italic_T end_ARG end_POSTSUBSCRIPT, which can be implemented by sampling a dataset 𝒟𝒟\mathcal{D}caligraphic_D, and then following the imitation strategy defined by 𝒟𝒟\mathcal{D}caligraphic_D for the next T~~𝑇\tilde{T}over~ start_ARG italic_T end_ARG steps. We then have the following bound on the distribution of hT~subscript~𝑇h_{\tilde{T}}italic_h start_POSTSUBSCRIPT over~ start_ARG italic_T end_ARG end_POSTSUBSCRIPT under the imitation strategy:

Lemma 4.2.

Let $p_{\tilde{T}}$ be the distribution over partial histories $h_{\tilde{T}}$ under the population strategy $\rho$, and let $\hat{p}_{\tilde{T}}$ be their distribution under $\hat{\pi}^{1}_{\tilde{T}}$. We have that

$$\|p_{\tilde{T}}-\hat{p}_{\tilde{T}}\|_{\text{TV}}\leq\min\left\{\tilde{T},\frac{N^{2(\tilde{T}+1)}|\Theta|\tilde{T}^{2}\log(K)}{K}\right\}, \tag{9}$$

where $K=|\mathcal{D}|$.

This bound follows directly from that of [11] via Lemma 1 of [12] (see Appendix B.1 for the full proof).
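The tabular imitation strategy described above (empirical action frequencies for observed history-type pairs, uniform otherwise) can be sketched in a few lines of Python. This is a minimal illustration under assumptions that are not in the paper: the dataset layout (a list of `(theta, joint_actions)` episodes), the function name `fit_imitation_strategy`, and the convention that the imitated agent's action is the first entry of each joint action are all hypothetical.

```python
from collections import Counter, defaultdict


def fit_imitation_strategy(dataset, n_actions):
    """Build a tabular imitation strategy from observed episodes.

    dataset: iterable of (theta, joint_actions) pairs, where theta is the
             (hashable) type of the imitated agent and joint_actions is a
             list of (a1, a2) tuples. This layout is an assumption made
             for the example only.
    Returns policy(history, theta) -> list of action probabilities.
    """
    counts = defaultdict(Counter)
    for theta, joint_actions in dataset:
        history = ()
        for a1, a2 in joint_actions:
            # Record the imitated agent's action after this history.
            counts[(history, theta)][a1] += 1
            history = history + ((a1, a2),)

    def policy(history, theta):
        seen = counts.get((tuple(history), theta))
        if not seen:
            # History-type pair not in the dataset: uniform over actions.
            return [1.0 / n_actions] * n_actions
        total = sum(seen.values())
        return [seen[a] / total for a in range(n_actions)]

    return policy


# Three toy episodes of length 2 for a single type "A".
episodes = [
    ("A", [(0, 1), (1, 1)]),
    ("A", [(0, 0), (0, 1)]),
    ("A", [(1, 0), (1, 1)]),
]
policy = fit_imitation_strategy(episodes, n_actions=2)
print(policy((), "A"))         # empirical frequencies of the first action
print(policy(((0, 1),), "B"))  # unseen pair, falls back to uniform
```

The table has one entry per observed history-type pair, which is why the bound in Equation 9 scales with the number of possible histories.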

Imitate-then-commit strategy.

For a history $h_{\tilde{T}}\in\mathcal{H}_{\tilde{T}}$, we let $\hat{z}(h_{\tilde{T}})\in\Delta(N\times N)$ denote the empirical joint strategy played up to and including step $\tilde{T}$. We show that, using $\hat{z}(h_{\tilde{T}})$, it is possible to construct a mixture $\nu$ over mixed strategies $x\in\Delta(N)$ such that, in expectation over $\nu$, the partner’s payoff under their best response to $x\sim\nu$ will be at least as large as their payoff under $\hat{z}(h_{\tilde{T}})$. The corresponding IC strategy operates as follows:

  1.

    Sample $\mathcal{D}$ and compute the imitation strategy $\hat{\pi}^{1}_{\tilde{T}}(\mathcal{D})$.

  2.

    Play $\hat{\pi}^{1}_{\tilde{T}}(\mathcal{D})$ for the first $\tilde{T}$ steps, and observe $h_{\tilde{T}}$.

  3.

    Compute a suitable mixture $\nu$ from $\hat{z}(h_{\tilde{T}})$, and sample $x\sim\nu$.

  4.

    Sample actions from $x$ for the remaining $T-\tilde{T}$ steps.
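The control flow of the steps above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the `partner_act` callback, and the convention that the AI's action is the first entry of each joint action are hypothetical. In particular, `marginal_point_mass` is a stand-in for the mixture construction, which is given in Appendix B.2 and is not reproduced here; it simply places all mass on the AI's marginal of $\hat{z}(h_{\tilde{T}})$.

```python
import random


def empirical_joint_strategy(history, n_actions):
    """z_hat: matrix of joint-action frequencies over a non-empty history."""
    z_hat = [[0.0] * n_actions for _ in range(n_actions)]
    for a_ai, a_partner in history:
        z_hat[a_ai][a_partner] += 1.0 / len(history)
    return z_hat


def marginal_point_mass(z_hat):
    """Placeholder mixture (NOT the Appendix B.2 construction).

    Returns (strategies, weights) with all weight on the AI's marginal.
    """
    return [[sum(row) for row in z_hat]], [1.0]


def imitate_then_commit(policy, theta, partner_act, make_mixture,
                        n_actions, t_tilde, horizon, rng):
    """Run one episode; `policy` is the imitation strategy from step 1."""
    history = []
    # Step 2: follow the imitation strategy for the first t_tilde steps.
    for _ in range(t_tilde):
        probs = policy(tuple(history), theta)
        a_ai = rng.choices(range(n_actions), weights=probs)[0]
        history.append((a_ai, partner_act(tuple(history))))
    # Step 3: build a mixture from z_hat and sample one mixed strategy x.
    z_hat = empirical_joint_strategy(history, n_actions)
    strategies, weights = make_mixture(z_hat)
    x = rng.choices(strategies, weights=weights)[0]
    # Step 4: commit to x for the remaining horizon - t_tilde steps.
    for _ in range(horizon - t_tilde):
        a_ai = rng.choices(range(n_actions), weights=x)[0]
        history.append((a_ai, partner_act(tuple(history))))
    return history, z_hat, x


# Toy run: uniform "imitation" policy and a partner that alternates actions.
rng = random.Random(0)
history, z_hat, x = imitate_then_commit(
    policy=lambda h, theta: [0.5, 0.5],
    theta="A",
    partner_act=lambda h: len(h) % 2,
    make_mixture=marginal_point_mass,
    n_actions=2, t_tilde=4, horizon=10, rng=rng,
)
print(z_hat, x)
```

Passing the mixture construction as a callable keeps the two phases separate, so the placeholder can be replaced without changing the rest of the loop.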

We then have the following upper bound on the altruistic regret achievable with an imitate-then-commit strategy:

Theorem 4.3.

Given that Assumption 4.1 holds for $\rho$, there exists a data-dependent strategy $\pi^{\text{IC}}(\mathcal{D})$ such that when played by the AI as agent $2$, the altruistic regret satisfies

$$\text{E}\left[R^{\text{alt}}_{1}(h_{T},\theta_{2})\right]\leq 2\delta+\delta(K)+\left(2\frac{T-\tilde{T}}{T}+1\right)\epsilon, \tag{10}$$

where $K=|\mathcal{D}|$ and $\delta(K)$ is defined as

$$\delta(K)=\min\left\{\tilde{T},\frac{N^{2(\tilde{T}+1)}|\Theta|\tilde{T}^{2}\log(K)}{K}\right\} \tag{11}$$

and where the expectation is taken over $h_{T}$, $\boldsymbol{\theta}$, and $\mathcal{D}$.
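To get a feel for the scale of these quantities, the right-hand sides of Equations 10 and 11 can be evaluated numerically. The sketch below assumes a natural logarithm (the base is not stated here) and uses illustrative parameter values that are not taken from the paper; the function names are hypothetical.

```python
import math


def delta_K(K, n_actions, n_types, t_tilde):
    """Right-hand side of Equation 11 (natural log assumed)."""
    rate = (n_actions ** (2 * (t_tilde + 1)) * n_types * t_tilde ** 2
            * math.log(K) / K)
    return min(t_tilde, rate)


def altruistic_regret_bound(K, n_actions, n_types, t_tilde, horizon,
                            delta, eps):
    """Right-hand side of Equation 10."""
    return (2 * delta + delta_K(K, n_actions, n_types, t_tilde)
            + (2 * (horizon - t_tilde) / horizon + 1) * eps)


# Illustrative values only: 2 actions, 2 types, 3 imitation steps.
for K in (10 ** 3, 10 ** 6, 10 ** 9):
    print(K,
          delta_K(K, n_actions=2, n_types=2, t_tilde=3),
          altruistic_regret_bound(K, 2, 2, 3, horizon=100,
                                  delta=0.05, eps=0.1))
```

Because of the factor $N^{2(\tilde{T}+1)}$, the second term of the minimum only becomes smaller than $\tilde{T}$ once $K$ is very large, even for these small values of $N$ and $\tilde{T}$.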

Proof sketch:

By Lemma 4.2, we can learn an imitation strategy such that the corresponding distribution over $h_{\tilde{T}}$ and $\hat{z}(h_{\tilde{T}})$ is close to that under $\rho$ in self-play. As $\rho$ is compatible, both agents’ payoffs under $\hat{z}(h_{\tilde{T}})$ must be close to those under some PONE. Finally, we can construct a mixture $\nu$ for agent $1$ such that agent $2$’s payoffs under its (approximate) best-response are almost as large as those under $\hat{z}(h_{\tilde{T}})$ (see Appendix B.2).

5 Related Work

Our work is closely related to the earlier targeted learning model [1, 13, 14], which defines similar compatibility and consistency criteria. The notion of targeted optimality [15] requires convergence, with high probability and in a tractable number of steps, to an approximate best response against a population of memory-bounded adaptive agents. The main difference from our work is that targeted learning only requires consistency against a specific target class of partners, which generally would not include the agent itself or other adaptive agents. We require socially intelligent agents to be consistent against all possible partner strategies. We also require that cooperation and consistent learning occur over a fixed time horizon $T$, rather than asymptotically. These differences mean that a hypothetical “universally cooperative” agent might be able to leverage the consistency of its partner to achieve cooperation without a prearranged convention. Socially intelligent agents can be modeled as individually rational learners [16] that achieve Pareto-efficient joint behavior. Our research builds on this work by considering a learning setting in which the agent, when paired with any member of the population, will with high probability achieve at least the same utility as under some Pareto-efficient equilibrium strategy.

The problem of training agents to cooperate with previously unseen partners is sometimes referred to as ad hoc teamwork [17, 18] or zero-shot coordination [19], especially in the context of multiagent reinforcement learning. Many approaches in reinforcement learning train cooperative policies that are robust to the possible strategies a human or an AI agent might follow [20]. Many of these methods build a “population” of partner strategies and maximize the diversity of this population in order to train the AI’s policy against it [21, 22]. Other approaches assume that there is no prior coordination between the agents [19], and learn rational joint strategies while estimating the agents’ mutual uncertainty about one another’s strategies [23]. In ad hoc multiagent coordination, the “other-play” algorithm [19] learns to cooperate with other AI agents by finding a strategy that solves the corresponding label-free coordination problem [23]. Another possible approach is self-play [24], where agents optimize their strategies by playing against past iterations of themselves in order to estimate the strategies of unseen partners. However, self-play can learn cooperative strategies that “over-fit” [25] to one another within the population of agents. A key goal of ad hoc teamwork and related research on zero-shot coordination has been to avoid this type of overfitting [26]. Our problem domain is closely related to both ad hoc teamwork and zero-shot coordination, since we consider training an agent to cooperate with previously unseen partners and assume no control over the partner. Although population-based training approaches to ad hoc teamwork are common, they focus on fully cooperative environments such as Dec-POMDPs, where the main issue is creating a sufficiently diverse population to train with [27]. We consider partners that are self-interested, and do not assume identical payoffs.

Finally, in the case of Hannan-consistent partners, our problem setting is closely related to work on strategizing against and learning to manipulate no-regret learners [28, 9]. This line of work studies whether an optimizer agent can achieve a better payoff than that of a CCE against no-regret learners by learning to enforce a Stackelberg equilibrium on them. Its emphasis is on online learning and the optimizer’s payoff, while we focus on the offline setting and cooperation.

6 Conclusion

We provide formal guarantees for successful and reliable cooperation of AI agents with populations of socially intelligent rational agents. These guarantees rest on the assumptions that 1) agents in the population are individually rational, and 2) agents in the population, when cooperating with another agent from the same population, achieve at least the same utility that they would under some Pareto-efficient equilibrium strategy. We formalize the notions of consistency and cooperative compatibility of agents in two-player, general-sum, finitely-repeated bimatrix games between the agent and members of the population, with private types. Our theoretical guarantees hold in the offline cooperation setting, where the agent must cooperate with unseen partners from the population, a setting related to strategizing against and manipulating no-regret policies, and for which we formalize the idea of altruistic regret. We prove that these assumptions on their own are insufficient to ensure zero-shot cooperation with members of the socially intelligent target population. We provide upper bounds on the sample complexity needed to learn a successful cooperation strategy, along with lower bounds, in terms of the population’s trajectories, the state space, and the length of the learning episodes. These bounds, which cover both the setting in which the agent actively queries the MDP without knowing the population’s transition dynamics and the setting in which the agent observes those dynamics, are much stronger than the bounds that can be derived by naively reducing the cooperation problem to one of reinforcement learning. These complexity analyses and formally proven bounds may be helpful for modeling the alignment problem for AI agents.

References

  • Powers and Shoham [2004] Rob Powers and Yoav Shoham. New criteria and a new algorithm for learning in multi-agent systems. Advances in Neural Information Processing Systems, 17, 2004.
  • Shoham and Leyton-Brown [2008] Yoav Shoham and Kevin Leyton-Brown. Multiagent systems: Algorithmic, game-theoretic, and logical foundations. Cambridge University Press, 2008.
  • Hannan [1957] James Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3(2):97–140, 1957.
  • Cesa-Bianchi and Lugosi [2006] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge university press, 2006.
  • Mas-Colell et al. [1995] Andreu Mas-Colell, Michael Dennis Whinston, Jerry R Green, et al. Microeconomic theory, volume 1. Oxford university press New York, 1995.
  • Freund and Schapire [1999] Yoav Freund and Robert E Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79–103, 1999.
  • Aumann [1964] Robert J. Aumann. 28. Mixed and Behavior Strategies in Infinite Extensive Games, pages 627–650. Princeton University Press, Princeton, 1964. ISBN 9781400882014. doi: 10.1515/9781400882014-029. URL https://doi.org/10.1515/9781400882014-029.
  • Monnot and Piliouras [2017] Barnabé Monnot and Georgios Piliouras. Limits and limitations of no-regret learning in games. The Knowledge Engineering Review, 32:e21, 2017.
  • Brown et al. [2024] William Brown, Jon Schneider, and Kiran Vodrahalli. Is learning in games good for the learners? Advances in Neural Information Processing Systems, 36, 2024.
  • Loftin and Oliehoek [2022] Robert Loftin and Frans A Oliehoek. On the impossibility of learning to cooperate with adaptive partner strategies in repeated games. In International Conference on Machine Learning, pages 14197–14209. PMLR, 2022.
  • Rajaraman et al. [2020] Nived Rajaraman, Lin Yang, Jiantao Jiao, and Kannan Ramchandran. Toward the fundamental limits of imitation learning. Advances in Neural Information Processing Systems, 33:2914–2924, 2020.
  • Ciosek [2022] Kamil Ciosek. Imitation learning by reinforcement learning. In International Conference on Learning Representations, 2022.
  • Powers and Shoham [2005] Rob Powers and Yoav Shoham. Learning against opponents with bounded memory. In The Nineteenth International Joint Conference on Artificial Intelligence, pages 817–822, 2005.
  • Chakraborty and Stone [2010a] Doran Chakraborty and Peter Stone. Convergence, targeted optimality and safety in multiagent learning. In Proceedings of the Twenty-seventh International Conference on Machine Learning (ICML 2010), June 2010a. URL http://www.cs.utexas.edu/users/ai-lab?chakraborty:icml10.
  • Chakraborty and Stone [2010b] Doran Chakraborty and Peter Stone. Convergence, targeted optimality, and safety in multiagent learning. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, page 191–198, Madison, WI, USA, 2010b. Omnipress. ISBN 9781605589077.
  • Loftin et al. [2023] Robert Loftin, Mustafa Mert Çelikok, and Frans A. Oliehoek. Towards a unifying model of rationality in multiagent systems, 2023.
  • Stone et al. [2010] Peter Stone, Gal Kaminka, Sarit Kraus, and Jeffrey Rosenschein. Ad hoc autonomous agent teams: Collaboration without pre-coordination. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 24, pages 1504–1509, 2010.
  • Mirsky et al. [2022] Reuth Mirsky, Ignacio Carlucho, Arrasy Rahman, Elliot Fosong, William Macke, Mohan Sridharan, Peter Stone, and Stefano V Albrecht. A survey of ad hoc teamwork research. In European conference on multi-agent systems, pages 275–293. Springer, 2022.
  • Hu et al. [2020] Hengyuan Hu, Adam Lerer, Alex Peysakhovich, and Jakob Foerster. “other-play” for zero-shot coordination. In International Conference on Machine Learning, pages 4399–4410. PMLR, 2020.
  • Carroll et al. [2019] Micah Carroll, Rohin Shah, Mark K Ho, Tom Griffiths, Sanjit Seshia, Pieter Abbeel, and Anca Dragan. On the utility of learning about humans for human-ai coordination. Advances in Neural Information Processing Systems, 32:5174–5185, 2019.
  • Strouse et al. [2021a] DJ Strouse, Kevin McKee, Matt Botvinick, Edward Hughes, and Richard Everett. Collaborating with humans without human data. Advances in Neural Information Processing Systems, 34:14502–14515, 2021a.
  • Cui et al. [2023] Brandon Cui, Andrei Lupu, Samuel Sokota, Hengyuan Hu, David J Wu, and Jakob Nicolaus Foerster. Adversarial diversity in hanabi. In The Eleventh International Conference on Learning Representations, 2023.
  • Treutlein et al. [2021] Johannes Treutlein, Michael Dennis, Caspar Oesterheld, and Jakob Foerster. A new formalism, method and open issues for zero-shot coordination. In International Conference on Machine Learning, pages 10413–10423. PMLR, 2021.
  • Zand et al. [2022] Jaleh Zand, Jack Parker-Holder, and Stephen J. Roberts. On-the-fly strategy adaptation for ad-hoc agent coordination. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, AAMAS ’22, page 1771–1773, Richland, SC, 2022. International Foundation for Autonomous Agents and Multiagent Systems. ISBN 9781450392136.
  • Strouse et al. [2021b] DJ Strouse, Kevin McKee, Matt Botvinick, Edward Hughes, and Richard Everett. Collaborating with humans without human data. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 14502–14515. Curran Associates, Inc., 2021b. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/797134c3e42371bb4979a462eb2f042a-Paper.pdf.
  • Cui et al. [2021] Brandon Cui, Hengyuan Hu, Luis Pineda, and Jakob Foerster. K-level reasoning for zero-shot coordination in hanabi. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 8215–8228. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/4547dff5fd7604f18c8ee32cf3da41d7-Paper.pdf.
  • Rahman et al. [2024] Muhammad Rahman, Jiaxun Cui, and Peter Stone. Minimum coverage sets for training robust ad hoc teamwork agents. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17523–17530, 2024.
  • Deng et al. [2019] Yuan Deng, Jon Schneider, and Balasubramanian Sivan. Strategizing against no-regret learners. Advances in neural information processing systems, 32, 2019.
  • Hoeffding [1963] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
  • von Stengel and Zamir [2004] Bernhard von Stengel and Shmuel Zamir. Leadership with commitment to mixed strategies. 2004.

Acknowledgments and Disclosure of Funding

This work has been supported by the Hybrid Intelligence Center, https://hybrid-intelligence-centre.nl, grant number 024.004.022.

Appendix A Proofs for Section 2

A.1 Proof of Lemma 2.4

Here the joint type $\boldsymbol{\theta}$ will be implicit. For $i\in\{1,2\}$, we define $V^{i}_{t}$ as

$$V^{i}_{t}=G_{i}(s^{i}_{t},s^{-i}_{t})-G_{i}(s^{i}_{t},a^{-i}_{t}). \tag{12}$$

We can see that $\text{E}[V^{i}_{t}\mid h_{t-1}]=0$. We then have that

\begin{align}
\bar{R}^{\text{ext}}_{t} &= \max_{a\in N}\sum^{t}_{r=1}\left\{G_{i}(a,s^{-i}_{r})-G_{i}(s^{i}_{r},a^{-i}_{r})\right\} \tag{13}\\
&= \max_{a\in N}\sum^{t}_{r=1}\left\{G_{i}(a,s^{-i}_{r})-G_{i}(s^{i}_{r},s^{-i}_{r})+G_{i}(s^{i}_{r},s^{-i}_{r})-G_{i}(s^{i}_{r},a^{-i}_{r})\right\} \tag{14}\\
&= \sum^{t}_{r=1}\left\{G_{i}(s^{i}_{r},s^{-i}_{r})-G_{i}(s^{i}_{r},a^{-i}_{r})\right\}=\sum^{t}_{r=1}V^{i}_{r} \tag{15}\\
&\leq \sqrt{\frac{2}{T}\ln\frac{1}{\delta}} \tag{16}
\end{align}

with probability $1-\delta$ for all $t\leq T$ simultaneously.

This follows from the fact that $|V^{i}_{t}|\in[0,1]$ and the “maximal” Azuma-Hoeffding inequality [29]. The second equality follows from the fact that $\langle s^{i}_{t},s^{-i}_{t}\rangle=s(\theta)$ is a Nash equilibrium. The first bound of Lemma 2.4 follows from a union bound over the probability for both players, while the second bound combines this with Equation 3. $\square$
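As a quick check of how the right-hand side of Equation 16 scales, one can solve for the horizon needed to push it below a target value $\epsilon$. The numerical values below are illustrative choices, not values from the paper.

```latex
\sqrt{\frac{2}{T}\ln\frac{1}{\delta}} \le \epsilon
\quad\Longleftrightarrow\quad
T \ge \frac{2}{\epsilon^{2}}\ln\frac{1}{\delta},
\qquad\text{e.g. } \delta = 0.05,\ \epsilon = 0.1
\ \Rightarrow\ T \ge 200\ln 20 \approx 599.1 .
```

So, for these values, a horizon of roughly 600 steps is enough for the right-hand side of Equation 16 to fall below $0.1$ with probability $0.95$.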

A.2 Proof of Theorem 2.5

Theorem A.1 (2.6).

For any $\delta, T>k$, let $\epsilon_{0}\geq\sqrt{\frac{2}{(T-k)}\ln\frac{2}{\delta}}$, and let $\epsilon_{1}=\epsilon_{0}+\sqrt{\frac{1}{2(T-k)}\ln N}+\frac{1}{(T-k)}$. Then for $\epsilon=\epsilon_{1}+\sqrt{\frac{(T-k)}{2}\ln\frac{1}{\delta}}$, the strategy $\pi^{T,\epsilon_{1}}$ is $(\delta,\epsilon,T)$-socially intelligent.

Proof. By the definition of $\epsilon_{1}$, $\pi^{T,\epsilon_{1}}$ will only deviate when playing with itself if at some point $k<t\leq T$ one player incurs an expected external regret of at least $\epsilon_{0}$, and by Lemma 2.4 that will occur with probability at most $\delta$. Therefore, $\pi^{T,\epsilon_{1}}$ is $(\delta,\epsilon_{0},T)$-compatible. We also have that the total expected external regret of the MW agent $\pi^{\text{mw},T}$ is at most $\sqrt{(T/2)\ln N}$.
This means that if $\pi^{T,\epsilon_{1}}$ switches at stage $t$, then the maximum possible expected external regret incurred by $\pi^{T,\epsilon_{1}}$ will be less than $\bar{R}^{\text{ext}}_{i}(h_{t};\theta)+\sqrt{\frac{T}{2}\ln N}$. Since $\pi^{\text{mw},T}$ will always switch just before this point is reached, its total expected regret will be less than $\epsilon_{1}$ surely, and will be less than $\epsilon$ with probability $1-\delta$. As $\epsilon\geq\epsilon_{0}$, we have that $\pi^{T,\epsilon_{1}}$ is $(\delta,\epsilon,T)$-socially intelligent.

Appendix B Proofs for Section 4

B.1 Proof of Lemma 4.2

We first apply Theorem 4.4 of [11], which states that, for episodic imitation learning over $H$-step trajectories, for any expert policy $\pi^{*}$ we have

$$J(\pi^{*})-\text{E}_{\mathcal{D}}\left[J(\hat{\pi}^{1}_{\tilde{T}}(h;\theta,\mathcal{D}))\right]\leq\min\left\{H,\frac{|S|H^{2}\log(K)}{K}\right\}, \tag{17}$$

where $S$ is the state space, with per-step rewards bounded in $[0,1]$. We can model the interaction with $\rho$ as a $\tilde{T}$-step episodic MDP/R with $S=\mathcal{H}_{\leq\tilde{T}}$. Plugging in $H=\tilde{T}$, $|S|<N^{2(\tilde{T}+1)}$, and $\pi^{*}=\rho$ gives us

$$J(\rho)-\text{E}_{\mathcal{D}}\left[J(\hat{\pi}^{1}_{\tilde{T}}(h;\theta,\mathcal{D}))\right]\leq\min\left\{\tilde{T},\frac{N^{2(\tilde{T}+1)}|\Theta|\tilde{T}^{2}\log(K)}{K}\right\}. \tag{18}$$

This bound holds simultaneously for all possible reward functions bounded in $[0,1]$. If we restrict the reward function $r$ to be non-zero only for the terminal states $\mathcal{H}_{\tilde{T}}$, we have

$$J(\pi^{*})-\text{E}_{\mathcal{D}}\left[J(\hat{\pi}^{1}_{\tilde{T}}(h;\theta,\mathcal{D}))\right]=\text{E}_{p_{\tilde{T}}}[r(h_{\tilde{T}})]-\text{E}_{\hat{p}_{\tilde{T}}}[r(h_{\tilde{T}})], \tag{19}$$

using the definition of the marginal strategy $\hat{\pi}^{1}_{\tilde{T}}$. Finally, applying Lemma 1 of [12] gives us

$$\|p_{\tilde{T}}-\hat{p}_{\tilde{T}}\|_{\text{TV}}\leq\min\left\{\tilde{T},\frac{N^{2(\tilde{T}+1)}|\Theta|\tilde{T}^{2}\log(K)}{K}\right\}, \tag{20}$$

the desired result. ∎
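The passage from the reward gap in (19) to a total variation bound rests on a standard fact: for two distributions $p$ and $q$ over a finite set, the largest value of $\text{E}_{p}[r]-\text{E}_{q}[r]$ over reward functions $r$ with values in $[0,1]$ equals the total variation distance, and is attained by the indicator of $\{h : p(h)>q(h)\}$. The following minimal Python sketch checks this on a toy example. It assumes the convention $\|p-q\|_{\text{TV}}=\frac{1}{2}\sum_h|p(h)-q(h)|$; the function names and the two distributions are illustrative and do not come from the paper.

```python
def tv_distance(p, q):
    """Total variation distance under the convention (1/2) * sum |p - q|."""
    return 0.5 * sum(abs(p[h] - q[h]) for h in p)


def best_reward_gap(p, q):
    """Largest E_p[r] - E_q[r] over rewards r with values in [0, 1].

    The gap is linear in r, so it is maximised by the indicator
    r(h) = 1 if p(h) > q(h), and r(h) = 0 otherwise.
    """
    return sum(p[h] - q[h] for h in p if p[h] > q[h])


# Toy distributions over three hypothetical terminal histories.
p = {"h1": 0.5, "h2": 0.3, "h3": 0.2}
q = {"h1": 0.2, "h2": 0.3, "h3": 0.5}

gap = best_reward_gap(p, q)
tv = tv_distance(p, q)
```

Because the two quantities coincide, a bound that holds simultaneously for every terminal reward in $[0,1]$ is also a bound on the total variation distance between the induced distributions over terminal histories.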

B.2 Proof of Theorem 4.3

First, let $\tau^{2}(\boldsymbol{\theta})$, defined as

$$\tau^{2}(\boldsymbol{\theta})=\min_{\langle\sigma^{1},\sigma^{2}\rangle\in\mathcal{P}(G(\boldsymbol{\theta}))}G(\sigma^{2},\sigma^{1};\theta^{2}), \tag{21}$$

denote agent 2's worst possible payoff over the PONEs of the game parameterized by joint type $\boldsymbol{\theta}$. Let $\mathcal{C}$ denote the event that

$$\tau^{2}(\boldsymbol{\theta})-\frac{1}{\tilde{T}}\sum^{\tilde{T}}_{t=1}G(a^{2}_{t},a^{1}_{t};\theta_{2})\leq\epsilon. \tag{22}$$

Because $\rho$ is $(\delta,\epsilon,\tilde{T})$-compatible, we have that $\text{Pr}_{\rho}\{\mathcal{C}\}\geq 1-\delta$. For $\delta(K)$ defined as

$$\delta(K)=\min\left\{\tilde{T},\frac{N^{2(\tilde{T}+1)}|\Theta|\tilde{T}^{2}\log(K)}{K}\right\}, \tag{23}$$

Lemma 4.2 also gives us $\text{Pr}_{\hat{\pi}^{1},\rho}\{\mathcal{C}\}\geq 1-\delta-\delta(K)$. We therefore have that

\begin{align}
\text{E}_{\hat{\pi}^{1},\rho}\left[\sum^{T}_{t=1}G(a^{2}_{t},a^{1}_{t};\theta_{2})\right]
&\geq\text{E}_{\hat{\pi}^{1},\rho}\left[\sum^{T}_{t=1}G(a^{2}_{t},a^{1}_{t};\theta_{2})\,\middle|\,\mathcal{C}\right]-T(\delta+\delta(K)) \tag{24}\\
&=\text{E}_{\hat{\pi}^{1},\rho}\left[\sum^{\tilde{T}}_{t=1}G(a^{2}_{t},a^{1}_{t};\theta_{2})\,\middle|\,\mathcal{C}\right]+\text{E}_{\hat{\pi}^{1},\rho}\left[\sum^{T}_{t=\tilde{T}+1}G(a^{2}_{t},a^{1}_{t};\theta_{2})\,\middle|\,\mathcal{C}\right]-T(\delta+\delta(K)) \tag{25}\\
&\geq\text{E}_{\hat{\pi}^{1},\rho}\left[\sum^{T}_{t=\tilde{T}+1}G(a^{2}_{t},a^{1}_{t};\theta_{2})\,\middle|\,\mathcal{C}\right]+\tilde{T}\left(\tau^{2}(\boldsymbol{\theta})-\epsilon\right)-T(\delta+\delta(K)), \tag{26}
\end{align}
where the last step uses the fact that, on the event $\mathcal{C}$, the first $\tilde{T}$ stages yield a total payoff of at least $\tilde{T}(\tau^{2}(\boldsymbol{\theta})-\epsilon)$.

We therefore need to lower-bound the term

$$\text{E}_{\hat{\pi}^{1},\rho}\left[\sum^{T}_{t=\tilde{T}+1}G(a^{2}_{t},a^{1}_{t};\theta_{2})\,\middle|\,\mathcal{C}\right]. \tag{27}$$

This will be the expected payoff given the strategy $x\sim\nu$ the AI commits to for the remaining $T-\tilde{T}$ steps. The idea now is that we can construct a mixture $\nu$ over strategies that the AI can commit to for the remaining $T-\tilde{T}$ steps such that the partner's payoff under their (approximate) best-response will be nearly as good as that under $\hat{z}(h_{\tilde{T}})$.

Let $G(z;\theta^{2})=\sum_{i\in N}\sum_{j\in N}z_{ij}G(j,i;\theta^{2})$ be agent 2's expected payoff under $z$. For any joint strategy $z$, we can construct $\nu$ such that if the AI commits to strategies sampled from $\nu$, the partner will have the same information about the AI's probable actions as they would given their "recommended" action under $\hat{z}(h_{\tilde{T}})$. We build on the construction used by von Stengel and Zamir [30]. For any joint strategy $z$, we let $z_{j}=\sum_{i\in N}z_{ij}$ denote the marginal probability that the column player (agent 2) plays $j$ under $z$. For all $j\in N$ such that $z_{j}>0$, we define $x_{j}$ as the conditional distribution over the row player's (agent 1's) actions given that the column player plays $j$, such that $x_{j}(i)=\frac{z_{ij}}{z_{j}}$. We then define $\nu$ as the strategy that commits to each $x_{j}$ with probability $z_{j}$.
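The construction of $\nu$ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the joint strategy is stored as a nested list `z[i][j]` (row index for agent 1, column index for agent 2), exact arithmetic is done with `fractions.Fraction`, and the function name `commitment_mixture` and the example matrix are hypothetical.

```python
from fractions import Fraction


def commitment_mixture(z):
    """Build the mixture nu from a joint strategy z.

    z[i][j] is the probability that agent 1 (row) plays i and agent 2
    (column) plays j. Returns a list of (z_j, x_j) pairs, one per column j
    with z_j > 0, where x_j(i) = z[i][j] / z_j.
    """
    n_rows, n_cols = len(z), len(z[0])
    nu = []
    for j in range(n_cols):
        z_j = sum(z[i][j] for i in range(n_rows))  # marginal of column j
        if z_j > 0:
            # Conditional distribution over agent 1's actions given column j.
            x_j = [z[i][j] / z_j for i in range(n_rows)]
            nu.append((z_j, x_j))
    return nu


# Hypothetical correlated joint strategy for a 2x2 game.
z = [[Fraction(1, 2), Fraction(0)],
     [Fraction(0), Fraction(1, 2)]]
nu = commitment_mixture(z)
```

Columns whose conditional strategies coincide appear as separate entries carrying the same $x_{j}$; merging them and summing their weights leaves the distribution over committed strategies unchanged.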

We can show that, when the partner plays a best-response to $x\sim\nu$, their payoff will be no worse than under $z$ itself. We first construct a response function $r_{z}$ such that when agent 2 responds to $x\sim\nu$ with $r_{z}(x)$, its expected payoff equals $G(z;\theta^{2})$. Let $S=\{j\in N:z_{j}>0\}$, and partition $S$ into $\mathcal{P}$ such that, for each $P\in\mathcal{P}$, we have $x_{j}=x_{l}$ for all $j,l\in P$. For each $P\in\mathcal{P}$, we then define the strategy $y_{P}$ such that

$$y_{P}(j)=\frac{z_{j}}{\sum_{l\in P}z_{l}} \tag{28}$$

for each $j\in P$, with $y_{P}(j)=0$ for $j\notin P$. (Note that if $z$ corresponds to some uncorrelated strategy $\langle x,y\rangle$, then $P=N$ and $y_{P}=y$.) Finally, for $j\in S$, we define $P(j)$ as the cell of the partition containing $j$, and define $r_{z}$ such that $r_{z}(x_{j})=y_{P(j)}$. We leave $r_{z}$ undefined for $x$ where $\nu(x)=0$. Now let $x_{P}$ be the common conditional strategy for all $j\in P$, and let $z_{P}=\sum_{j\in P}z_{j}$. We then have that

\begin{align}
\text{E}_{x\sim\nu}G(r_{z}(x),x;\theta^{2})
&=\sum_{j\in S}z_{j}\left[x^{\top}_{j}G(\theta^{2})^{\top}r_{z}(x_{j})\right] \tag{29}\\
&=\sum_{P\in\mathcal{P}}z_{P}\left[x^{\top}_{P}G(\theta^{2})^{\top}y_{P}\right] \tag{30}\\
&=\sum_{P\in\mathcal{P}}z_{P}\left(\sum_{i\in N}\sum_{j\in N}x^{\top}_{P}(i)\,y_{P}(j)\,G(\theta^{2})^{\top}_{ij}\right) \tag{31}\\
&=\sum_{P\in\mathcal{P}}z_{P}\left(\sum_{i\in N}\sum_{j\in N}\text{Pr}_{z}\{i|P\}\,\text{Pr}_{z}\{j|P\}\,G(\theta^{2})^{\top}_{ij}\right) \tag{32}\\
&=\sum_{P\in\mathcal{P}}z_{P}\left(\sum_{i\in N}\sum_{j\in N}\text{Pr}_{z}\{i,j|P\}\,G(\theta^{2})^{\top}_{ij}\right) \tag{33}\\
&=\sum_{i\in N}\sum_{j\in N}z_{ij}G(\theta^{2})^{\top}_{ij}=G(z;\theta^{2}) \tag{34}
\end{align}

where we have used the fact that $i$ and $j$ are independent given that $j\in P$. Next, for any best-response function $r^{*}$, we have

\begin{equation}
\begin{split}
G(z;\theta^{2})&=\text{E}_{x\sim\nu}G(r_{z}(x),x;\theta^{2})\\
&=\text{E}_{x\sim\nu}\left[x^{\top}G(\theta^{2})^{\top}r_{z}(x)\right]\\
&\leq\text{E}_{x\sim\nu}\left[\max_{y\in\Delta(N)}x^{\top}G(\theta^{2})^{\top}y\right]\\
&=\text{E}_{x\sim\nu}\left[x^{\top}G(\theta^{2})^{\top}r^{*}(x)\right]\\
&=\text{E}_{x\sim\nu}G(r^{*}(x),x;\theta^{2}).
\end{split}
\tag{35}
\end{equation}
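The two facts just derived, that the recommended response recovers agent 2's payoff under $z$ and that a best response can only do better, can be checked numerically. The sketch below is illustrative only. It assumes agent 2's payoffs are stored as `g2[i][l]` (agent 1 plays `i`, agent 2 plays `l`), it implements the recommended response as $y_{P}$ on each cell $P$ of columns sharing the same conditional strategy, and the function names and example matrices are hypothetical.

```python
from fractions import Fraction as F


def payoff_under_z(z, g2):
    """Agent 2's expected payoff under the joint strategy z."""
    return sum(z[i][j] * g2[i][j]
               for i in range(len(z)) for j in range(len(z[0])))


def commitment_payoffs(z, g2):
    """Agent 2's payoff under the recommended response and a best response."""
    rows, cols = range(len(z)), range(len(z[0]))
    z_col = [sum(z[i][j] for i in rows) for j in cols]
    support = [j for j in cols if z_col[j] > 0]
    x = {j: tuple(z[i][j] / z_col[j] for i in rows) for j in support}

    # Group columns whose conditional strategies x_j coincide.
    cells = {}
    for j in support:
        cells.setdefault(x[j], []).append(j)

    recommended, best = F(0), F(0)
    for x_p, cell in cells.items():
        z_p = sum(z_col[j] for j in cell)
        # Agent 2's expected payoff for each pure action l against x_p.
        value = [sum(x_p[i] * g2[i][l] for i in rows) for l in cols]
        # Recommended response: play j in the cell with probability z_j / z_P.
        recommended += z_p * sum((z_col[j] / z_p) * value[j] for j in cell)
        # A best response attains the maximum over pure actions.
        best += z_p * max(value)
    return recommended, best


# Hypothetical joint strategy and payoff matrix for agent 2.
z = [[F(1, 2), F(1, 4)],
     [F(0), F(1, 4)]]
g2 = [[1, 3],
      [0, 2]]
recommended, best = commitment_payoffs(z, g2)
```

On this toy input the recommended response yields $7/4$, which equals agent 2's payoff under $z$, while a best response yields $11/4$.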

Therefore, so long as the partner plays a best-response to the AI's chosen strategy, they will achieve at least the same payoff (in expectation) as they would under the strategy $z$ from which $\nu$ was computed. Note however that $\rho$ will be (approximately) consistent over the full $T$ steps, not just the last $T-\tilde{T}$. Define $\alpha=\frac{\tilde{T}}{T}$ and $\beta=\frac{T-\tilde{T}}{T}$, and let $z^{1}$ be agent 1's marginal strategy under $z$. With probability $1-\delta$, $\rho$ will play an $\epsilon$-best-response to the mixture $\alpha\hat{z}(h_{\tilde{T}})^{1}+\beta x$, with $x\sim\nu$.

Let $\mathcal{C}^{\prime}$ be the event that $\rho$ is $\epsilon$-consistent over $T$ steps. We then have that

\begin{align}
\text{E}_{\hat{\pi}^{1},\rho}\left[\sum^{T}_{t=\tilde{T}+1}G(a^{2}_{t},a^{1}_{t};\theta_{2})\,\middle|\,\mathcal{C}\right]
&\geq\text{E}_{\hat{\pi}^{1},\rho}\left[\sum^{T}_{t=\tilde{T}+1}G(a^{2}_{t},a^{1}_{t};\theta_{2})\,\middle|\,\mathcal{C},\mathcal{C}^{\prime}\right]-T\delta \tag{36}\\
&\geq(T-\tilde{T})\left(\tau^{2}(\boldsymbol{\theta})-2\epsilon\right)-T\delta. \tag{37}
\end{align}

Finally, dividing by $T$ and subtracting from $\tau^{2}(\boldsymbol{\theta})$, we get

$$\text{E}\left[R^{\text{alt}_{1}}(h_{T},\theta_{2})\right]\leq 2\delta+\delta(K)+\left(2\frac{T-\tilde{T}}{T}+1\right)\epsilon, \tag{38}$$

the desired result.
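To get a feel for how the bound (38) behaves, the following Python sketch evaluates its right-hand side, with $\delta(K)$ taken from (23). It is only an illustration: the natural logarithm is assumed for $\log$, and the function names and parameter values are hypothetical rather than taken from the paper.

```python
import math


def delta_k(n_actions, n_types, t_tilde, k):
    """Right-hand side of (23), assuming the natural logarithm."""
    rate = (n_actions ** (2 * (t_tilde + 1)) * n_types
            * t_tilde ** 2 * math.log(k)) / k
    return min(t_tilde, rate)


def regret_bound(delta, epsilon, t, t_tilde, n_actions, n_types, k):
    """Right-hand side of (38)."""
    imitation_term = delta_k(n_actions, n_types, t_tilde, k)
    consistency_term = (2 * (t - t_tilde) / t + 1) * epsilon
    return 2 * delta + imitation_term + consistency_term


# Hypothetical setting: 2 actions, 2 types, imitate for 2 of 100 stages.
few_samples = regret_bound(0.01, 0.05, 100, 2, 2, 2, 10)
many_samples = regret_bound(0.01, 0.05, 100, 2, 2, 2, 10 ** 6)
```

With few observed episodes the $\min$ in (23) saturates at $\tilde{T}$ and the bound is vacuous. Because the second argument of the $\min$ grows as $N^{2(\tilde{T}+1)}$, the number of episodes needed to leave that regime grows exponentially in the length $\tilde{T}$ of the imitation phase.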