License: CC BY-NC-SA 4.0
arXiv:2401.02991v1 [cs.CL] 03 Jan 2024

GLIDE-RL: Grounded Language Instruction through DEmonstration in RL

Chaitanya Kharyal
Microsoft
Hyderabad, India

Sai Krishna Gottipati
AI Redefined Inc
Montreal, Canada

Tanmay Kumar Sinha
Microsoft Research
Bangalore, India

Srijita Das
University of Alberta
Edmonton, Canada

Matthew E. Taylor
AI Redefined Inc
University of Alberta
Edmonton, Canada

Abstract

One of the final frontiers in the development of complex human-AI collaborative systems is the ability of AI agents to comprehend natural language and perform tasks accordingly. However, training efficient Reinforcement Learning (RL) agents grounded in natural language has been a long-standing challenge due to the complexity and ambiguity of language and the sparsity of rewards, among other factors. Several advances in reinforcement learning, curriculum learning, continual learning, and language models have independently contributed to the effective training of grounded agents in various environments. Leveraging these developments, we present a novel algorithm, Grounded Language Instruction through DEmonstration in RL (GLIDE-RL), that introduces a teacher-instructor-student curriculum learning framework for training an RL agent that can follow natural language instructions and generalize to previously unseen ones. In this multi-agent framework, the teacher and the student agents learn simultaneously based on the student’s current skill level. We further demonstrate the necessity of training the student agent with not just one, but multiple teacher agents. Experiments on a complex sparse-reward environment validate the effectiveness of our proposed approach.

Keywords Reinforcement Learning · Curriculum Learning · Grounded Language

1 Introduction

Grounded language learning can be defined as the task of learning the meaning of natural language units (e.g., utterances, phrases, or words) by leveraging sensory data (e.g., an image) Cirik et al. (2020). It is a challenging task due to the inherent complexity, context sensitivity and ambiguity of natural language. This challenge is compounded when a decision-making agent has to perform a series of actions to complete different tasks expressed in natural language. Several works in the field of goal-conditioned RL Schaul et al. (2015); Liu et al. (2022) demonstrated the challenges associated with training goal-based RL agents in very sparse reward settings. These goals may be represented in many ways: for example, as one-hot vectors, as goal coordinates in a space (e.g., Euclidean), or as a ‘goal image’ the agent needs to observe. Goal-conditioned settings also share two challenges with sparse reward RL tasks: credit assignment and sample efficiency. Representing goals in natural language can be useful because language is expressive and informative. Moreover, such goals can explicitly represent context sensitivity, which is useful for learning policies in complex domains. However, representing goals in natural language also adds complexity and ambiguity: the agent needs to understand that “grab the red ball” is the same as “fetch that maroon sphere”, and it needs to perform a series of actions before achieving an informative reward.

The above-mentioned challenges necessitate a framework involving curriculum learning Narvekar et al. (2020) for the agent to learn anything useful and to tame the challenge of sparse rewards. We thus introduce a teacher agent that proposes goals to the student agent. Unlike most prior approaches, the teacher agent here also acts in the environment, thus ensuring that all the goals it proposes are in fact reachable by another agent within a given time frame (e.g., episode length). Moreover, if the student fails to reach these goals, it can attempt to learn to clone the teacher’s trajectory. However, the bigger question remains: how is this setup useful for learning to follow natural language instructions? To address that, we introduce an instructor agent that attempts to describe the teacher’s trajectory or key events in natural language and then converts them into an instruction for the goal-conditioned student agent. This instructor agent can take any form, such as a pre-trained or trainable video captioning module Kuo et al. (2023); Xu et al. (2023) or a language model Wang et al. (2023); Du et al. (2023).

In our work, we augment the environment with natural language descriptions of the events that the teacher has triggered and then convert them into an instruction format. We use a pre-trained language model (ChatGPT-3.5) to convert each such instruction into multiple synonymous instructions for the student agent to train on. This helps in evaluating and improving the generalization capabilities of the student agent. These interactions between the agents are summarized in Figure 1.

Figure 1: GLIDE-RL Algorithm has three independently functioning parts: The teacher, The instructor and The student. (1) The teacher is an agent that acts in the environment to do complex things and gets rewarded based on the performance of the student. (2) The instructor observes these actions, describes them in the form of events and converts them to the instruction format. (3) The student is the goal-conditioned agent that strives to reach the goals set by the teacher, as described and instructed by the instructor. (4) The instructor also checks if the student has triggered the same/similar events in the environment to mark a certain state as success/failure for the student.

Contributions: The key contributions of this work include:

  1. Introducing a novel algorithm (GLIDE-RL) and the Teacher-Instructor-Student framework for training RL agents grounded in natural language on sparse reward complex tasks

  2. Thorough empirical studies demonstrating the influence of factors like curriculum, behavior cloning, multiple teachers, and type of language model on the performance of the Student agent

  3. Demonstrating generalization capabilities of the trained goal-conditioned RL agent across unseen goals and ambiguous instructions

2 Related Work

Goal conditioned RL: In goal-conditioned RL Schaul et al. (2015), a representation of the goal that the agent needs to achieve is appended to the state representation such that the value function and policy are computed using this shared representation. Historical and key developments in the field of goal-conditioned RL are well summarized in Liu et al. (2022); Colas et al. (2022). In attempts to make training more efficient, several algorithms were proposed that improve the way goals are generated and the overall training process. In Hindsight experience replay Andrychowicz et al. (2017), the target goal set was expanded with the intermediate goals that the agent actually achieved while trying to reach the original target goal. Florensa et al. (2018) used a GAN setup where the generator generates goals of appropriate difficulty for the RL agent and the discriminator identifies whether the generated goals belong to the goal space of intermediate difficulty. Campero et al. (2021) proposed a teacher-student framework where the teacher generates a curriculum of increasingly difficult goals based on the student’s skill level and the student gets rewarded implicitly on reaching the proposed goals.

Asymmetric Self Play (ASP) Sukhbaatar et al. (2018) is an unsupervised training approach where the teacher agent proposes increasingly challenging goals for the student to achieve by exploring the environment. The teacher and the student are both RL agents: the former gets rewarded for suggesting goals outside the student’s comfort zone and the latter gets rewarded for achieving the proposed goal. This was later extended by Plappert et al. (OpenAI et al., 2021) to complex robotics tasks and by Du et al. (2022) to more feasible and challenging goals using gameplay between two teachers and two students. Kharyal et al. (2023) extended ASP with multiple teachers that suggested diverse goals to help the student agent learn faster. All of this prior work conditions the goals in the same space as that of the agent. Our work incorporates natural language goals in the goal-conditioned RL agent’s representation in addition to introducing the Instructor agent, thus taking a step towards enabling humans to convey denser knowledge to these agents.

Grounded language learning: Recently, there have been many efforts to integrate natural language instructions into Reinforcement Learning. Luketina et al. (2019) categorize these directions as language conditional RL, where agent-language interaction is part of the problem formulation, and language assisted RL, where natural language is used to guide the training of RL agents. There are a few works where the agent follows instructions provided to it in natural language in the form of a high-level goal or a policy description Bahdanau et al. (2018); Hermann et al. (2017); Chaplot et al. (2018). Oh et al. (2017) proposed an approach to train on a shorter sequence of instructions and generalize to a longer sequence of instructions (seen and unseen) by learning parameterized skills and using them to execute the relevant instructions. However, this includes policy pre-training to understand the structure of these instructions. Goyal et al. (2019) and Wang et al. (2019) used the agreement between the trajectory and expert-provided language instructions for reward shaping in sparse reward tasks. Chan et al. (2019) combine Hindsight experience replay Andrychowicz et al. (2017) with natural language goal descriptions, which are provided by a teacher based on what the RL agent does in the specific episode. Du et al. (2022) take a step further and query pretrained large language models to get context-sensitive and diverse goal descriptions. Our work is aligned with these advancements in grounded language in RL but differs in the way the goals are generated and communicated: the Teacher agent acts in the environment to ensure that the goals generated are in fact reachable and within the zone of proximal development Seita et al. (2019) of the language-conditioned agent. Moreover, the Instructor agent generates a database of synonymous instructions to ensure that the language-conditioned agent generalizes to unseen instructions.

3 Background

Reinforcement Learning: The teacher and the student agents operate within the framework of a Markov decision process (MDP). An MDP is a five-element tuple $(S, A, P, R, \gamma)$, where $S$ is the set of all possible states, $A$ is the set of all possible actions, $P$ is the transition probability function, $R$ is the reward function, and $\gamma$ is the discount factor. At each time step $t$, the agent is in a state $s_t \in S$ and takes an action $a_t \in A$. The environment dynamics map the state-action pair to a successor state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$, and the agent receives a scalar reward $r_t \sim R(s_t, a_t)$. In some environments, the agents might not have access to the full state of the environment. In such cases, a partial observation $o_t$ is used as input to the RL agent.
The objective is to find a policy $\pi: S \rightarrow A$ that maximizes the expected discounted return $\mathbb{E}[\Sigma_{t=0}^{\infty} \gamma^t r_t]$ with $\gamma \in (0, 1]$. The teacher agents are trained using this simple MDP formulation, whereas the student agent is trained in a goal-conditioned setting.
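
As a small illustration of the discounted return in the objective above, the quantity can be computed for one finite trajectory as sketched below. The function name `discounted_return` and the example rewards are illustrative assumptions, not part of the paper.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma^t * r_t for one finite trajectory."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    total = 0.0
    # Accumulate backwards so each step needs one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# A sparse-reward trajectory: the only reward arrives at the last step.
sparse = discounted_return([0.0, 0.0, 1.0], gamma=0.5)
```

With a single reward at the final step, the return shrinks geometrically with the number of steps needed to reach it, which is one way to see why sparse rewards give a weak learning signal early in training.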

Goal conditioned RL deals with training agents to reach the goals determined at the beginning of an episode by the environment or an external agent (like the teacher agent in our context). In these settings, the agent also takes in the goal $g$ as input at every time step in addition to the observation $o_t$. All the RL algorithms used in a standard MDP above can also be used for this goal-conditioned MDP. The goal $g$ in our experiments is a natural language instruction.

Deep Q-Networks: The action space is discrete in our experiments. We thus chose the D3QN algorithm Huang et al. (2018) for all the experiments. It consists of the ‘double’ van Hasselt et al. (2015) and ‘dueling’ Wang et al. (2015) techniques added to the DQN algorithm Mnih et al. (2015). Note that both the teacher and student agents have the same action space in our setting. They primarily differ in their reward structure (which we design to be adversarial in nature). The input to the student agent also includes the natural language instruction in addition to the usual input observation.

A typical DQN loss Mnih et al. (2015) is given by:

$$L_i(\theta_i) = \mathbb{E}_{s,a,r,s'}\left[\left(y_i^{DQN} - Q(s, a; \theta_i)\right)^2\right] \tag{1}$$

where $y_i^{DQN} = r + \gamma \max_{a'} Q(s', a'; \theta^-)$ is the temporal difference (TD) target and $\theta^-$ represents the parameters of a fixed target network. To avoid the problem of overestimation of Q-values, double DQN van Hasselt et al. (2015) uses the following TD target:

$$y_i^{DDQN} = r + \gamma Q\left(s', \underset{a'}{\arg\max}\, Q(s', a'; \theta_i); \theta^-\right) \tag{2}$$
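
The difference between the two TD targets can be made concrete with a minimal sketch. The function names and the `done` handling (dropping the bootstrap term at terminal transitions, which is common practice but not stated in the text) are assumptions made for illustration.

```python
from typing import Sequence


def dqn_target(r: float, gamma: float, q_target_next: Sequence[float],
               done: bool = False) -> float:
    """DQN target: r + gamma * max_a' Q(s', a'; theta^-)."""
    if done:  # assumption: no bootstrapping from terminal states
        return r
    return r + gamma * max(q_target_next)


def double_dqn_target(r: float, gamma: float,
                      q_online_next: Sequence[float],
                      q_target_next: Sequence[float],
                      done: bool = False) -> float:
    """Double DQN target: the online network (theta_i) selects the action,
    the target network (theta^-) evaluates it."""
    if done:  # assumption: no bootstrapping from terminal states
        return r
    best = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return r + gamma * q_target_next[best]


# Hypothetical Q-values for the next state under both networks.
q_online = [1.0, 2.0, 0.5]
q_target = [3.0, 1.0, 0.0]
y_dqn = dqn_target(1.0, 0.9, q_target)
y_ddqn = double_dqn_target(1.0, 0.9, q_online, q_target)
```

In this example the plain DQN target bootstraps from the largest target-network value, while the double DQN target uses the target-network value of the action preferred by the online network, which is smaller here.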

In dueling DQN Wang et al. (2015), two network heads are used, one for the value function $V(s; \theta, \beta)$ and one for the advantage function $A(s, a; \theta, \alpha)$. Note that $\theta$ are the shared parameters. The Q-function is then estimated as:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left(A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a'; \theta, \alpha)\right) \tag{3}$$
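
The aggregation in Equation 3 can be sketched in a few lines, taking the outputs of the two heads as plain numbers. The function name `dueling_q_values` and the sample values are illustrative assumptions.

```python
from typing import List, Sequence


def dueling_q_values(value: float, advantages: Sequence[float]) -> List[float]:
    """Combine a scalar state value with per-action advantages.

    The mean advantage is subtracted so that the Q-values average to `value`.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [value + (a - mean_adv) for a in advantages]


q = dueling_q_values(2.0, [1.0, 3.0])
```

Subtracting the mean advantage leaves the ranking of actions unchanged while pinning the average Q-value to the value head's output.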

In D3QN Huang et al. (2018), the double and dueling techniques are combined; the resulting algorithm has been proven to work well on a wide range of environments with discrete action spaces. We thus use it as the base agent in all our experiments. Both the teacher and student agents are instances of the D3QN actor class. We also use the frame stacking technique Mnih et al. (2015) to assist in learning.

4 Proposed Algorithm

The final objective is to have an RL agent capable of following the natural language instructions in a simulated environment with sparse reward. In our proposed framework Grounded Language Instruction through DEmonstration in RL (GLIDE-RL), this agent is designated as the Student agent.

We have three types of agents: Teacher, Instructor and Student. The teacher and student agents are trained in an adversarial setup. While the student is a goal-conditioned agent that aims to complete the tasks provided to it as natural language instructions, teachers are trained to propose tasks/goals, by acting in the environment, that the student agent cannot achieve. This results in teachers providing a curriculum of incrementally harder goals for the student agent to train on. We train multiple teacher agents to assist in better generalisation of the student agent by proposing diverse goals. Note that the Teacher agents by themselves are not capable of describing what they have done or of instructing the student. It is the role of the instructor agent to describe what the teacher has done in natural language and then convert it to a form of instruction for the student agent to act and train upon. We formalize our problem setup as follows:
Given: A student agent $S$, a set of teacher agents $\{T_1, T_2, \cdots, T_N\}$ and an instructor agent $I$
To-do: Learn an optimal goal-conditioned policy $\pi_S$ for the student agent that can follow natural language instructions generated by $I$ by observing the evolving teacher policies $\{\pi_{T_1}, \pi_{T_2}, \cdots, \pi_{T_N}\}$
Assumptions: We make the following assumptions: (1) all the teacher agents start from scratch with a random policy and only learn from feedback (reward) related to the student’s performance; (2) the instructor agent is capable of describing the actions of the teacher in natural language and is equipped with a pre-trained LLM to convert these descriptions into several synonymous instructions.

GLIDE-RL is detailed in Algorithm 1 and illustrated in Figure 1. We denote the $N$ teacher agents as $T_1, T_2, \ldots, T_N$ and the student agent as $S$. We represent the parameters of the teachers’ networks with $\theta_{T_1}, \cdots, \theta_{T_N}$ and those of the student agent with $\theta_S$. We denote the parameters of the language model with $\phi_L$ and those of the instructor with $\phi_I$.

In every student-teacher rollout, one of the teachers $T_i$ acts in the environment by choosing an action $a_t$ according to its policy $\pi_{T_i}$ based on the current observation $o_t$ until the end of the episode (of predefined length). The state-action ($s_t$, $a_t$) pairs that the teacher encountered in its trajectory are then used by the Instructor to describe in natural language the course of events that the teacher has triggered. In one rollout, the teacher could trigger multiple events $E_i$. The instructor first describes these events $E_i$ in natural language (e.g., "you are standing in-front of red ball") and then converts this description to the form of an instruction (e.g., "go to the red ball").

The instructor then uses a pre-trained language model $\phi_I$ to generate $m$ synonymous instructions $I_{i1}, I_{i2}, \ldots, I_{im}$. The first subscript $i$ indicates that the instruction corresponds to event $i$, whereas the second subscript $j \in \{1, 2, \ldots, m\}$ denotes the $j^{th}$ synonymous instruction for the same event $i$. These events then become the goals for the student agent. Events are fed to the student agent one at a time, in the exact order $E_1, E_2, \ldots, E_n$ in which the teacher triggered them, in the form of natural language instructions. Thus, in addition to its current observation $o_t$, the goal-conditioned student agent also takes in a randomly sampled task/goal $I_{ij}$, starting with $i = 1$ and $j \in \{1, \ldots, m\}$. The language model $\phi_L$ transforms the natural language instruction into an embedding (a tensor), which is then concatenated with the input observation. This combined input representation is then passed through a deep neural network to obtain the Q-values for every action. The action corresponding to the maximum Q-value is chosen. At each time step, the student gets 0 reward if it doesn’t finish the task/goal $E_i$. If it finishes, it gets a positive reward (this value is a tuned hyperparameter) and, from the next time step, it takes in task/goal $E_{i+1}$ as input in addition to the observation $o_t$ and strives to achieve the goal $E_{i+1}$.
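
The student's forward pass described in this paragraph (concatenate the instruction embedding with the observation, compute Q-values, act greedily) can be sketched as follows. This is a toy illustration only: `toy_q_network` is a hypothetical single linear layer standing in for the deep network, and the embedding is a hand-written vector rather than the output of a real language model.

```python
from typing import List, Sequence


def build_student_input(observation: Sequence[float],
                        instruction_embedding: Sequence[float]) -> List[float]:
    """Concatenate the observation with the instruction embedding."""
    return list(observation) + list(instruction_embedding)


def toy_q_network(x: Sequence[float],
                  weights: Sequence[Sequence[float]]) -> List[float]:
    """Hypothetical stand-in for the deep network: one weight row per action."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def greedy_action(q_values: Sequence[float]) -> int:
    """Index of the largest Q-value (ties go to the lowest index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


obs = [1.0, 0.0]            # placeholder observation features
embedding = [0.5]           # placeholder instruction embedding
x = build_student_input(obs, embedding)
q_values = toy_q_network(x, [[1.0, 0.0, 0.0], [0.0, 0.0, 4.0]])
action = greedy_action(q_values)
```

Because the embedding is part of the input, changing the instruction changes the Q-values even when the observation is identical, which is what makes the policy goal-conditioned.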

The student continues to act in the environment until it finishes all the goals $E_1, E_2, \ldots, E_n$ or until the maximum episode length. Note that the number of events $n$ a teacher will trigger can vary across rollouts. An important implementation detail to note is that the done flag is marked as true for the Student after it finishes each individual event $E_i$. During the training process, this ensures that the Student is rewarded to focus only on the immediate goal given its current observation and the goal it is conditioned on.
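
The per-event done flag can be illustrated with a short sketch that labels each student step with the active goal, the reward and the done flag. The function name, the tuple layout and the default reward of 1.0 are assumptions made for illustration; the paper treats the success reward as a tuned hyperparameter.

```python
from typing import List, Sequence, Tuple


def label_student_steps(goals: Sequence[str],
                        reached_at_step: Sequence[bool],
                        success_reward: float = 1.0
                        ) -> List[Tuple[str, float, bool]]:
    """Attach (goal, reward, done) to each student step.

    reached_at_step[t] says whether the goal active at step t was
    triggered at step t. The done flag is set on every completed event,
    not only at the end of the episode.
    """
    labelled: List[Tuple[str, float, bool]] = []
    i = 0  # index of the currently active goal
    for reached in reached_at_step:
        if i >= len(goals):
            break  # all goals finished before the episode ended
        if reached:
            labelled.append((goals[i], success_reward, True))
            i += 1  # the next step is conditioned on the next goal
        else:
            labelled.append((goals[i], 0.0, False))
    return labelled


steps = label_student_steps(["a", "b"], [False, True, False, True, False])
```

Marking done at each completed event cuts the bootstrapped return at that point, so the value learned for one goal does not include rewards from later goals.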

We then update both the student and teacher networks using the standard D3QN loss described earlier in Equations 2 and 3 of Section 3. We denote this loss by $\mathcal{L}_{D3QN}$. We also employ a Behaviour Cloning (BC) loss to train the student, which helps tackle the sparse-reward nature of the problem by exploiting the teachers’ behaviours, similar to OpenAI et al. (2021). The BC loss we use is defined as follows:

$$\mathcal{L}_{BC} = -\mathbb{E}_{(s_t, g_t, a_t) \sim D_{BC}}\left[\log \frac{e^{\pi_S(a_t \mid s_t, g_t)}}{\sum_i e^{\pi_S(i \mid s_t, g_t)}}\right] \tag{4}$$

Where $s_t$, $a_t$ and $g_t$ are the teacher’s state, the action it took in that state, and the first event it triggered after performing the action, respectively; $\pi_S$ is the student’s policy.
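
Equation 4 has the form of a cross-entropy between a softmax over the student's per-action outputs and the teacher's action. A minimal sketch follows, under the assumption that $\pi_S(\cdot \mid s_t, g_t)$ can be read as a vector of per-action scores (for a Q-learning student these could be Q-values); the function names and the batch layout are hypothetical.

```python
import math
from typing import Sequence, Tuple


def log_softmax_at(scores: Sequence[float], index: int) -> float:
    """log( exp(scores[index]) / sum_i exp(scores[i]) ), computed stably."""
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[index] - log_norm


def bc_loss(batch: Sequence[Tuple[Sequence[float], int]]) -> float:
    """Mean negative log-softmax of the teacher's action.

    Each batch item is (student scores for (s_t, g_t), teacher action a_t).
    """
    return -sum(log_softmax_at(scores, a) for scores, a in batch) / len(batch)


uniform = bc_loss([([0.0, 0.0], 0)])      # student is indifferent
agrees = bc_loss([([5.0, 0.0], 0)])       # student prefers the teacher's action
disagrees = bc_loss([([0.0, 5.0], 0)])    # student prefers another action
```

The loss is smallest when the student's scores already favour the teacher's action and grows as the student prefers other actions.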

$D_{BC}$ is the behavioural cloning replay buffer, constructed to help the Student learn to trigger the events that it failed to trigger during the roll-out. Therefore, we only insert those (event, state, action) tuples whose events the student was not able to trigger, thereby not confusing the student about previously learnt goals.
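
The selective insertion rule can be sketched as below. The data layout (a mapping from each event to the teacher's state-action pairs leading up to it) and all names are assumptions made for illustration; the paper does not specify how the buffer is implemented.

```python
from collections import deque
from typing import Deque, Dict, List, Set, Tuple

State = Tuple[float, ...]
BCTuple = Tuple[str, State, int]  # (event, state, action)


def add_failed_events(buffer: Deque[BCTuple],
                      teacher_segments: Dict[str, List[Tuple[State, int]]],
                      reached_events: Set[str]) -> int:
    """Store teacher (state, action) pairs only for events the student missed.

    Returns the number of tuples inserted.
    """
    inserted = 0
    for event, pairs in teacher_segments.items():
        if event in reached_events:
            continue  # the student triggered this event, so nothing to clone
        for state, action in pairs:
            buffer.append((event, state, action))
            inserted += 1
    return inserted


bc_buffer: Deque[BCTuple] = deque(maxlen=100)
segments = {
    "go to the red ball": [((0.0,), 1), ((1.0,), 2)],
    "open the door": [((2.0,), 0)],
}
count = add_failed_events(bc_buffer, segments, {"go to the red ball"})
```

Here only the teacher's data for "open the door" enters the buffer, because the student already reached the red ball in this roll-out.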

Therefore, the overall loss function for the student becomes:

$$\mathcal{L} = \mathcal{L}_{D3QN} + \Gamma \mathcal{L}_{BC} \tag{5}$$

Where $\Gamma$ is the adaptive behavioural loss coefficient, which is calculated as follows on every update:

$$\Gamma_{t+1} = \Gamma_t + (\alpha \mathcal{L}_{RL} - \mathcal{L}_{BC})\,\epsilon \tag{6}$$

Where $\alpha$ is a predefined constant called the BCL ratio and $\epsilon$ is the predefined decay rate for $\Gamma$. Every teacher and student agent uses its own rollout data (experience) to update its respective parameters.
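
A worked numeric example of Equations 5 and 6 may help. The sketch below assumes that $\mathcal{L}_{RL}$ in Equation 6 refers to the student's RL (D3QN) loss; the function names and all numeric values are placeholders, not values from the paper.

```python
def student_loss(d3qn_loss: float, bc_loss: float, coeff: float) -> float:
    """Equation 5: combined loss with BC coefficient `coeff` (Gamma)."""
    return d3qn_loss + coeff * bc_loss


def update_bc_coefficient(coeff: float, rl_loss: float, bc_loss: float,
                          alpha: float, epsilon: float) -> float:
    """Equation 6: Gamma_{t+1} = Gamma_t + (alpha * L_RL - L_BC) * epsilon."""
    return coeff + (alpha * rl_loss - bc_loss) * epsilon


# Placeholder values: alpha = 0.5, epsilon = 0.1, Gamma_t = 1.0.
balanced = update_bc_coefficient(1.0, rl_loss=2.0, bc_loss=1.0, alpha=0.5, epsilon=0.1)
bc_heavy = update_bc_coefficient(1.0, rl_loss=2.0, bc_loss=3.0, alpha=0.5, epsilon=0.1)
total = student_loss(d3qn_loss=2.0, bc_loss=3.0, coeff=bc_heavy)
```

For positive $\epsilon$, the update leaves $\Gamma$ unchanged when $\mathcal{L}_{BC} = \alpha \mathcal{L}_{RL}$, lowers it when the BC loss is larger than that, and raises it otherwise.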

The key advantages of this training setup are two-fold: (1) we know by construction that the events triggered by a teacher are reachable by the student from the starting state within the episode length, and (2) the teacher provides a valid demonstration of how the events can be reached, which can be leveraged using the Behavioural Cloning loss for goals that the student fails to reach.

The advantage of using multiple teachers is that these teachers learn diverse policies due to the way we structured the reward function: a teacher gets a negative (positive) reward if the student is able (unable) to reach the goal. This implies that the teachers are incentivized to set goals different from each other, because the student agent is being trained to reach the goals already set by other teachers and can reach the same goal again if the current teacher sets it. Therefore, each teacher’s demonstrations are intrinsically influenced by other teachers’ demonstrations.

Environment configuration: While training the agents, we reset the environment to a random configuration at the beginning of every roll-out. Within a roll-out, the environment configuration remains the same. This ensures that the teacher and student have the same initial conditions, which enables the teacher to set a meaningful goal and the student to learn useful behaviors from cloning the teacher’s trajectory.

Data: $N$  // Number of teacher agents
Data: $\theta_{T_1}, \cdots, \theta_{T_N}, \theta_S$  // Parameters for the agents
Data: $\phi_L, \phi_I$  // Parameters for Language model and Instructor

for rollout = $1, 2, \cdots$ do
    k = rollout // N
    teacher = $k^{th}$ teacher
    goal = []
    // run teacher
    for episode length do
        state = teacher.act(state)
        event = Instructor.describe()
        if event != "" then
            goal.append(event)
    // run student
    for episode length do
        g = language_model(goal[0])
        state = student.act(state, g)
        if Instructor.reached(goal[0]) then
            goal.pop(0)
    Distribute rewards
    Update agents using RL and BC losses

Algorithm 1: GLIDE-RL Algorithm
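The two phases of a roll-out in Algorithm 1 can be sketched in Python as follows. This is a toy illustration rather than the implementation used in the experiments: `ToyInstructor`, `RandomWalker`, the 1-D integer state, and the placeholder `embed` function are all hypothetical stand-ins. The sketch also assumes that teachers take turns in round-robin order, and it adds a guard for an empty goal list; reward distribution and the RL/BC updates are left out.

```python
import random


class ToyInstructor:
    """Hypothetical instructor: emits an event when the state is a non-zero multiple of 3."""

    def describe(self, state: int) -> str:
        return f"reach cell {state}" if state != 0 and state % 3 == 0 else ""

    def reached(self, state: int, event: str) -> bool:
        return self.describe(state) == event


class RandomWalker:
    """Hypothetical agent: a 1-D random walk standing in for a learned policy."""

    def __init__(self, seed: int):
        self.rng = random.Random(seed)

    def act(self, state: int, goal_embedding=None) -> int:
        return state + self.rng.choice([-1, 1])


def run_rollout(rollout, teachers, student, instructor, embed, episode_length, start=0):
    # Assumption: teachers take turns (round-robin selection).
    teacher = teachers[rollout % len(teachers)]

    # Teacher phase: every event the teacher triggers is appended to the goal list.
    goals, state = [], start
    for _ in range(episode_length):
        state = teacher.act(state)
        event = instructor.describe(state)
        if event != "":
            goals.append(event)

    # Student phase: same start state; goals must be reached in order.
    remaining, state = list(goals), start
    for _ in range(episode_length):
        if not remaining:  # guard added for this sketch: nothing left to reach
            break
        g = embed(remaining[0])
        state = student.act(state, g)
        if instructor.reached(state, remaining[0]):
            remaining.pop(0)

    # Reward distribution and the RL/BC updates would follow here.
    return goals, len(goals) - len(remaining)


teachers = [RandomWalker(seed) for seed in range(4)]
student = RandomWalker(seed=99)
goals, n_reached = run_rollout(
    rollout=1, teachers=teachers, student=student, instructor=ToyInstructor(),
    embed=lambda text: [float(len(text))],  # placeholder for a language model
    episode_length=115,
)
```

Resetting the student to the same `start` state as the teacher mirrors the requirement that both agents share the initial conditions within a roll-out.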

Reward structure: The teacher gets a reward of $+y$ if the student fails to trigger the event $E_i$ and a reward of $-x$ if the student reaches the event. Additionally, the teacher gets a reward of $-C$ if it fails to trigger any event throughout the episode. The student, on the other hand, gets a reward of $+z$ if it triggers the goal event and $0$ if it fails to do so. We experimented with different reward values for each $n$-teacher setting, where $n \in \{1, 2, 4\}$. More details are shared in Section 5. Note that, while the reward structure is adversarial in nature, it is not a zero-sum setup.

$$R_T = \begin{cases} -x & \text{if student reaches } E_i \\ +y & \text{if student does not reach } E_i \\ -C & \text{if the teacher doesn't trigger any event throughout the episode} \end{cases}$$

$$R_S = \begin{cases} +z & \text{if student reaches } E_i \\ 0 & \text{if student does not reach } E_i \end{cases}$$

where $\{x, y, z, C\} > 0$ and $C > x, y$.
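The reward structure above can be written as two small functions. This is a minimal sketch for a single goal event $E_i$; the function and argument names are hypothetical, and the numeric values are the ones listed in Table 1 for the 4-teacher run.

```python
# Reward magnitudes from Table 1 (4-teacher run).
X, Y, Z, C = 2.0, 6.0, 3.0, 8.0
assert min(X, Y, Z, C) > 0 and C > X and C > Y  # constraints stated in the text


def teacher_reward(teacher_triggered_event: bool, student_reached: bool) -> float:
    """R_T for one goal event."""
    if not teacher_triggered_event:
        return -C  # the teacher triggered no event during the episode
    return -X if student_reached else Y


def student_reward(student_reached: bool) -> float:
    """R_S for one goal event."""
    return Z if student_reached else 0.0
```

With these values the two rewards sum to $+1$ when the student reaches the event and to $+6$ when it does not, which illustrates why the setup is adversarial but not zero-sum.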

Furthermore, to encourage the teacher’s exploration, we give it an additional reward for triggering new events. This reward decays with the number of times the event has been triggered:

$$R_T^{ex}(E_i) = 3 \cdot 0.97^{f_{E_i}}$$

where $f_{E_i}$ is the frequency of event $E_i$ triggered by any teacher.
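A short sketch of this decaying exploration bonus follows. It assumes that $f_{E_i}$ counts the triggers that happened before the current one (so the first trigger of an event earns the full bonus of 3); the text does not state this explicitly. The name `exploration_bonus`, the shared counter, and the example event string are hypothetical.

```python
from collections import Counter

event_counts: Counter = Counter()  # shared across all teachers


def exploration_bonus(event: str) -> float:
    """Return 3 * 0.97 ** f for the event, then record the new trigger."""
    bonus = 3.0 * 0.97 ** event_counts[event]
    event_counts[event] += 1
    return bonus


first = exploration_bonus("open the red door")   # f = 0
second = exploration_bonus("open the red door")  # f = 1
```

Because the counter is shared, an event that one teacher has triggered often is also worth less to every other teacher.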

5 Experimental Results

5.1 Experiment Setting - BabyAI

Environment: We use the ’BossLevel’ environment, the most complex environment within the BabyAI suite Chevalier-Boisvert et al. (2018), for all our experiments. As shown in Figure 2, the environment consists of 9 rooms, each of size 6x6. There are different types of objects: {ball, box, key, door}. Each object can be of any color: {red, green, blue, grey, purple, yellow}. At the beginning of every episode, the agent and all objects are spawned in random positions.

Figure 2: BabyAI BossLevel environment

The observation to the agent is a $(7 \times 7 \times 3)$ image. This image encodes the different objects, their colours and states in different layers. The action is discrete and is one of the following: (turn left, turn right, move forward, pickup, drop, toggle, done).

Test set: To generate the test set, we let a random agent act in the environment for 1000 timesteps (as compared to the student’s episode length of only 115 timesteps). We run it for 100 episodes and store all the events triggered by this random agent. We also store the environment’s initial states (including the agent’s initial positions). While testing, the environment is initialized with these initial states and the goal-conditioned student agent is instructed to trigger these events in the exact same order as the random agent in all the 100 episodes. One test ’goal’ includes multiple events (tasks), and the student’s episode is considered successful only if it triggers all these events in the exact same sequence in which the random agent triggered them.
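The success criterion described above can be sketched as an ordered-completion check. The sketch assumes, in line with Algorithm 1, that only the current head of the goal list is checked at each step, so unrelated events triggered in between are ignored; the function names and the event strings are hypothetical.

```python
from typing import Iterable, Sequence


def episode_succeeded(goal_events: Sequence[str], triggered: Iterable[str]) -> bool:
    """True if the student triggers every goal event in the stored order."""
    pending = list(goal_events)
    for event in triggered:
        if pending and event == pending[0]:
            pending.pop(0)
    return not pending


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of successful test episodes."""
    return sum(outcomes) / len(outcomes)


outcomes = [
    episode_succeeded(["open door", "pick up key"], ["open door", "pick up key"]),
    episode_succeeded(["open door", "pick up key"], ["pick up key", "open door"]),
]
rate = success_rate(outcomes)
```

In the second episode the events occur in the wrong order, so only one of the two toy episodes counts as a success.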

Instructor: To enable better generalization of the student agent, we convert each event description to an instruction and generate a set of 50 synonymous instructions for each event. An example of a synonymous instruction for the event "standing infront of the red ball" (described by the environment) can be "lift up the crimson ball". During training, we randomly sample an instruction from the corresponding synonymous instruction set for each event triggered by a teacher agent. For the experiments that test the generalization capabilities of the agent, we keep 5 instructions from each synonymous instruction set in a holdout test set. These instructions are not used during training and are only used as part of the test set.
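The holdout split and the per-event sampling can be sketched as below. How the 5 held-out instructions are chosen is not specified in the text, so the random split is an assumption, as are the function names and the placeholder instruction strings.

```python
import random


def split_synonyms(synonyms, n_holdout=5, seed=0):
    """Split one event's synonym set into (train, holdout) lists."""
    rng = random.Random(seed)
    shuffled = list(synonyms)
    rng.shuffle(shuffled)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def sample_instruction(train_synonyms, rng):
    """Pick one training instruction for an event triggered by a teacher."""
    return rng.choice(train_synonyms)


# Placeholder strings standing in for the 50 synonymous instructions of one event.
synonyms = [f"instruction variant {i}" for i in range(50)]
train, holdout = split_synonyms(synonyms)
instruction = sample_instruction(train, random.Random(1))
```

Keeping the split per event ensures that every event still has 45 training phrasings while its 5 held-out phrasings never reach the student during training.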

To convert these natural language instructions into language embeddings, we use off-the-shelf language models. We experimented with the all-distilroberta-v1, all-mpnet-base-v2 and multi-qa-mpnet-base-dot-v1 models provided by the sentence_transformers Python package. We noticed that all-distilroberta-v1 performed consistently better than the other language models (as observed in hyperparameter sweeps) and used it for all the experimental results reported in this paper.
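For readers unfamiliar with the package, the embedding step might look like the following sketch. The helper names `load_encoder` and `embed_instructions` are hypothetical; `SentenceTransformer(...)` and its `encode` method are the package's standard entry points. The import is done lazily inside `load_encoder` so that the helper functions can be defined without the package installed.

```python
def load_encoder(model_name: str = "all-distilroberta-v1"):
    """Load a pretrained sentence encoder (downloads the model on first use)."""
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_name)


def embed_instructions(encoder, instructions):
    """Return one embedding per instruction; `encoder` only needs an `encode` method."""
    return encoder.encode(list(instructions))


# Example usage (commented out because it requires the package and a model download):
# encoder = load_encoder()
# goal_embeddings = embed_instructions(encoder, ["lift up the crimson ball"])
```

Passing the encoder in as an argument makes it easy to swap in one of the other models that were tried, such as all-mpnet-base-v2.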

Table 1: Hyperparameters used for D3QN for the 4-teacher run

| Hyperparameter | Value |
|---|---|
| BCL ratio ($\alpha$) | 0.900 |
| Frame stack | 8 |
| Gradient clip | 0.67 |
| Optimizer | Adam Kingma and Ba (2015) |
| Learning rate | $5.13 \times 10^{-5}$ |
| $\tau$ | 0.098 |
| Student reward ($z$) | 3 |
| Teacher reward ($C$) | 8 |
| Teacher reward ($y$) | 6 |
| Teacher reward ($x$) | 2 |

5.2 Results and Analysis

In this section, we empirically demonstrate the effectiveness of our proposed algorithm. The evaluation of our algorithm aims to answer the following research questions:

  • R1: How good is the performance of GLIDE-RL on events seen during training?
  • R2: How effective is our algorithm in understanding synonymous goals/instructions seen during training?
  • R3: Does the student agent learn better with increasing number of teachers?
  • R4: Is the student able to generalize to previously unseen goals?

Experiment 1: Ability to succeed on events seen during the training

In this experiment, we aim to test if the student is able to succeed on the synonymous instructions seen during training. We also attempt to understand the effect of various components like having a curriculum and behaviour cloning loss on the training.

It is important to note that while these instructions are seen during training, the actions the student agent needs to take to complete an instruction are different: the positions of the objects and the agent’s starting position are generated randomly for the test set and are not used during training. The sequences of events that the student must accomplish are also not seen during training.

Figure 3: Success rate of different variations on the test set. The plot shows the mean and standard deviation over 5 seeds

We train various agents (and establish baselines) to show the importance of teachers (and the curriculum generated by these teachers) for the training of the student:

First, we train a student conditioned on one-hot goals (onehot in Figure 3). The teachers’ functionality doesn’t change here, but instead of receiving language embeddings from a language model as input, the student receives a pre-designed one-hot encoding for each event. There is no notion of synonymous goals here; the events triggered by the teacher are directly converted to a one-hot encoding and sent to the student. This baseline gives us an estimate of the upper bound on the achievable success rate. It also establishes that not all event sequences in the test set, which took the random agent 1000 timesteps to produce, can be completed in just 115 timesteps. Note that we allow the student agent only 115 timesteps to ensure that the task remains challenging. Also note that, with one-hot encodings, the agent has no generalization capability, as the size of the encoding cannot be increased to accommodate unseen goals.

With the upper bound on the performance (on the test set) established, we now train the student conditioned on the embeddings from the language model (4-teachers in Figure 3). Here, we present the student with the synonymous events during training, as described before. The aim is to gauge how well the student can perform on the test set compared to when it is trained with one-hot encodings.

To show the importance of the curriculum generated by the teachers, we train a student with random teacher agents (4-random-teachers in Figure 3), and another one-hot student trained directly on the test set without any teachers or curriculum (onehot-testset in Figure 3). The random teachers don’t learn adversarially with the student and hence provide no curriculum, while the onehot-testset baseline has no notion of teachers at all. We introduced the onehot-testset baseline to understand how challenging the task is without a curriculum set by the teachers.

Furthermore, to understand the necessity of the Behavioral Cloning Loss (BCL), we train another baseline (no-bcl in Figure 3). We train this student in exactly the same manner as GLIDE-RL, except that the BCL is not used during training.

Figure 3 shows the importance of teachers (and their curriculum) for the student’s performance: students trained without the teachers’ curriculum fail to perform well (measured in terms of success rate), even when trained directly on the test set. Furthermore, the student conditioned on language-embedding goals performs comparably to the one conditioned on one-hot goals. Moreover, the no-bcl baseline establishes the importance of the behavioural cloning loss during training: without BCL, the student’s performance is only as good as in the scenario with random teachers (no curriculum).

Experiment 2: Ability to generalise on synonyms seen during training

Figure 4: Q values of the synonymous events vs distance from the original event. Each blue dot (•) represents one synonymous event for the original event. The red line represents the trend among the points.

Here, we try to reason why the student might be able to accomplish some of the instruction synonyms better than others. We hypothesise that some synonyms of different instructions (corresponding to different events) overlap in the embedding space, thus making it difficult for the student to generalise on those events.

To test this hypothesis, we run the trained Student agent on the test set and for each event in the test set, just before the student triggers the event, we check the maximum Q value in that state corresponding to each of the synonyms. Finally, we average the Q value for each synonym over the occurrence of the event.

We notice an interesting, yet expected, pattern when we plot the Q values of the various synonyms for an event against their distance from that event in the embedding space (Figure 4). As the distance of a synonymous instruction from the original instruction increases, the Q value decreases. The original instruction should be close to the centre of the synonym cluster in the embedding space, because all the synonyms are written using it as the root event.
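The analysis behind this plot can be sketched as follows. The distance metric and the way the trend line was fitted are not stated in the text, so Euclidean distance and an ordinary least-squares slope are assumptions here; the 2-D embeddings and the averaged Q values are made-up numbers for illustration only.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def trend_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var


# Made-up embeddings and averaged Q values, for illustration only.
original = [0.0, 0.0]
synonyms = {"syn_a": [0.1, 0.0], "syn_b": [0.3, 0.4], "syn_c": [0.6, 0.8]}
mean_q = {"syn_a": 0.9, "syn_b": 0.7, "syn_c": 0.4}

distances = [euclidean(original, synonyms[name]) for name in synonyms]
slope = trend_slope(distances, [mean_q[name] for name in synonyms])
```

A negative `slope` corresponds to the downward trend described above: synonyms farther from the original instruction receive lower Q values.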

Experiment 3: Effect of varying the number of teachers

Figure 5: Success rate on the test set with varying number of teachers. The plot shows mean and standard deviation over 5 seeds

In this experiment, we try to understand how changing the number of teachers affects the success rate of the student. As in the previous experiments, each curve in Figure 5 is averaged over 5 seeds. We notice that the success rate on the test set increases as we increase the number of teachers. We believe that this is because of the diversity of the goals generated by different teachers, as established by Kharyal et al. (2023) in a similar experimental setup, but with a simpler goal representation (Euclidean coordinates) and on much simpler environments. Experimenting with more teachers is an avenue for future work. Given the complexity of the environment, and considering that the 4-teacher scenario is only slightly better than the 2-teacher scenario, we hypothesize that the success rate would increase only marginally as the number of teachers grows further.

Experiment 4: Ability to generalise on previously unseen goals

To test the ability of the student agent to generalise to instructions unseen during training, we construct a holdout set of 5 synonymous instructions for each event and hold them out of the training process. Then, we test the student agent on goals constructed purely from sequences of events described by these holdout instructions, denoted as 4-teacher-holdout in Figure 6. We compare the performance of this agent with an agent that was trained on the complete instruction set (including the holdout test set), denoted as 4-teacher. Both student agents were trained with four teachers (the best-performing setup, as noted in Figure 5).

Figure 6: Success rate of the student on holdout-test-set

We can observe in Figure 6 that the 4-teacher-holdout variant slightly lags behind the agent that was trained on the complete instruction set (4-teacher) throughout the training process. However, it is important to note that a 40 percent success rate is still a remarkable performance given that the agent has never seen any of (1) the sequences of events, (2) the language instructions, or (3) the positions of the agent or objects in the environment during its training process.

6 Conclusion and Future Work

In this work, we proposed a novel algorithm and framework for training RL agents grounded in natural language and demonstrated its ability to generalize to unseen language instructions. We have also shown the impact of the language model and the number of teachers on the performance of the student agent. We would like to extend this study to include the training of the instructor agent as well, in more complex environments. Another possible direction is using actual humans as instructors for the proposed framework. We would also like to explore the different kinds of skills that the teacher agents learn in the process of challenging the student agent.

References

  • Cirik et al. [2020] Volkan Cirik, Taylor Berg-Kirkpatrick, and Louis-Philippe Morency. Refer360°: A referring expression recognition dataset in 360° images. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7189–7202, Online, July 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.644. URL https://aclanthology.org/2020.acl-main.644.
  • Schaul et al. [2015] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International conference on machine learning, pages 1312–1320. PMLR, 2015.
  • Liu et al. [2022] Minghuan Liu, Menghui Zhu, and Weinan Zhang. Goal-conditioned reinforcement learning: Problems and solutions. In Luc De Raedt, editor, Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022, pages 5502–5511, 2022.
  • Narvekar et al. [2020] Sanmit Narvekar, Bei Peng, Matteo Leonetti, Jivko Sinapov, Matthew E. Taylor, and Peter Stone. Curriculum learning for reinforcement learning domains: A framework and survey. Journal of Machine Learning Research, 21(181):1–50, 2020. URL http://jmlr.org/papers/v21/20-212.html.
  • Kuo et al. [2023] Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, Claire Cui, and Anelia Angelova. Mammut: A simple architecture for joint learning for multimodal tasks, 2023.
  • Xu et al. [2023] Haiyang Xu, Qinghao Ye, Ming Yan, Yaya Shi, Jiabo Ye, Yuanhong Xu, Chenliang Li, Bin Bi, Qi Qian, Wei Wang, Guohai Xu, Ji Zhang, Songfang Huang, Fei Huang, and Jingren Zhou. mplug-2: A modularized multi-modal foundation model across text, image and video, 2023.
  • Wang et al. [2023] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models, 2023.
  • Du et al. [2023] Yuqing Du, Olivia Watkins, Zihan Wang, Cédric Colas, Trevor Darrell, Pieter Abbeel, Abhishek Gupta, and Jacob Andreas. Guiding pretraining in reinforcement learning with large language models, 2023.
  • Colas et al. [2022] Cédric Colas, Tristan Karch, Olivier Sigaud, and Pierre-Yves Oudeyer. Autotelic agents with intrinsically motivated goal-conditioned reinforcement learning: a short survey. Journal of Artificial Intelligence Research, 74:1159–1199, 2022.
  • Andrychowicz et al. [2017] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. Advances in neural information processing systems, 30, 2017.
  • Florensa et al. [2018] Carlos Florensa, David Held, Xinyang Geng, and Pieter Abbeel. Automatic goal generation for reinforcement learning agents. In International conference on machine learning, pages 1515–1528. PMLR, 2018.
  • Campero et al. [2021] Andres Campero, Roberta Raileanu, Heinrich Küttler, Joshua B Tenenbaum, Tim Rocktäschel, and Edward Grefenstette. Learning with amigo: Adversarially motivated intrinsic goals. ICLR, 2021.
  • Sukhbaatar et al. [2018] Sainbayar Sukhbaatar, Zeming Lin, Ilya Kostrikov, Gabriel Synnaeve, Arthur Szlam, and Rob Fergus. Intrinsic motivation and automatic curricula via asymmetric self-play. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.
  • OpenAI et al. [2021] OpenAI OpenAI, Matthias Plappert, Raul Sampedro, Tao Xu, Ilge Akkaya, Vineet Kosaraju, Peter Welinder, Ruben D’Sa, Arthur Petron, Henrique P. d. O. Pinto, Alex Paino, Hyeonwoo Noh, Lilian Weng, Qiming Yuan, Casey Chu, and Wojciech Zaremba. Asymmetric self-play for automatic goal discovery in robotic manipulation, 2021.
  • Du et al. [2022] Yuqing Du, Pieter Abbeel, and Aditya Grover. It takes four to tango: Multiagent selfplay for automatic curriculum generation. In 10th International Conference on Learning Representations, ICLR 2022, pages 1515–1528, 2022.
  • Kharyal et al. [2023] Chaitanya Kharyal, Tanmay Sinha, Sai Krishna Gottipati, Fatemeh Abdollahi, Srijita Das, and Matthew E Taylor. Do as you teach: A multi-teacher approach to self-play in deep reinforcement learning. In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems, pages 2457–2459, 2023.
  • Luketina et al. [2019] Jelena Luketina, Nantas Nardelli, Gregory Farquhar, Jakob Foerster, Jacob Andreas, Edward Grefenstette, Shimon Whiteson, and Tim Rocktäschel. A survey of reinforcement learning informed by natural language. In International Joint Conference on Artificial Intelligence (IJCAI-19), 2019.
  • Bahdanau et al. [2018] Dzmitry Bahdanau, Felix Hill, Jan Leike, Edward Hughes, Arian Hosseini, Pushmeet Kohli, and Edward Grefenstette. Learning to understand goal specifications by modelling reward. arXiv preprint arXiv:1806.01946, 2018.
  • Hermann et al. [2017] Karl Moritz Hermann, Felix Hill, Simon Green, Fumin Wang, Ryan Faulkner, Hubert Soyer, David Szepesvari, Wojciech Marian Czarnecki, Max Jaderberg, Denis Teplyashin, et al. Grounded language learning in a simulated 3d world. arXiv preprint arXiv:1706.06551, 2017.
  • Chaplot et al. [2018] Devendra Singh Chaplot, Kanthashree Mysore Sathyendra, Rama Kumar Pasumarthi, Dheeraj Rajagopal, and Ruslan Salakhutdinov. Gated-attention architectures for task-oriented language grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • Oh et al. [2017] Junhyuk Oh, Satinder Singh, Honglak Lee, and Pushmeet Kohli. Zero-shot task generalization with multi-task deep reinforcement learning. In International Conference on Machine Learning, pages 2661–2670. PMLR, 2017.
  • Goyal et al. [2019] Prasoon Goyal, Scott Niekum, and Raymond J Mooney. Using natural language for reward shaping in reinforcement learning. arXiv preprint arXiv:1903.02020, 2019.
  • Wang et al. [2019] Xin Wang, Qiuyuan Huang, Asli Celikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, and Lei Zhang. Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6629–6638, 2019.
  • Chan et al. [2019] Harris Chan, Yuhuai Wu, Jamie Kiros, Sanja Fidler, and Jimmy Ba. Actrce: Augmenting experience via teacher’s advice for multi-goal reinforcement learning. arXiv preprint arXiv:1902.04546, 2019.
  • Seita et al. [2019] Daniel Seita, David Chan, Roshan Rao, Chen Tang, Mandi Zhao, and John Canny. Zpd teaching strategies for deep reinforcement learning from demonstrations. arXiv preprint arXiv:1910.12154, 2019.
  • Huang et al. [2018] Ying Huang, GuoLiang Wei, and YongXiong Wang. V-d d3qn: the variant of double deep q-learning network with dueling architecture. In 2018 37th Chinese Control Conference (CCC), pages 9130–9135, 2018. doi:10.23919/ChiCC.2018.8483478.
  • van Hasselt et al. [2015] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. CoRR, abs/1509.06461, 2015. URL http://arxiv.longhoe.net/abs/1509.06461.
  • Wang et al. [2015] Ziyu Wang, Nando de Freitas, and Marc Lanctot. Dueling network architectures for deep reinforcement learning. CoRR, abs/1511.06581, 2015. URL http://arxiv.longhoe.net/abs/1511.06581.
  • Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Kirkeby Fidjeland, Georg Ostrovski, Stig Petersen, Charlie Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015. URL https://api.semanticscholar.org/CorpusID:205242740.
  • Chevalier-Boisvert et al. [2018] Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. Babyai: First steps towards grounded language learning with a human in the loop. CoRR, abs/1810.08272, 2018. URL http://arxiv.longhoe.net/abs/1810.08272.
  • Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.longhoe.net/abs/1412.6980.