License: arXiv.org perpetual non-exclusive license
arXiv:2401.06603v1 [cs.CL] 12 Jan 2024

Mutual Enhancement of Large Language and Reinforcement Learning Models through Bi-Directional Feedback Mechanisms: A Case Study

Shangding Gu*
Abstract

Large Language Models (LLMs) have demonstrated capabilities, such as planning and reasoning, that are valuable to reinforcement learning (RL) models. However, the problems of collaboration between LLMs and RL models still need to be solved. In this study, we employ a teacher-student learning framework to tackle these problems, specifically by offering feedback for LLMs using RL models and providing high-level information for RL models with LLMs in a cooperative multi-agent setting. Within this framework, the LLM acts as a teacher, while the RL model acts as a student. The two agents cooperatively assist each other through a process of recursive help, such as "I help you help I help." The LLM agent supplies abstract information to the RL agent, enabling efficient exploration and policy improvement. In turn, the RL agent offers feedback to the LLM agent, providing valuable, real-time information that helps generate more useful tokens. This bi-directional feedback loop promotes optimization, exploration, and mutual improvement for both agents, enabling them to accomplish increasingly challenging tasks. We propose a practical algorithm to address the problem and conduct empirical experiments to evaluate the effectiveness of our method.

keywords:
Large Language Model; Reinforcement Learning Model; Cooperative Game; Bi-Directional Feedback.

1 Introduction

Large Language Models (LLMs) openai2023gpt4; chang2023survey have showcased exceptional prowess across various domains. Notably, LLMs find applications in furnishing information for tasks like robot planning singh2023progprompt, machine translation zhang2023prompting, and medicine thirunavukarasu2023large. In parallel, Reinforcement Learning (RL) has demonstrated remarkable capabilities in various domains, including achieving human-level performance in games such as the game of Go silver2016mastering and multiplayer poker brown2019superhuman. LLMs have been increasingly incorporated to enhance the performance of RL du2023guiding; szot2023large. Likewise, RL has been employed to augment the capabilities of LLMs ouyang2022training. Nevertheless, effectively harnessing the latent potential of LLMs for solving complex tasks, through integration with powerful RL frameworks sutton2018reinforcement, remains a formidable challenge.

The research most related to our study includes the works of Carta et al. carta2023grounding and Tran et al. tran2023exploring. In the work of Carta et al. carta2023grounding, they employ LLMs as RL policies to acquire task-solving capabilities while learning new knowledge through interactive experiences. Their experimental findings suggest that their method outperforms baseline approaches. However, a potential limitation of their work is the absence of instruction feedback from RL models, which may impact the overall effectiveness of their method. In the work of Tran et al. tran2023exploring, they deploy RL to train a conversational agent using a simulator and an initial text generated by a generative chat agent. Subsequently, they input the data from the RL-trained agent to the generative chat agent. Although their experiment results indicate that their method performs better than baselines, a concern of this approach is the potential time consumption associated with RL training for multi-turn conversations, as each conversation may necessitate RL training requests. Additionally, achieving self-online learning for task execution could be challenging in this framework.

In this study, to address the above challenge, we propose a teacher-student learning framework in a cooperative game, in which RL models (students) and LLMs (teachers) are integrated through bi-directional feedback gu2023human. The two models cooperate to carry out complex tasks in a win-win collaboration: the RL model and the LLM act as two agents that complement, assist, and provide feedback to each other, ultimately solving the problem together.

2 Method

In this section, we introduce a teacher-student learning framework with bi-directional feedback, wherein a synergistic partnership between an LLM and an RL model is employed to tackle tasks collaboratively. As illustrated in Figure 1, these two models operate in tandem, with mutual support, ultimately enabling successful task completion. (Footnote 1: We use a TD error as an estimator of the advantage function in this case.)
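The footnote's TD-error estimate can be illustrated with a minimal sketch. The paper does not spell out the exact form it uses, so the one-step form below, the function name `td_advantage`, and the discount factor `gamma` are assumptions made here for illustration only.

```python
def td_advantage(reward: float, value_s: float, value_next: float,
                 gamma: float = 0.99, done: bool = False) -> float:
    """One-step TD error, r + gamma * V(s') - V(s), used as an advantage estimate.

    `gamma` and the one-step form are assumptions; the paper only states
    that a TD error is used as the estimator.
    """
    # At a terminal state there is no successor value to bootstrap from.
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s


if __name__ == "__main__":
    # Reward 1.0, V(s) = 0.5, V(s') = 2.0, discount 0.5.
    print(td_advantage(1.0, 0.5, 2.0, gamma=0.5))
```

A positive value suggests the transition went better than the current value estimate predicted, and a negative value suggests it went worse.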

LLMs (teachers) help RL models (students): While it is often challenging for LLMs to provide instructions encompassing perfect and comprehensive environmental information, LLMs can supply RL models with approximated information. Providing such rudimentary guidance by LLMs serves the purpose of streamlining the exploration process undertaken by RL models. Consequently, this streamlined exploration process yields a discernible reduction in the exploration space and the time required for RL models to ascertain and establish an optimal policy. This phenomenon underscores the potential utility of LLMs in mitigating the challenges associated with imperfect instructional input, thereby contributing to the enhanced efficiency of RL models in their quest to identify optimal policies.

RL models (students) help LLMs (teachers): During the execution of a policy within the RL framework, RL models benefit from the support provided by LLMs. In this collaborative process, RL models are not only recipients but also evaluators of the output generated by LLMs. This reciprocal interaction allows RL models to offer constructive feedback to LLMs, thereby facilitating an iterative refinement of LLMs’ performance. LLMs progressively acquire a more nuanced understanding of the underlying environments with the progression of iterations. Consequently, they become increasingly adept at furnishing improved output, which aids RL models and LLMs in executing complex tasks with greater efficacy. This iterative and symbiotic relationship between RL models and LLMs emphasizes the potential for continuous improvement and optimization in their collaborative endeavors. The corresponding practical algorithm is provided in Algorithm 1.

Figure 1: An LLM and an RL model collaboratively engage with an environment to accomplish complex tasks, facilitating bi-directional feedback throughout the process.
1:  Initial Q value $Q_{\theta_0}$, V value $V_{\phi_0}$, and advantage function value $A_0^{P} = Q_{\theta_0} - V_{\phi_0}$, state $s_0$, action $a_0$, token $x_0$.
2:  for $t = 0, 1, \dots, T$ do
3:     Conduct tasks with an RL model $P_{\theta_t}(a_t \mid x_t, s_t)$ and an LLM $M(x_t \mid s_t)$.
4:     LLM $M(x_t \mid s_t)$ provides decision information to RL model $P_{\theta_t}(a_t \mid x_t, s_t)$ by leveraging commonsense capabilities and environment information.
5:     Estimate new RL advantage $A_t^{P}{}'$ based on new Q value $Q_{\theta_t}$ and V value $V_{\phi_t}$.
6:     if $A^{P} > A_t^{P}{}'$ then
7:        Provide negative instruction feedback to LLM: the token is worse than the last one.
8:     else
9:        Provide positive instruction feedback to LLM: the token is better than the last one.
10:     end if
11:     $A^{P} = A_t^{P}{}'$.
12:  end for
Algorithm 1 Online learning to make decisions with RL and LLMs.
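The control flow of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative skeleton, not the authors' implementation: the callables `llm_hint` and `act_and_estimate` are hypothetical stand-ins for the LLM $M(x_t \mid s_t)$ and for the RL step that acts and returns the new advantage estimate, and the feedback strings are placeholders for whatever instruction format is actually used.

```python
from typing import Callable, List

NEGATIVE = "negative: the token is worse than the last one"
POSITIVE = "positive: the token is better than the last one"


def instruction_feedback(previous_advantage: float, new_advantage: float) -> str:
    """Steps 6-10: negative feedback only if the advantage strictly dropped."""
    return NEGATIVE if previous_advantage > new_advantage else POSITIVE


def run_loop(num_steps: int,
             initial_advantage: float,
             llm_hint: Callable[[int, str], str],
             act_and_estimate: Callable[[int, str], float]) -> List[str]:
    """Run t = 0, ..., num_steps and collect the feedback sent to the LLM.

    Both callables are hypothetical placeholders:
      llm_hint(t, last_feedback)   -> token/hint for the RL model (step 4)
      act_and_estimate(t, hint)    -> new advantage estimate (steps 3 and 5)
    """
    advantage = initial_advantage
    feedback = ""  # no feedback exists before the first iteration
    history: List[str] = []
    for t in range(num_steps + 1):  # the algorithm includes t = T
        hint = llm_hint(t, feedback)
        new_advantage = act_and_estimate(t, hint)
        feedback = instruction_feedback(advantage, new_advantage)
        history.append(feedback)
        advantage = new_advantage  # step 11
    return history


if __name__ == "__main__":
    # Toy stand-ins: a fixed hint and a scripted sequence of advantage values.
    scripted = [0.2, 0.1, 0.1]
    log = run_loop(2, 0.0,
                   llm_hint=lambda t, fb: "go to the red ball",
                   act_and_estimate=lambda t, hint: scripted[t])
    for line in log:
        print(line)
```

Note that, following the strict inequality in step 6, an unchanged advantage is treated as positive feedback.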

3 Experiments

Our experimental investigation is conducted within the context of the BabyAI benchmark babyaiiclr19. To facilitate our experimentation, we leverage the Lamorel framework carta2023grounding. Within the Lamorel framework, we augment LLMs with RL instruction feedback, thereby establishing a feedback loop for the enhancement of LLM performance. Specifically, we conduct experiments on the GoToRedBallNoDists-v0 task. The experiments are executed over two cases: one comprising 40 iteration steps and another spanning 2100 iteration steps. In the evaluation, we compare our proposed method with the baseline. Our method uses both RL model feedback to LLMs and LLMs' information to RL models. In contrast, the state-of-the-art baseline, the original Lamorel method, lacks such instruction feedback from RL models to LLMs.

For our experiments, we employ the "google/flan-t5-small" model chung2022scaling as the LLM, which has 80 million parameters. The experimental results are presented in Figure 2. These findings illustrate the superior performance of our method, as quantified by the performance value metric (where higher values indicate better performance). Furthermore, our method demonstrates notably faster convergence (one-shot/few-shot learning) than the Lamorel baseline. This empirical evidence highlights the effectiveness of our approach in harnessing bi-directional feedback between RL models and LLMs for improving performance in the context of the BabyAI benchmark.

Figure 2: Experiments on BabyAI tasks babyaiiclr19 with 40 (a) and 2100 (b) iteration steps.

4 Conclusion

In this study, we developed a teacher-student learning framework for unlocking LLMs’ powerful capabilities by leveraging an RL model with bi-directional feedback mechanisms in a cooperative game setting. To empirically assess the effectiveness of our method, we conducted experiments using the BabyAI benchmark as an assessment platform. The results of these experiments demonstrate the superior performance of our approach in comparison to the state-of-the-art baseline, underscoring its potential for substantially enhancing learning outcomes. Importantly, our approach could exhibit promise in cultivating safe and robust learning systems gu2022review, particularly when confronted with the inherent challenges of imperfect information environments. Furthermore, we hope our insights inspire novel avenues of research in the realms of LLMs and RL for future investigations.

References

  • (1) Noam Brown and Tuomas Sandholm. Superhuman AI for multiplayer poker. Science, 365(6456):885–890, 2019.
  • (2) Thomas Carta, Clément Romac, Thomas Wolf, Sylvain Lamprier, Olivier Sigaud, and Pierre-Yves Oudeyer. Grounding large language models in interactive environments with online reinforcement learning. In ICML, volume 202, pages 3676–3713. PMLR, 23–29 Jul 2023.
  • (3) Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Kaijie Zhu, Hao Chen, Linyi Yang, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language models. arXiv:2307.03109, 2023.
  • (4) Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. BabyAI: First steps towards grounded language learning with a human in the loop. ICLR, 2019.
  • (5) Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv:2210.11416, 2022.
  • (6) Yuqing Du, Olivia Watkins, Zihan Wang, Cédric Colas, Trevor Darrell, Pieter Abbeel, Abhishek Gupta, and Jacob Andreas. Guiding pretraining in reinforcement learning with large language models. In ICML, volume 202, pages 8657–8677. PMLR, 23–29 Jul 2023.
  • (7) Shangding Gu, Alap Kshirsagar, Yali Du, Guang Chen, Jan Peters, and Alois Knoll. A human-centered safe robot reinforcement learning framework with interactive behaviors. Frontiers in Neurorobotics, 17, 2023.
  • (8) Shangding Gu, Long Yang, Yali Du, Guang Chen, Florian Walter, Jun Wang, Yaodong Yang, and Alois Knoll. A review of safe reinforcement learning: Methods, theory and applications. arXiv:2205.10330, 2022.
  • (9) OpenAI. GPT-4 technical report, 2023.
  • (10) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. NeurIPS, 35:27730–27744, 2022.
  • (11) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • (12) Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, and Animesh Garg. Progprompt: Generating situated robot task plans using large language models. In ICRA, pages 11523–11530. IEEE, 2023.
  • (13) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • (14) Andrew Szot, Max Schwarzer, Harsh Agrawal, Bogdan Mazoure, Walter Talbott, Katherine Metcalf, Natalie Mackraz, Devon Hjelm, and Alexander Toshev. Large language models as generalizable policies for embodied tasks. arXiv:2310.17722, 2023.
  • (15) Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine. Nature Medicine, 29(8):1930–1940, 2023.
  • (16) Quoc-Dai Luong Tran and Anh-Cuong Le. Exploring bi-directional context for improved chatbot response generation using deep reinforcement learning. Applied Sciences, 13(8):5041, 2023.
  • (17) Biao Zhang, Barry Haddow, and Alexandra Birch. Prompting large language model for machine translation: A case study. In ICML, volume 202, pages 41092–41110. PMLR, 23–29 Jul 2023.