Smurfs: Leveraging Multiple Proficiency Agents with Context-Efficiency for Tool Planning

Junzhi Chen*, Juhao Liang*, Benyou Wang^🖂 (*These authors contributed equally to this work.)
Shenzhen Research Institute of Big Data
The Chinese University of Hong Kong, Shenzhen
[email protected]

Abstract

The emergence of large language models (LLMs) has opened up unprecedented possibilities for automating complex tasks, often at a level comparable to human performance. Despite their capabilities, LLMs still encounter difficulties in completing tasks that require high levels of accuracy and complexity due to their inherent limitations in handling multifaceted problems single-handedly. This paper introduces ‘Smurfs’, a cutting-edge multi-agent framework designed to revolutionize the application of LLMs. By seamlessly transforming a conventional LLM into a synergistic multi-agent ensemble, Smurfs can enhance the model’s ability to solve complex tasks at no additional cost. This is achieved through innovative prompting strategies that allocate distinct roles within the model, thereby facilitating collaboration among specialized agents and forming an intelligent multi-agent system. Our empirical investigation on both the open-ended StableToolBench task and the closed-ended HotpotQA task showcases Smurfs’ superior capability in intricate tool utilization scenarios. Notably, Smurfs outperforms all the baseline methods in both experiments, setting new state-of-the-art performance. Furthermore, through comprehensive ablation studies, we dissect the contribution of the core components of the multi-agent framework to its overall efficacy. This not only verifies the effectiveness of the framework, but also sets a route for future exploration of multi-agent LLM systems. Our code is available at https://github.com/FreedomIntelligence/Smurfs/.

🖂 Corresponding author.

Figure 1: Demonstration of the whole process of the Smurfs framework.

1 Introduction

Tool manipulation has traditionally been seen as a distinctive human characteristic, dating back approximately 2.5 million years Oakley and Museum (1972); Ambrose (2001). For large language models (LLMs), access to external tools can equip them with broader capabilities beyond their fixed language modeling knowledge. For example, the search engine API empowers ChatGPT to access real-time information Zhao et al. (2023). However, LLMs still encounter several challenges when using multiple tools to solve tasks. These challenges include effective solution planning and adaptability to new tools Hao et al. (2024); Guu et al. (2020); Qin et al. (2024).

This paper addresses the critical research problem of enhancing the problem-solving capabilities of LLMs through the adoption of a plug-and-play multi-agent system (MAS) framework Dorri et al. (2018); Van der Hoek and Wooldridge (2008). We posit that a MAS approach can significantly augment the efficacy of LLMs in handling tasks that require a high degree of precision, adaptability, and comprehensive knowledge integration.

	Pass Rate $\uparrow$ (%)	Win Rate $\uparrow$ (%)	# of Tokens per request $\downarrow$	# of Tokens per query $\downarrow$
ReACT	$44.4_{\pm 1.1}$	base	1,424	6,479
DFSDT	$55.4_{\pm 2.0}$	60.4	1,743	20,714
Smurfs (ours)	$57.4_{\pm 1.1}$	62.4	459	8,096

Table 1: Comparison of token cost and performance between tool planning methods over StableToolBench. Existing methods, ReACT and DFSDT, have limitations due to high token costs or poor performance. The results are averaged over the subtasks within StableToolBench.

To this end, we introduce ‘Smurfs’, an innovative MAS framework inspired by the collaborative and versatile nature of its namesake cartoon characters. The proposed framework is based on one principle: synergistic collaboration among specialized agents can overcome the limitations faced by individual LLMs. Each agent within the Smurfs framework is designed to perform specific sub-tasks, facilitating a more nuanced and effective approach to complex problem-solving. Our research delves into the architectural design, coordination mechanisms, and operational dynamics of integrating specialized agents into a cohesive system. The effectiveness of Smurfs is validated through both open-ended and closed-ended tool planning benchmark experiments Guo et al. (2024); Yang et al. (2018), where the proposed MAS consistently outperforms baseline methods on both benchmarks. An ablation study followed by a case study further investigates the underlying reasons for this effectiveness. These results not only establish a new state-of-the-art in the field but also offer concrete evidence of the multi-agent approach’s efficacy in enhancing LLM capabilities.

The contributions of this paper can be summarized as follows:

1. We introduce a novel plug-and-play MAS framework to enhance the tool planning capabilities of LLMs. Experiments demonstrate the effectiveness of this approach, which is also more cost-efficient compared to existing tool planning methods.

2. Ablation studies further reveal the underlying reasons for the effectiveness of the MAS framework, providing valuable insights for future research.

2 Motivation

2.1 Multi-Tool Planning

Numerous methods have been proposed to equip LLMs with multi-tool planning for solving complex problems. Chain-of-Thought (CoT) Wei et al. (2023) was the first to propose the method of thought and answer chain reasoning. ReACT Yao et al. (2022) further introduced the thought-action-observation format for tool chain reasoning, leading to the development of various multi-tool planning methods Chen et al. (2023a); Xu et al. (2023); Shinn et al. (2023). The latest work, DFSDT Qin et al. (2024), was proposed to address the inherent limitations of CoT and ReACT: error propagation and limited exploration. The Depth-First Search-based Decision Tree, denoted as DFSDT, is powerful in addressing multi-tool planning problems. Its core concept involves employing a depth-first search (DFS) approach for multi-tool planning (for more details, see Appendix A). When a tool fails or is deemed inadequate for solving the current problem, DFSDT backtracks to the previous solution state and attempts to resolve the issue using a different tool. However, we identified several limitations in the mechanism of DFSDT: (1) instability of the rollback mechanism, (2) redundant context, and (3) premature termination. The following sections introduce these limitations in detail.

2.1.1 Instability of the Rollback Mechanism

The rollback mechanism in DFSDT is determined by the model. The number of steps to roll back and the selection of new tools after rollback are guided by a prompt containing the errors encountered in the previous failed trajectory. When the model is sufficiently robust, this rollback mechanism serves as a highly flexible and efficient planning strategy. However, when the model’s capability is insufficient, it fails to execute the rollback correctly, e.g., by retrying the same faulty tools or rolling back too far.

2.1.2 Redundant Context

In the process of planning with DFSDT, each tool plan is generated using the entire conversation history (including all the thoughts, actions, action inputs and tool responses) as context. However, in reality, each step of tool planning only requires a very small portion of the relevant history for effective planning.

Context redundancy not only increases computational overhead but also reduces the accuracy of model inference due to the inclusion of irrelevant historical data. As highlighted by Liu et al. (2024), the effect of redundant context becomes particularly noticeable in tasks requiring assimilation and processing of large inputs, like verbose tool documents and API responses. The situation worsens when LLMs are supplemented with external information, such as document retrieval or online searching Petroni et al. (2020); Ram et al. (2023); Mallen et al. (2022). Although numerous language models capable of handling larger contexts are emerging Dai et al. (2019); Dao et al. (2022), they often face significant performance degradation when the important information is located at certain positions in the context Liu et al. (2024); Shi et al. (2023), which is known as the ‘lost-in-the-middle’ problem.

2.1.3 Premature Termination

The termination mechanism set by DFSDT involves adding a termination tool to the model’s selectable toolkit. When the model selects this termination tool, DFSDT stops and provides an answer. However, in practical applications, this mechanism often prematurely terminates when dealing with complex problems requiring multi-step reasoning. We hypothesize that this issue arises due to the redundant interference of other tool information and history information, which disrupts the model’s ability to judge whether the original problem should be terminated. Instead, the model focuses on whether the current sub-problem requires termination, leading the mechanism to terminate after resolving the sub-problem.

2.2 Multi-Agent System

Method	Multi-Agent	Training	Generality	Reflection	Planning
ReAct Yao et al. (2022)	✗	✗	✔	✗	Iterative
Reflexion Shinn et al. (2023)	✗	✗	✔	✔	Iterative
Chameleon Lu et al. (2023)	✗	✗	✔	✗	Global
HuggingGPT Shen et al. (2023)	✗	✗	✔	✗	Global
BOLAA Liu et al. (2023)	✔	✗	✔	✗	Iterative
AgentVerse Chen et al. (2023b)	✔	✗	✔	✗	Iterative
FireAct Chen et al. (2023a)	✗	✔	✗	✔	Iterative
DFSDT Qin et al. (2024)	✗	✔	✗	✗	Iterative
RestGPT Song et al. (2023)	✔	✗	✔	✗	Iterative
Lumos Yin et al. (2024)	✔	✔	✗	✗	Iterative or Global
AutoAct Qiao et al. (2024)	✔	✔	✗	✔	Iterative
Smurfs (Ours)	✔	✗	✔	✔	Iterative and Global

Table 2: Comparison of related works.

To address the limitations inherent in DFSDT and to further enhance LLMs’ multi-tool planning capabilities, the multi-agent system (MAS) has emerged as a natural solution. Inspired by human social division of labor and cooperation, a MAS aims to enable AI agents to accomplish more complex tasks more effectively and efficiently through the division of labor and collaboration. Previous works Song et al. (2023); Liu et al. (2023); Chen et al. (2023b); Yin et al. (2024); Qiao et al. (2024) have leveraged MAS to achieve this goal. Table 2 shows the differences between them. Building on those works, we design a MAS named Smurfs to address the issues with DFSDT. By dividing tasks among different agents, each agent can focus on a specific part of the DFSDT task, accessing only the necessary history as context during task execution, which effectively addresses the issue of redundant context. Redesigning the rollback mechanism around memory rollback and tool list rollback addresses the instability of the original mechanism. Drawing on the concept of least-to-most prompting Zhou et al. (2023), the original problem is first decomposed into sub-problems for macro-level planning. Subsequently, DFSDT is used to solve each sub-problem at the micro-level, with macro-level planning guiding the micro-level planning, thereby resolving the issue of premature termination.

3 Smurfs: A framework with multiple agents

The Smurfs, the beloved cartoon characters, symbolize unity and resourcefulness, and are good at using tools to overcome any challenge they encounter.

3.1 Framework Overview

Figure 1 illustrates the entire workflow of the Smurfs framework. Initially, the Planning Agent identifies the user’s complex request and breaks it down into manageable sub-tasks. Executor Agents are then tasked with collecting task-specific information, utilizing access to external tools. The Answer Agent compiles the findings into a cohesive response, which is subsequently verified by the Verifier Agent to ensure accuracy and relevance. Each agent focuses on its own task and uses only the relevant part of the conversation history, which reduces redundant context. This process exemplifies the framework’s capability to efficiently handle complex queries by leveraging the specialized roles of multiple agents, thereby ensuring both the precision of task execution and the quality of the output. In the following sections, the system mechanism and functions of each agent are detailed. More details of the memory system can be found in Appendix B.

3.2 Agent Components

Tools

The documents describing the tools that Smurfs can use to complete a complex task are denoted as $D=\{n_{i},d_{i},p_{i}\}^{|d|}_{i=1}$, where $n_{i}$ represents the tool name, $d_{i}$ the tool usage description, $p_{i}$ the parameter description, and $|d|$ the number of available tools. The available tool list is denoted as $\tau=\{n_{i},d_{i}\}^{|\tau|}_{i=1}$. $\tau_{t}$ denotes the tool list Smurfs can utilize at time t.
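As a minimal illustration of this notation, the sketch below stores the tool documents D in a dictionary and derives the tool list $\tau$ by dropping the parameter descriptions. The class name `ToolDoc`, the helper `tool_list`, and the two example tools are hypothetical and are not taken from the Smurfs codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolDoc:
    """One entry (n_i, d_i, p_i) of the tool documents D."""
    name: str          # n_i: tool name
    description: str   # d_i: tool usage description
    parameters: str    # p_i: parameter description


def tool_list(docs):
    """Build tau = {(n_i, d_i)} from D by omitting the parameter descriptions."""
    return [(doc.name, doc.description) for doc in docs.values()]


# Hypothetical tools, for illustration only.
D = {
    "search": ToolDoc("search", "Search the web for a query.", "query: str"),
    "weather": ToolDoc("weather", "Get the forecast for a city.", "city: str"),
}
tau = tool_list(D)
```

Keeping parameter descriptions out of $\tau$ means that tool selection only sees names and usage descriptions, while the full document of a tool is needed only once that tool has been chosen.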

Memory

The memory of the agent system at time t is the history of the task-solving process before t, denoted as $M=(m_{1},m_{2},...,m_{t-1})$ with $m_{i}=(\gamma_{i},a_{i})$, where $m_{i}$ represents the memory element at time i and $\gamma_{i}$, $a_{i}$ represent the thought and answer generated by the system at time i. There are two types of memory in Smurfs: local memory and global memory. The local memory records the ongoing solution trajectory and is used to generate the next action in the current trajectory. The global memory, meanwhile, records all trajectories and is used to generate the sub-problem’s answer by combining all trajectory records when the maximum number of retries is exceeded. This local-global combined memory system ensures that the planning of the current solution trajectory is not influenced by the context of erroneous trajectories. It also allows the system to generate an answer that combines all trajectories when the Verifier Agent cannot determine task completion within the maximum number of planned steps. This memory system ensures context efficiency during the task-solving process.
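The local/global split can be sketched as two lists that are written together but rolled back separately. This is an illustrative sketch only: the class `Memory` and its method names are assumptions, not the data structures used in the Smurfs implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Local/global memory pair; each element is m_i = (thought, answer)."""
    local: list = field(default_factory=list)    # current trajectory only
    global_: list = field(default_factory=list)  # every trajectory tried so far

    def record(self, thought, answer):
        """Append the element for the current step to both memories."""
        element = (thought, answer)
        self.local.append(element)
        self.global_.append(element)

    def rollback(self):
        """Drop the last step from the local memory; the global memory keeps it."""
        return self.local.pop()


memory = Memory()
memory.record("look up the director", "director found")
memory.record("query the filmography tool", "tool returned nothing useful")
memory.rollback()  # abandon the second step on the current trajectory
```

After the rollback, planning continues from a local memory that no longer contains the abandoned step, while the global memory still holds it for a final combined answer.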

3.3 Macro Planning

Planning Agent

The primary responsibility of the Planning Agent is macro-level planning through task decomposition, which prevents premature termination. The inference process of the Planning Agent is:

	$\displaystyle \text{Plan}\ P:(p_{1},p_{2},\ldots)=PA(q)$		(1)

Where $p_{i}$ represents a sub-problem of the original query q and PA represents the Planning Agent. After task decomposition, the agent system uses the Executor Agent, Answer Agent, and Verifier Agent to collaboratively solve the sub-problems in sequential order using DFSDT. To make the answers to earlier sub-problems available when solving later ones, the strategy known as least-to-most prompting Zhou et al. (2023) is used.
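The sequential, least-to-most style of solving can be sketched as a loop that hands each sub-problem the answers obtained so far. Everything below is a toy stand-in: `planning_agent` splits on the word "and" instead of calling an LLM, and `toy_solver` only reports how much earlier context it received.

```python
def solve_sequentially(sub_problems, solve):
    """Solve sub-problems in order, passing earlier answers as context."""
    solved = []  # (sub_problem, answer) pairs accumulated so far
    for p in sub_problems:
        answer = solve(p, list(solved))
        solved.append((p, answer))
    return solved


def planning_agent(query):
    """Toy decomposition; a real Planning Agent would prompt an LLM."""
    return [part.strip() for part in query.split(" and ")]


def toy_solver(sub_problem, previous):
    """Toy solver that only reports how many earlier answers it could use."""
    return f"answer to '{sub_problem}' using {len(previous)} earlier answer(s)"


plan = planning_agent("find the film's director and list their other films")
results = solve_sequentially(plan, toy_solver)
```

The second sub-problem ("list their other films") is only answerable because the answer to the first one is passed along, which is the dependency that least-to-most prompting is meant to handle.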

3.4 Subtask Solving Process

Having introduced the function of the Planning Agent, this section outlines how the agents collaborate to solve sub-tasks, as shown in Figure 2.

Stable Rollback

To address the instability of the rollback mechanism in DFSDT, we propose a rule-based rollback mechanism. Whenever an error occurs while using a tool $\tau_{t,i}$ at time t, $\tau_{t,i}$ is popped from the tool list $\tau_{t}$, and tool selection and tool planning are performed again, which ensures that the faulty tool is not selected again. If, at time t, the tool list becomes empty, it signifies that after the system chose tool $\tau_{t-1,j}$ at time t-1, no subsequent trajectory can solve the problem. In this case, the agent system rolls back to time t-1: the memory element $m_{t-1}$ is popped from the local memory M, and tool $\tau_{t-1,j}$ is popped from the tool list $\tau_{t-1}$. The agent system then sets t=t-1 and continues planning. Compared to the original model-based rollback mechanism of DFSDT, this rule-based mechanism is less flexible and might reduce rollback efficiency. However, it is more stable, ensuring the correctness of the depth-first search and enabling models with weaker capabilities to use DFSDT for tool planning.
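The rollback rules above can be sketched as a small depth-first loop over per-step tool lists. This is a simplified sketch under stated assumptions: the function names are hypothetical, tool selection takes the first remaining tool instead of asking an LLM, and the memory stores the tool name in place of the thought.

```python
def plan_with_rollback(tools, run_tool, is_solved, max_steps=20):
    """Depth-first tool planning with rule-based rollback.

    run_tool(tool, memory) returns an answer string, or None if the tool errors.
    is_solved(memory) returns True once the trajectory answers the problem.
    """
    memory = []                 # local memory M: one (tool, answer) per step
    tool_lists = [list(tools)]  # tool_lists[t] plays the role of tau_t
    for _ in range(max_steps):
        if is_solved(memory):
            return memory
        tau_t = tool_lists[-1]
        if not tau_t:
            # tau_t is empty: nothing can follow the tool chosen at t-1.
            if not memory:
                return None                     # no earlier step to return to
            tool_lists.pop()                    # discard the empty tau_t
            failed_tool, _ = memory.pop()       # pop m_{t-1} from local memory
            tool_lists[-1].remove(failed_tool)  # pop tau_{t-1,j} from tau_{t-1}
            continue
        tool = tau_t[0]                         # stand-in for LLM tool selection
        answer = run_tool(tool, memory)
        if answer is None:
            tau_t.remove(tool)                  # faulty tool is not retried at t
            continue
        memory.append((tool, answer))
        tool_lists.append(list(tools))          # fresh tool list for step t+1
    return memory if is_solved(memory) else None


# Toy environment: "a" succeeds first but leads to a dead end; "b" then "c" works.
WORKS = {(): {"a", "b"}, ("a",): set(), ("b",): {"c"}}


def run_tool(tool, memory):
    path = tuple(name for name, _ in memory)
    return f"{tool} ok" if tool in WORKS.get(path, set()) else None


def is_solved(memory):
    return [name for name, _ in memory] == ["b", "c"]


trajectory = plan_with_rollback(["a", "b", "c"], run_tool, is_solved)
```

In the toy run, the search first commits to "a", exhausts every tool after it, rolls back one step, removes "a" from the first tool list, and then finds the trajectory "b" followed by "c". Because the rules never depend on model judgment, the same sequence of rollbacks occurs regardless of how capable the backbone is.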

Backbone	Method	I1-Inst. Pass	I1-Inst. Win	I1-Cat. Pass	I1-Cat. Win	I1-Tool. Pass	I1-Tool. Win	I2-Cat. Pass	I2-Cat. Win	I2-Inst. Pass	I2-Inst. Win	I3-Inst. Pass	I3-Inst. Win	Average Pass	Average Win
GPT-3.5 Turbo	ReACT	$41.6_{\pm 1.2}$		$48.4_{\pm 0.5}$		$52.5_{\pm 0.5}$		$52.2_{\pm 1.0}$		$31.6_{\pm 1.2}$		$39.9_{\pm 2.0}$		$44.4_{\pm 1.1}$	/
GPT-3.5 Turbo	DFSDT	$54.1_{\pm 1.0}$	64.4	$60.1_{\pm 0.0}$	61.4	$59.9_{\pm 1.7}$	53.8	$60.9_{\pm 0.9}$	62.9	$52.8_{\pm 3.7}$	66.0	$44.3_{\pm 4.8}$	54.1	$55.4_{\pm 2.0}$	60.4
GPT-3.5 Turbo	Smurfs	$\underline{60.3_{\pm 1.5}}$	65.0	$67.0_{\pm 1.0}$	$\underline{69.9}$	$60.3_{\pm 1.3}$	54.4	$54.3_{\pm 0.4}$	$\underline{63.7}$	$42.6_{\pm 1.6}$	64.2	$60.1_{\pm 1.0}$	57.4	$57.4_{\pm 1.1}$	62.4
Mistral-7B	ReACT	0.0
Mistral-7B	DFSDT	0.0
Mistral-7B	Smurfs	$\mathbf{76.3_{\pm 0.8}}$	63.8	$\mathbf{86.7_{\pm 1.2}}$	62.7	$\mathbf{81.0_{\pm 1.9}}$	58.2	$\mathbf{70.4_{\pm 2.7}}$	54.0	$\mathbf{63.8_{\pm 2.4}}$	$\underline{67.0}$	$\mathbf{85.2_{\pm 0.7}}$	57.4	$\mathbf{77.2_{\pm 1.6}}$	60.5
GPT-4 Turbo	ReACT	$41.1_{\pm 1.5}$	60.1	$53.2_{\pm 1.3}$	62.1	$42.2_{\pm 1.1}$	48.1	$50.0_{\pm 0.7}$	57.3	$38.7_{\pm 0.8}$	65.1	$37.7_{\pm 1.3}$	47.5	$43.8_{\pm 1.1}$	56.7
GPT-4 Turbo	DFSDT	$52.7_{\pm 1.4}$	$\underline{69.9}$	$58.2_{\pm 0.9}$	66.0	$59.7_{\pm 1.2}$	$\underline{58.2}$	$59.3_{\pm 0.7}$	62.1	$52.2_{\pm 2.3}$	$\mathbf{67.9}$	$61.5_{\pm 1.8}$	$\underline{65.6}$	$57.3_{\pm 1.4}$	$\underline{65.0}$
GPT-4 Turbo	Smurfs	$59.3_{\pm 1.4}$	$\mathbf{71.2}$	$\underline{73.3_{\pm 1.3}}$	$\mathbf{72.5}$	$\underline{67.4_{\pm 0.7}}$	$\mathbf{69.6}$	$\underline{66.7_{\pm 1.9}}$	$\mathbf{73.4}$	$\underline{55.5_{\pm 1.4}}$	66.0	$\underline{70.5_{\pm 0.0}}$	$\mathbf{72.1}$	$\underline{65.5_{\pm 1.1}}$	$\mathbf{70.8}$

Table 3: The open-ended tool planning task evaluation on the StableToolBench benchmark Guo et al. (2024). The most effective approach is highlighted in bold, while the second-best is underlined. Win rate is calculated by comparing each model with ChatGPT-ReACT. A win rate higher than 50% means the model performs better than ChatGPT-ReACT.

Executor Agent

The Executor Agent is responsible for choosing and executing the tools to solve the sub-tasks. At each time t, the agent can invoke one tool to tackle the given task:

	$\displaystyle\gamma=EA.gen\_thought(p,M,\tau,h)$		(2)
	$\displaystyle\alpha=EA.choose\_tool(p,\gamma,\tau)$		(3)
	$\displaystyle\beta=EA.gen\_arguments(p,M,D[\alpha])$		(4)
	$\displaystyle r=EA.call\_tool(\alpha,\beta)$		(5)

Where p is the sub-problem from the Planning Agent, h is the hint from the Verifier Agent, $\tau$ is the tool list, M is the local memory, and $D[\alpha]$ is the tool document of tool $\alpha$. The agent uses the ReACT format Yao et al. (2022) to choose the tool and its arguments, and then executes the tool. Note that each inference step uses only the relevant part of the local memory and tool list, which reduces context redundancy. More detailed information on the Executor Agent can be found in Figure 6.
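One executor step following Eqs. (2) to (5) can be sketched as four calls, each receiving only the inputs named in its equation. The class below is an assumption-laden sketch: the prompt wording, the `llm` callable interface, the JSON argument format, and the `fake_llm` stub are all hypothetical rather than the prompts or parsing used by Smurfs.

```python
import json


class ExecutorAgent:
    """Sketch of the four Executor Agent calls; llm(prompt) -> str is assumed."""

    def __init__(self, llm, tool_functions):
        self.llm = llm
        self.tool_functions = tool_functions  # tool name -> Python callable

    def gen_thought(self, p, M, tau, h):
        return self.llm(f"Problem: {p}\nHistory: {M}\nTools: {tau}\nHint: {h}\nThought:")

    def choose_tool(self, p, gamma, tau):
        return self.llm(f"Problem: {p}\nThought: {gamma}\nTools: {tau}\nTool name:")

    def gen_arguments(self, p, M, doc):
        raw = self.llm(f"Problem: {p}\nHistory: {M}\nTool doc: {doc}\nJSON arguments:")
        return json.loads(raw)

    def call_tool(self, alpha, beta):
        return self.tool_functions[alpha](**beta)


def fake_llm(prompt):
    """Canned responses standing in for a real model."""
    if prompt.endswith("Thought:"):
        return "I should look up the weather."
    if prompt.endswith("Tool name:"):
        return "weather"
    return '{"city": "Shenzhen"}'


EA = ExecutorAgent(fake_llm, {"weather": lambda city: f"Sunny in {city}"})
tau = [("weather", "Get the forecast for a city.")]
D = {"weather": "weather(city: str) -> forecast text"}
p, M, h = "What is the weather in Shenzhen?", [], ""

gamma = EA.gen_thought(p, M, tau, h)      # Eq. (2)
alpha = EA.choose_tool(p, gamma, tau)     # Eq. (3)
beta = EA.gen_arguments(p, M, D[alpha])   # Eq. (4)
r = EA.call_tool(alpha, beta)             # Eq. (5)
```

Note that only `gen_arguments` sees a full tool document, and only the document of the selected tool, which mirrors the context-efficiency argument made in the text.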

Answer Agent

To mitigate the performance degradation caused by lengthy contexts, we introduce the Answer Agent role, designed to extract crucial content for each step and sub-problem:

	$\displaystyle \text{Answer}:\ a=AA(q,r,M)$		(6)

Where q is the sub-problem from the Planning Agent, r is the response from the Executor Agent, and M is the local memory (or the global memory once the maximum number of retries is reached). As discussed for the ‘lost-in-the-middle’ problem in Section 2.1.2, retaining all information may not always be beneficial, particularly in cases where the solution path is challenging to discern. Therefore, the primary role of the Answer Agent is to succinctly summarize the generated answers and tool responses to keep the memory efficient.

Verifier Agent

The Verifier Agent serves as an early-stopping and reflection mechanism, allowing for a balance between effectiveness and efficiency:

	$\displaystyle h,c=VA(q,a)$		(7)

Where q denotes the sub-problem from the Planning Agent, a denotes the answer from the Answer Agent, and h and c denote the hint and check status respectively. If the check status is 0, the Verifier Agent considers the sub-problem incomplete; the system then adds the thought and answer of the current step to the local and global memory, sets t=t+1, and continues the inference procedure. If the check status is 1, the sub-problem is considered solved and the system moves on to the next sub-problem.
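Putting the three agents together, the control flow for one sub-problem can be sketched as follows. This is a simplified sketch: rollback is omitted for brevity, the three callables are placeholders for the Executor, Answer, and Verifier Agents, and the step limit and stub behaviours are assumptions chosen for illustration.

```python
def solve_sub_problem(q, execute, answer, verify, max_steps=5):
    """Executor -> Answer -> Verifier loop for a single sub-problem."""
    local_memory, global_memory = [], []
    hint = ""
    for _ in range(max_steps):
        thought, response = execute(q, local_memory, hint)
        a = answer(q, response, local_memory)
        hint, check = verify(q, a)
        if check == 1:
            return a                          # verifier accepts: stop early
        local_memory.append((thought, a))     # check == 0: record and continue
        global_memory.append((thought, a))
    # Step budget exhausted: answer from everything recorded in global memory.
    return answer(q, None, global_memory)


def execute(q, memory, hint):
    attempt = len(memory) + 1
    return f"attempt {attempt}", f"raw response {attempt}"


def answer(q, response, memory):
    if response is None:
        return f"summary of {len(memory)} steps"
    return f"summary of {response}"


def verify(q, a):
    done = a.endswith("2")  # toy rule: the second attempt is accepted
    return ("" if done else "try a different tool"), int(done)


result = solve_sub_problem("Who directed the film?", execute, answer, verify)
fallback = solve_sub_problem(
    "Who directed the film?", execute, answer, lambda q, a: ("retry", 0), max_steps=3
)
```

The first call stops as soon as the verifier returns check status 1, while the second never passes verification and therefore falls back to an answer built from the global memory.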

4 Experiments

To evaluate both the effectiveness and efficiency of the Smurfs framework, in this section we carry out two multi-tool planning tasks: (1) an open-ended task, StableToolBench Guo et al. (2024), and (2) a closed-ended task, HotpotQA Yang et al. (2018). In addition to these main experiments designed to assess the entire framework, we conducted an ablation study followed by a case study to test the capabilities of each component within the multi-agent framework and investigate the underlying reasons for its effectiveness.

4.1 Open-ended Task: StableToolBench

StableToolBench is a tool learning benchmark derived from ToolBench Qin et al. (2024), encompassing multi-step tool usage tasks across over 16,000 APIs. The benchmark employs two metrics for evaluation: (1) Pass Rate measures the percentage of instructions successfully executed within the allocated budget. (2) Win Rate represents the preference selection by a ChatGPT evaluator when presented with two solution paths.

Baselines

Following the original paper that introduced the benchmark, we adopt ReACT (CoT) Wei et al. (2023) and DFSDT Qin et al. (2024) as baseline methods for comparison. Additionally, we include the backbones used in that paper: gpt-3.5-turbo-0613 (GPT-3.5 Turbo) OpenAI and gpt-4-turbo-preview (GPT-4 Turbo). To explore the adaptability of the tool-planning methods, we also include Mistral-7B-Instruct-v0.2 (Mistral-7B) Jiang et al. (2023) as one of the selected backbones in our experiments.

Settings

To minimize the influence of varying tool APIs on experimental results, we conducted all experiments using the same API cache Guo et al. (2024). For a fair comparison among the candidate methods and to reduce variability, each model was executed once and evaluated three times, with results averaged. Other settings follow those specified in the original benchmark paper.

Results

Table 3 displays the results on StableToolBench. For the untrained LLM, Mistral-7B, existing agent frameworks did not improve its performance in tool planning tasks; Mistral-7B failed these tasks when integrated with the ReACT and DFSDT frameworks. (Experimental results show that Mistral-7B failed to correctly execute the ‘finish’ action during inference, resulting in invalid responses.) However, Smurfs exhibited exceptional performance: when combined with Mistral-7B, Smurfs achieved the highest average pass rate in Table 3 (77.2%). Through its task decomposition mechanism, Smurfs transforms long-context tasks into simpler ones, enabling the untrained model to effectively utilize external tools for managing complex tasks. Regarding closed-source models, specifically GPT-4 Turbo in these experiments, Smurfs also demonstrated outstanding performance compared to other agent frameworks and achieved state-of-the-art results on the benchmark. Its high success rate suggests that Smurfs is more effective at finding optimal solution paths than ChatGPT.

Further Analysis

We conducted a detailed analysis of the token costs associated with each tool planning method, a critical evaluation aspect for multihop reasoning tasks. As shown in Table 1 (detailed in Appendix E), the average token costs per question and per API request are evaluated for ReACT, DFSDT, and Smurfs on StableToolBench. The analysis reveals that DFSDT requires about 20,000 tokens per question, encompassing both prompt and completion tokens. This is roughly 3.2 times the token cost of ReACT and about 2.6 times that of Smurfs. Despite this higher token cost, DFSDT does not demonstrate commensurate effectiveness improvements over other methods. These findings underscore the cost-efficiency of the proposed MAS framework, Smurfs, which not only reduces token expenditure in solving multihop planning tasks but also delivers outstanding performance in evaluations.
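The relative costs can be checked directly from the per-query token counts in Table 1. The short script below only re-derives ratios from those published numbers; the variable names are arbitrary.

```python
# Average tokens per query on StableToolBench, copied from Table 1.
tokens_per_query = {"ReACT": 6479, "DFSDT": 20714, "Smurfs": 8096}

dfsdt = tokens_per_query["DFSDT"]
ratio_vs_react = dfsdt / tokens_per_query["ReACT"]
ratio_vs_smurfs = dfsdt / tokens_per_query["Smurfs"]

print(f"DFSDT / ReACT:  {ratio_vs_react:.2f}x")
print(f"DFSDT / Smurfs: {ratio_vs_smurfs:.2f}x")
```

Both ratios are computed per query; the per-request column of Table 1 could be compared in the same way.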

Backbone	Method	Training	Agents	Easy	Medium	Hard	All
GPT-3.5 Turbo	CoT	✗	Single	48.21	44.52	34.22	42.32
GPT-3.5 Turbo	Zero-Shot Plan	✗	Single	50.71	45.17	38.23	44.70
Mistral-7B Instruct-v0.2	CoT	✗	Single	33.70	22.38	22.14	26.07
Mistral-7B Instruct-v0.2	ReAct	✗	Single	38.09	27.57	22.05	29.24
Mistral-7B Instruct-v0.2	Chameleon	✗	Single	37.07	26.67	19.20	27.65
Mistral-7B Instruct-v0.2	Reflexion	✗	Single	40.78	35.02	28.36	34.72
Mistral-7B Instruct-v0.2	BOLAA	✗	Multi	40.86	32.11	22.36	31.78
Mistral-7B Instruct-v0.2	ReWOO	✗	Multi	38.42	31.89	25.98	32.10
Mistral-7B Instruct-v0.2	Smurfs (ours)	✗	Multi	45.94	40.74	30.72	39.13
Mistral-7B Instruct-v0.2	FireAct	✔	Single	45.52	32.02	30.17	35.90
Mistral-7B Instruct-v0.2	AUTOACT	✔	Multi	48.69	36.65	31.37	38.89
Llama-2 13B-chat	CoT	✗	Single	37.90	25.28	21.64	28.27
Llama-2 13B-chat	ReAct	✗	Single	28.68	22.15	21.69	24.17
Llama-2 13B-chat	Chameleon	✗	Single	40.01	25.39	22.82	29.41
Llama-2 13B-chat	Reflexion	✗	Single	44.43	37.50	28.17	36.70
Llama-2 13B-chat	BOLAA	✗	Multi	33.23	25.46	25.23	27.97
Llama-2 13B-chat	ReWOO	✗	Multi	30.09	24.01	21.13	25.08
Llama-2 13B-chat	Smurfs (ours)	✗	Multi	42.62	27.21	22.92	30.92
Llama-2 13B-chat	FireAct	✔	Single	45.83	38.94	26.06	36.94
Llama-2 13B-chat	AUTOACT	✔	Multi	47.29	41.27	32.92	40.49
Llama-2 70B-chat	CoT	✗	Single	45.37	36.33	32.27	37.99
Llama-2 70B-chat	ReAct	✗	Single	39.70	37.19	33.62	36.83
Llama-2 70B-chat	Chameleon	✗	Single	46.86	38.79	34.43	40.03
Llama-2 70B-chat	Reflexion	✗	Single	48.01	46.35	35.64	43.33
Llama-2 70B-chat	BOLAA	✗	Multi	46.44	37.29	33.49	39.07
Llama-2 70B-chat	ReWOO	✗	Multi	42.00	39.58	35.32	38.96
Llama-2 70B-chat	Smurfs (ours)	✗	Multi	52.86	50.77	44.87	49.50
Llama-2 70B-chat	FireAct	✔	Single	50.82	41.43	35.86	42.70
Llama-2 70B-chat	AUTOACT	✔	Multi	56.94	50.12	38.35	48.47

Table 4: The closed-ended tool planning evaluation on HotpotQA Yang et al. (2018), with some results derived from Qiao et al. (2024). The most effective approach for each group is highlighted in bold, while the second-best is underlined. Methods marked ✔ in the Training column require model training; the Agents column indicates whether a method is single-agent or multi-agent.

4.2 Closed-ended Task: HotpotQA

Compared to open-ended tasks, closed-ended tasks provide a more stable and robust evaluation. To this end, we evaluate the methods on HotpotQA Yang et al. (2018) in addition to StableToolBench. HotpotQA is a multi-hop QA task that is challenging due to the requirement for rich background knowledge, with answers typically being short entities or yes/no responses.

Baselines

The compared baselines include CoT Wei et al. (2023), ReAct Yao et al. (2022), Chameleon Lu et al. (2023), Reflexion Shinn et al. (2023), BOLAA Liu et al. (2023), ReWOO Xu et al. (2023), FireAct Chen et al. (2023a), and AutoAct Qiao et al. (2024).

Settings and Metrics

Following the settings in Qiao et al. (2024), we use open-source Llama-2 models Touvron et al. (2023) and Mistral-7B Jiang et al. (2023) as the backbones of each agent to evaluate the performance of Smurfs. The evaluation metric is $\texttt{reward}\in[0,1]$, defined as the F1 score between the prediction and the ground-truth answer. For more details about the experiment, see Appendix C.
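For reference, a token-level F1 of this kind can be sketched as below. The normalization (lowercasing, removing punctuation and English articles) follows the common SQuAD-style convention and is an assumption here; the exact preprocessing used in the experiments is not specified in this section, and the function names are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and articles, and split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def f1_reward(prediction, ground_truth):
    """Token-level F1 between prediction and ground truth, in [0, 1]."""
    pred, gold = normalize(prediction), normalize(ground_truth)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


exact = f1_reward("The Eiffel Tower", "Eiffel Tower")
partial = f1_reward("Paris, France", "Paris")
miss = f1_reward("yes", "no")
```

A prediction with extra tokens, such as "Paris, France" against "Paris", keeps full recall but loses precision, so it receives partial credit instead of a score of zero.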

Results

Smurfs, as an untrained MAS, outperforms the other untrained agents and matches or even surpasses the accuracy of trained agents on most backbone models. This demonstrates that the mechanism of Smurfs provides strong generalization capabilities while maintaining high effectiveness.

Observations indicate that the performance of Llama-2-13B-chat with Smurfs is suboptimal, likely due to its limited capability in generating tool arguments. Specifically, even when the Executor Agent selects a relevant tool, the model tends to hallucinate arguments that the tool cannot use. This indicates that Llama-2-13B-chat may need further training for tool usage. The experimental results are consistent with this view: with Llama-2-13B-chat, the untrained methods generally show significantly lower accuracy than the trained methods. Nevertheless, Smurfs achieves the second-highest accuracy among the untrained methods, behind only Reflexion, which still attests to Smurfs’ capability.

Configuration	I3-Inst. Pass (%)	I3-Inst. Win (%)
GPT-3.5 Turbo with Smurfs	$60.1_{\pm 1.0}$	57.4
w/o Answer Agent	$57.4_{\pm 2.9}$	$49.2$
w/o Verifier Agent	$54.1_{\pm 2.7}$	$42.6$
w/o Planning Agent	$35.5_{\pm 3.3}$	$42.6$
GPT-4 Turbo with Smurfs	$70.5_{\pm 1.0}$	72.1
w/o Answer Agent	$82.2_{\pm 2.5}$	$72.1$
w/o Verifier Agent	$79.2_{\pm 0.8}$	$63.9$
w/o Planning Agent	$71.9_{\pm 2.8}$	$63.9$

Table 5: Ablation study on StableToolBench I3-Inst subset to investigate the importance of each component within the framework.

5 Ablation Study

5.1 Importance of each component in MAS

We performed an ablation study to investigate the impact of each agent in our framework. We removed each agent individually, except for the indispensable Executor Agent, and compared the results to the complete framework. Table 5 shows that the Planning Agent is the most crucial component, followed by the Verifier Agent, with the Answer Agent being the least important.

(1) Verifier Agent Removal: Without verification, the framework uses a general depth-first search, leading to increased computational demand and more tool invocations.

(2) Answer Agent Removal: Removing this agent means the Executor Agent’s answers are no longer summarized, risking the ‘lost-in-the-middle’ problem due to lengthy tool responses. As shown in the results, a more capable model, GPT-4 Turbo, can mitigate the negative impact of the Answer Agent’s removal. We believe this is because the more powerful model can leverage more information effectively.

(3) Planning Agent Removal: Removing the Planning Agent affects the global path-searching strategy. Models with Smurfs may show reduced performance without preliminary planning, as seen in current frameworks like ReACT and DFSDT. The results demonstrate that the impact of removing the Planning Agent is significant, as it directly influences the multihop reasoning ability of the MAS.

5.2 Case Study

As shown in Figure 3, although GPT-4-DFSDT and GPT-4-Smurfs use the same tool calls to solve the problem, GPT-4-DFSDT only answers the first sub-question correctly while GPT-4-Smurfs answers both sub-questions accurately. In the process of addressing the second sub-question, it is notable that the tool response only mentions titles of film and television products related to "Star Wars", without addressing OTT platforms. GPT-4-DFSDT erroneously interprets these titles as responses to the question, while GPT-4-Smurfs identifies this discrepancy and provides a more appropriate response. This case highlights that in situations where tool responses are lengthy and questions are complex, a single-agent framework like DFSDT may be susceptible to distractions from irrelevant information, leading to erroneous answers. Conversely, the context-efficient Smurfs framework is less susceptible to irrelevant information and therefore generates more accurate answers.

6 Conclusion

In this study, we present a novel MAS framework, ‘Smurfs’, tailored to enhance the planning and reasoning capabilities of LLMs in handling complex tasks that involve lengthy contexts and tools. We conduct experiments on StableToolBench, a multi-step tool usage benchmark, and on HotpotQA, and the results demonstrate the overall effectiveness and efficiency of the Smurfs framework compared to baseline methods.

In conclusion, this research contributes to the expanding field of study focused on enhancing LLM capabilities, particularly for multi-step tool usage tasks. It emphasizes the importance of task decomposition, preliminary planning, and efficient verification for improving task execution performance. For future work, we believe incorporating more dedicated and specific roles within the system may further enhance effectiveness and efficiency, based on the ‘Smurfs principle’: synergistic collaboration among specialized agents can overcome the limitations faced by individual LLMs.

7 Limitations

Model Size Constraints:

Due to computational constraints, our experiments did not include larger and more diverse types of LLMs.

Agent Component Scale-Up:

Although we selected the most common and intuitive agent roles for the proposed MAS, there are many possibilities for researchers to explore. Investigating more well-designed agent roles may help improve the effectiveness of the agent system, and developing automated methods to identify these roles could facilitate effective scaling.

Acknowledging these limitations, future research should aim to address these gaps to provide a more comprehensive understanding of the Smurfs framework’s capabilities and potential areas for improvement.

References

Ambrose (2001) Stanley H Ambrose. 2001. Paleolithic technology and human evolution. Science, 291(5509):1748–1753.
Chen et al. (2023a) Baian Chen, Chang Shu, Ehsan Shareghi, Nigel Collier, Karthik Narasimhan, and Shunyu Yao. 2023a. Fireact: Toward language agent fine-tuning. Preprint, arXiv:2310.05915.
Chen et al. (2023b) Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, Yujia Qin, Xin Cong, Ruobing Xie, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2023b. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors. Preprint, arXiv:2308.10848.
Dai et al. (2019) Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
Dao et al. (2022) Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359.
Dorri et al. (2018) Ali Dorri, Salil S Kanhere, and Raja Jurdak. 2018. Multi-agent systems: A survey. IEEE Access, 6:28573–28593.
Guo et al. (2024) Zhicheng Guo, Sijie Cheng, Hao Wang, Shihao Liang, Yujia Qin, Peng Li, Zhiyuan Liu, Maosong Sun, and Yang Liu. 2024. Stabletoolbench: Towards stable large-scale benchmarking on tool learning of large language models. Preprint, arXiv:2403.07714.
Guu et al. (2020) Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In International conference on machine learning, pages 3929–3938. PMLR.
Hao et al. (2024) Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. 2024. Toolkengpt: Augmenting frozen language models with massive tools via tool embeddings. Advances in neural information processing systems, 36.
Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825.
Liu et al. (2024) Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173.
Liu et al. (2023) Zhiwei Liu, Weiran Yao, Jianguo Zhang, Le Xue, Shelby Heinecke, Rithesh Murthy, Yihao Feng, Zeyuan Chen, Juan Carlos Niebles, Devansh Arpit, Ran Xu, Phil Mui, Huan Wang, Caiming Xiong, and Silvio Savarese. 2023. Bolaa: Benchmarking and orchestrating llm-augmented autonomous agents. Preprint, arXiv:2308.05960.
Lu et al. (2023) Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. 2023. Chameleon: Plug-and-play compositional reasoning with large language models. Preprint, arXiv:2304.09842.
Mallen et al. (2022) Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511.
Oakley and Museum (1972) Kenneth Page Oakley and London British Museum. 1972. Man the tool-maker. 538. British Museum (Natural History) London.
(16) OpenAI. ChatGPT. https://openai.com/blog/chatgpt.
Petroni et al. (2020) Fabio Petroni, Patrick Lewis, Aleksandra Piktus, Tim Rocktäschel, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. 2020. How context affects language models’ factual predictions. arXiv preprint arXiv:2005.04611.
Qiao et al. (2024) Shuofei Qiao, Ningyu Zhang, Runnan Fang, Yujie Luo, Wangchunshu Zhou, Yuchen Eleanor Jiang, Chengfei Lv, and Huajun Chen. 2024. Autoact: Automatic agent learning from scratch for qa via self-planning. Preprint, arXiv:2401.05268.
Qin et al. (2024) Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, dahai li, Zhiyuan Liu, and Maosong Sun. 2024. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. In The Twelfth International Conference on Learning Representations.
Ram et al. (2023) Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2023. In-context retrieval-augmented language models. Transactions of the Association for Computational Linguistics, 11:1316–1331.
Shen et al. (2023) Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Preprint, arXiv:2303.17580.
Shi et al. (2023) Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed Chi, Nathanael Schärli, and Denny Zhou. 2023. Large language models can be easily distracted by irrelevant context. Preprint, arXiv:2302.00093.
Shinn et al. (2023) Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning. Preprint, arXiv:2303.11366.
Song et al. (2023) Yifan Song, Weimin Xiong, Dawei Zhu, Wenhao Wu, Han Qian, Mingbo Song, Hailiang Huang, Cheng Li, Ke Wang, Rong Yao, Ye Tian, and Sujian Li. 2023. Restgpt: Connecting large language models with real-world restful apis. Preprint, arXiv:2306.06624.
Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
Van der Hoek and Wooldridge (2008) Wiebe Van der Hoek and Michael Wooldridge. 2008. Multi-agent systems. Foundations of Artificial Intelligence, 3:887–928.
Wei et al. (2023) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-thought prompting elicits reasoning in large language models. Preprint, arXiv:2201.11903.
Xu et al. (2023) Binfeng Xu, Zhiyuan Peng, Bowen Lei, Subhabrata Mukherjee, Yuchen Liu, and Dongkuan Xu. 2023. Rewoo: Decoupling reasoning from observations for efficient augmented language models. Preprint, arXiv:2305.18323.
Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
Yao et al. (2022) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.
Yin et al. (2024) Da Yin, Faeze Brahman, Abhilasha Ravichander, Khyathi Chandu, Kai-Wei Chang, Yejin Choi, and Bill Yuchen Lin. 2024. Agent lumos: Unified and modular training for open-source language agents. Preprint, arXiv:2311.05657.
Zhao et al. (2023) Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223.
Zhou et al. (2023) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, and Ed Chi. 2023. Least-to-most prompting enables complex reasoning in large language models. Preprint, arXiv:2205.10625.

Appendix A Details of DFSDT

See Figure 5.

Appendix B Details of the Smurfs

See Figure 6 for the working process of the executor and Figure 4 for the memory and tool library of Smurfs.

Appendix C Experiment Settings for HotpotQA

Following the settings of Qiao et al. (2024), we randomly select 300 dev questions for evaluation, divided into three levels with 100 questions in each level. For the tool library that can be used in HotpotQA, see Table 6.
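The sampling protocol above can be sketched in a few lines. This is a minimal illustration rather than the authors' script: it assumes each question record is a dict carrying a `level` field, and the function name `sample_by_level`, the seed, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def sample_by_level(questions, per_level=100, seed=0):
    """Randomly draw `per_level` questions from each level (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the drawn subset is reproducible
    by_level = defaultdict(list)
    for q in questions:
        by_level[q["level"]].append(q)  # assumes a 'level' key on every record

    subset = []
    for level in sorted(by_level):
        pool = by_level[level]
        if len(pool) < per_level:
            raise ValueError(f"level {level!r} has only {len(pool)} questions")
        subset.extend(rng.sample(pool, per_level))  # sampling without replacement
    return subset


# Toy stand-in for a dev set: 150 questions per level (made-up data).
toy_dev = [
    {"id": f"{lvl}-{i}", "level": lvl}
    for lvl in ("easy", "medium", "hard")
    for i in range(150)
]
subset = sample_by_level(toy_dev)
print(len(subset))
```

Fixing the seed is a design choice for repeatability; the text does not say whether the evaluation subset was drawn with a fixed seed.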

Name	Definition	Usage
BingSearch	BingSearch engine can search for rich knowledge on the internet based on keywords, which can compensate for knowledge fallacy and knowledge outdated.	BingSearch[query], which searches the exact detailed query on the Internet and returns the relevant information to the query. Be specific and precise with your query to increase the chances of getting relevant results. For example, Bingsearch[popular dog breeds in the United States]
Retrieve	Retrieve additional background knowledge crucial for tackling complex problems. It is especially beneficial for specialized domains like science and mathematics, providing context for the task	Retrieve[entity], which retrieves the exact entity on Wikipedia and returns the first paragraph if it exists. If not, it will return some similar entities to retrieve. For example, Retrieve[Milhouse]
Lookup	A Lookup Tool returns the next sentence containing the target string in the page from the search tool, simulating Ctrl+F functionality on the browser.	Lookup[keyword], which returns the next sentence containing the keyword in the last passage successfully found by Retrieve or BingSearch. For example, Lookup[river].

Table 6: Tool library for HotpotQA.
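To make the `Tool[argument]` usage format and the Ctrl+F behaviour of Lookup in Table 6 concrete, the following is a minimal sketch under stated assumptions. The names `parse_action` and `Lookup`, the case-insensitive matching of tool names, and the naive sentence splitter are illustrative choices, not the framework's actual implementation; BingSearch and Retrieve are not implemented because they require external services.

```python
import re

# Canonical tool names from Table 6, keyed by lowercase form (assumption:
# tool names are matched case-insensitively).
TOOLS = {"bingsearch": "BingSearch", "retrieve": "Retrieve", "lookup": "Lookup"}
ACTION_RE = re.compile(r"^(\w+)\[(.*)\]$", re.DOTALL)


def parse_action(text):
    """Split 'Tool[argument]' into (tool, argument); return None if malformed."""
    match = ACTION_RE.match(text.strip())
    if match is None:
        return None
    tool = TOOLS.get(match.group(1).lower())
    if tool is None:
        return None
    return tool, match.group(2).strip()


class Lookup:
    """Ctrl+F style lookup: each call returns the next sentence with the keyword."""

    def __init__(self, passage):
        # Naive split after '.', '!' or '?' followed by whitespace (an assumption).
        parts = re.split(r"(?<=[.!?])\s+", passage)
        self.sentences = [s.strip() for s in parts if s.strip()]
        self.keyword = None
        self.cursor = 0

    def __call__(self, keyword):
        if keyword != self.keyword:  # a new keyword restarts the scan
            self.keyword, self.cursor = keyword, 0
        for i in range(self.cursor, len(self.sentences)):
            if keyword.lower() in self.sentences[i].lower():
                self.cursor = i + 1  # the next call continues after this hit
                return self.sentences[i]
        return None  # no further sentence contains the keyword


passage = ("The Nile is a river in Africa. It flows north. "
           "The river ends in the Mediterranean.")
lookup = Lookup(passage)
tool, keyword = parse_action("Lookup[river]")
print(tool, "->", lookup(keyword))
print(tool, "->", lookup(keyword))
```

Repeated calls with the same keyword step through successive matches, which is the "next sentence" behaviour the table describes; how the real tool handles exhausted matches is not specified in the text, so returning `None` here is an assumption.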

Appendix D Prompts for multi-agent implementation

Prompts used by each agent and their example outputs are shown in Figures 7 to 13.

Appendix E Token Cost on StableToolBench Evaluation

We analyzed the token cost for the StableToolBench experiments. As shown in Table 7, the total token cost for each subtask within the StableToolBench is compared across three candidate tool-planning methods. The data demonstrates that, across all tasks from easy to hard, DFSDT consistently incurs high token costs, while the other two methods maintain relatively low token costs. This verifies the context-efficiency of the proposed method.

Backbone	Method	StableToolBench
		I1-Inst.		I1-Cat.		I1-Tool.		I2-Cat.		I2-Inst.		I3-Inst.		Average
		Total	Avg.	Total	Avg.	Total	Avg.	Total	Avg.	Total	Avg.	Total	Avg.	Total	Avg.
GPT-3.5 Turbo	ReACT	1,010,304	6,198	824,676	5,390	1,010,514	6,396	900,855	7,265	824,510	7,778	461,121	7,559	838,663	6,764
GPT-3.5 Turbo	DFSDT	3,303,062	20,264	2,745,667	17,945	3,152,532	19,953	2,560,297	20,648	3,098,365	29,230	1,390,787	22,800	2,708,452	21,807
GPT-3.5 Turbo	Smurfs	1,090,404	7,127	1,917,348	11,763	1,464,535	9,269	957,088	7,638	1,096,162	10,341	632,084	10,362	1,191,270	9,417

Table 7: Token costs for various candidate tool-planning methods on the StableToolBench benchmark (Guo et al., 2024). ‘Total’ indicates the total number of tokens used to complete each subtask, including both prompt and completion tokens. ‘Avg.’ represents the average number of tokens used per question within the subtasks. Higher token counts imply greater costs for solving the same task.
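As a small worked example of how the table can be read, the snippet below expresses each method's average per-question token cost as a multiple of the Smurfs cost. The three numbers are copied from the ‘Average / Avg.’ column of Table 7; the helper name `relative_cost` is hypothetical.

```python
# 'Average / Avg.' column of Table 7: mean tokens per question (GPT-3.5 Turbo).
avg_tokens_per_question = {"ReACT": 6764, "DFSDT": 21807, "Smurfs": 9417}


def relative_cost(costs, baseline):
    """Express each method's cost as a multiple of the baseline method's cost."""
    base = costs[baseline]
    return {method: cost / base for method, cost in costs.items()}


ratios = relative_cost(avg_tokens_per_question, baseline="Smurfs")
for method, ratio in ratios.items():
    print(f"{method}: {ratio:.2f}x Smurfs")
```

On these figures, DFSDT uses roughly 2.3 times as many tokens per question as Smurfs, while ReACT uses roughly 0.7 times as many.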

Figure 7: An example prompt for task decomposition in the framework.

Figure 8: An example prompt for tool check in the framework.

Figure 9: An example prompt for tool check in the framework.

Figure 10: An example prompt for action generation in the framework.

Figure 11: An example prompt for action input generation in the framework.

Figure 12: An example prompt for Answer Agent in the framework.

Figure 13: An example prompt for Verifier Agent in the framework.