CodeNav: Beyond tool-use to using real-world codebases with LLM agents

Tanmay Gupta*  Luca Weihs*  Aniruddha Kembhavi
PRIOR @ Allen Institute for AI
https://codenav.allenai.org
* equal contribution
Abstract

We present CodeNav, an LLM agent that navigates and leverages previously unseen code repositories to solve user queries. In contrast to tool-use LLM agents that require "registration" of all relevant tools via manual descriptions within the LLM context, CodeNav automatically indexes and searches over code blocks in the target codebase, finds relevant code snippets, imports them, and uses them to iteratively generate a solution with execution feedback. To highlight the core-capabilities of CodeNav, we first showcase three case studies where we use CodeNav for solving complex user queries using three diverse codebases. Next, on three benchmarks, we quantitatively compare the effectiveness of code-use (which only has access to the target codebase) to tool-use (which has privileged access to all tool names and descriptions). Finally, we study the effect of varying kinds of tool and library descriptions on code-use performance, as well as investigate the advantage of the agent seeing source code as opposed to natural descriptions of code. All code will be made open source under a permissive license.

1 Introduction

Today, tool-use is the de facto approach for enabling LLMs to interact with external systems or programs to complete domain-specific tasks (Gupta and Kembhavi, 2023; Sur’is et al., 2023; Wu et al., 2023; Yang et al., 2023; Wang et al., 2023). In this tool-use paradigm, a list of functions, or more generally code snippets, are first “registered” with the LLM by adding their descriptions, potentially with examples of their usage, to the LLM’s context. Then, given a user query, the LLM generates invocations of these tools, executed in an external environment, to solve the user’s task.

While tool-use enables new capabilities, it also limits the expressiveness of LLMs by constraining them to invoke only a handful of meticulously described functions or API calls. With LLMs becoming increasingly capable at code understanding and generation (Wang et al., 2024; Chen et al., 2021a; GitHub, 2024; Jimenez et al., 2024), we argue that it is time to move beyond tool-use to code-use, i.e. move from meticulously designed and registered tools to a setting where the LLM agent directly reads, imports, and uses the source code of any given repository111We use the terms codebase, library, and repository interchangeably. to solve a user’s query.

An effective code-use agent must be able to identify and use the right code snippets (functions, classes, constants, etc.) from the codebase to solve the given query. This free-form search over code is made possible due to a simple observation: well-designed libraries are written by humans for humans. These libraries often make crucial domain-specific assumptions, use meaningful abstractions and variable names, and organize code in files and directories so that it is easy to discover relevant functionality and document concisely. Instead of describing every single function or class in every file as done in tool-use, code-use leverages this structure to search for the required snippets. Code-use may further benefit from a high-level library description that exposes the structure inherent in the codebase. This may be done in various ways: e.g., highlighting important directories, files, or entry points, describing the contents and purpose of complex files, or elucidating the library’s abstractions and assumptions. This library description often already exists in the form of a README file, can be manually provided, or generated automatically via rule based parsing or an LLM.

We propose CodeNav, a single-agent, multi-environment, interaction framework (see Fig. 1) where an agent Navigates through a given Codebase to find the code snippets it needs to solve a users’ query. Given a user query and high-level library description, CodeNav iterates between searching the codebase for useful code snippets and generating part of the solution code that imports, instantiates, or calls the relevant variables, classes, or functions defined in the retrieved results. In this iterative process, the agent inspects execution results, fixes errors (if any), searches to resolve ambiguities, and gradually builds a solution to the query. The multi-environment setup makes it easy to extend the system’s capability by adding a new environment (e.g., a \mintinlinepythonterminal) along with its supported actions (e.g., \mintinlinepythonbash commands) and its responses to those actions (e.g., \mintinlinepythonSTDOUT).

For evaluating CodeNav, we begin by quantifying the gap between tool-use and code-use on three existing tool-use benchmarks: m&m’s (Ma et al., 2024), M3ToolEval (Wang et al., 2024), and API-Bank (Li et al., 2023b). To adapt these benchmarks for code-use, we provide the files containing the tool implementations as the target codebase. Tool-use forms an upper bound for code-use as the tool-use agent is privy to comprehensive, hand-crafted information that is available for these benchmarks. Surprisingly, despite not having access to this privileged information, we find that CodeNav is competitive with tool-use across these evaluations. While these results are impressive, they fall short of demonstrating the full potential of code-use. To this end, we present three qualitative case studies using distinct codebases. In these case studies, we find that CodeNav can follow complex multi-step queries, recover from execution errors, run iterative searches to better understand usage of code snippets, and format results as per user instructions to visualize computations (e.g., as a webpage).

We argue that code-use is more than just a niche application of generating code conditioned on another repository. We envision a future where code-use is how LLMs discover and use domain-specific tools without imposing any constraints on how the tools are developed or written (e.g., as a list of simplistic functions) and without any interventions (e.g., manual description and registration of tools) required for LLMs to use them. Our results are suggestive of a future where a codebase is all you need! We highlight six key contributions. (1) We introduce a novel code-use paradigm for LLM agents to move beyond tool-use to directly using real-world code bases to solve complex user queries. (2) We propose CodeNav that formulates code-use as a multi-step interaction between a single LLM agent and stateful retrieval, and code execution environments. (3) On three tool-use benchmarks (m&m’s, M3ToolEval, and API-Bank), we find a minimal gap from code-use to the tool-use upper bound without requiring arduous tool registration. (4) We study the effect of library or tool description richness on code-use performance. (5) We investigate the advantage of having access to the source code as part of retrieval result as opposed to just function signatures or docstrings. (6) We present three case studies to demonstrate the promise of code-use agents on solving complex queries using real-world codebases. Our code will be made open source under a permissive license.

Refer to caption
Figure 1: CodeNav’s single-agent, multi-environment interaction protocol. Given a user query, a brief description of the codebase (library description), and the interaction history, the LLM agent produces an \mintinlinepythonaction comprising of a thought, action type, and action content. The action gets executed in the target environment (identified by action type) to produce a \mintinlinepythonresponse. The interaction at the current step consisting of the \mintinlinepythonaction-\mintinlinepythonresponse pair is appended to the interaction history as context for the LLM to produce the next action.

2 Related Work

Tool-use with LLMs. Tool-use refers to an LLM identifying and invoking appropriate tools or functions to solve a task described by the user. Training-free methods for tool-use require first "registering" tools with the LLM. Tool registration techniques include descriptive registration by describing the tools (e.g., signature and docstrings of the class or function that implements a tool) (Hsieh et al., 2023; Sur’is et al., 2023) or prescriptive registration by providing in-context examples of similar task descriptions and corresponding tool invocations (Gupta and Kembhavi, 2023). Training-based methods involve finetuning an LLM on instruction and tool invocation pairs (Qin et al., 2024). In this work, we focus on training-free methods for tool-use and code-use.

Retrieval for tool-use. Scaling training-free methods to a large number of tools is challenging due to limited (albeit steadily increasing) context windows of LLMs. An approach to circumvent this is retrieving only the relevant tool descriptions or usage examples from a library. For instance, Hsieh et al. (2023) employ TF-IDF search to retrieve relevant tool documentation. DocPrompting (Zhou et al., 2023) explores sparse and dense documentation retrieval for code generation. ART (Paranjape et al., 2023) use cross-task demonstration retrieval for multi-step reasoning and tool-use. EcoAssistant (Zhang et al., 2023) saves past successful solutions to user queries in a database and for new queries retrieves solutions to similar queries in the database as demonstrations. Unlike previous works that retrieve documentation, our CodeNav agent directly retrieves source code using boolean Elasticsearch queries (e.g. (type: CLASS) AND (text: ObjectDetection)).

Prompting strategies for tool-use. For user queries that require multi-step tool-use solutions, the agent has to simultaneously exercise the ability to plan (which tools to use and when) as well as invoke the tools correctly. To improve planning, prompting strategies like Chain of Thought (Wei et al., 2022) prompt the LLM to precede the actual solution with a thought that describes the step-by-step plan. ReAct (Yao et al., 2023) interleaves thought, action (tool invocation), and observation (result of executing the action or feedback). CodeAct (Wang et al., 2024) generalizes ReAct by allowing actions to be free-form code. In contrast to CodeAct, CodeNav uses a single-agent, multi-environment framework in which the agent can both search for and execute code. Further, CodeAct operates in the tool-use regime where the exact tools that are needed for the query are registered ahead of time.

Code generation. While not all tool-use methods require the LLM to generate executable code (Khot et al., 2022) (instead generating function names and arguments and using a custom interpreter), tool-use may be implemented as free-form code generation (Wang et al., 2024; Ma et al., 2024). Code generation with LLMs has been explored for a wide range of applications including code-completion in code editors (GitHub, 2024), generating functions from docstrings (Chen et al., 2021b), editing files in a repository to fix Github issues (Jimenez et al., 2024), and even generating an entire repository consisting of multiple files from a single natural language instruction (Osika, 2023).

Feedback and correction. As tasks get more complex, a single LLM call may only produce a partial or partially correct solution. For such tasks, agentic workflows enable a closed-loop system (Wu et al., 2023; Wang et al., 2023) where the LLM agent iteratively produces an intermediate solution, receives feedback on the solution, and proceeds to either fix any errors or generate the next step. In context of code generation and tool-use, the feedback may consist of execution output such as variable values and exceptions raised, output of test cases (Huang et al., 2023), output of static analysis tools like linting and type-checking, and human or LLM feedback (Madaan et al., 2023). Recent works have also explored emulating execution feedback using an LLM (Li et al., 2023a; Ni et al., 2024).

3 CodeNav

3.1 Overview

We formulate code-use with LLMs in a single-agent, multi-environment interaction framework consisting of stateful code retrieval and code execution environments (see Fig. 1). The agent is given a brief high-level library description (e.g. important directory or file paths, description of content and purpose of complex files, crucial abstractions and assumptions made by the library, etc.) and the user query to be solved. The agent then proceeds to interact with these environments over multiple rounds. Each interaction consists of an \mintinlinepythonaction from the agent and a \mintinlinepythonresponse from an environment. Each \mintinlinepythonaction consists of a (i) thought used for chain-of-thought reasoning (Wei et al., 2022; Yao et al., 2023), (ii) an action type specifying the environment the agent wishes to act upon (e.g. code, search, etc.), and (iii) the action content which is executed in the selected environment. The action content gets routed to the appropriate environment based on the action type and the environment executes the content updating its state and producing a \mintinlinepythonresponse. The history of past interactions is provided to the agent as context to generate the next action. The interactions continue up to a maximum number of interactions specified by the user or until the agent takes the \mintinlinepythondone action. We now describe details of environments, actions, and responses.

3.2 Environments

For code-use, the agent must be able to perform two essential functions: (i) search for or discover relevant code snippets in the target codebase; (ii) generate the next portion of the code solution that imports, instantiates, or calls the needed classes or functions. In CodeNav, the agent performs these functions by taking actions in one of the following environments.

Retrieval Environment. This environment serves as an interface to a search index for fetching code snippets from the target codebase with rule-based re-ranking and a persistent memory to avoid resurfacing past retrievals. We parse the entire codebase and index all functions, classes, import statements, assignments, etc. as individual documents. We implement the index using \mintinlinepythonElasticsearch (\mintinlinepythonES) with fields for code string, code type (function, class, assignment, import etc.), file path, and line numbers (Elastic, 2024). The agent can use a \mintinlinepythonsearch action type to issue search queries into this index. Given the action, the environment first uses the action content as the search query to the \mintinlinepythonES index, discards any matches that have already been retrieved by past searches during the episode, and then re-ranks the results based on heuristic rules (e.g., to prioritize functions and classes and de-prioritize assignments and import statements). Finally, the top-k𝑘kitalic_k results are added to the environments persistent memory and returned as the environment reponse. Issuing the same action again surfaces the k𝑘kitalic_k next-best matches.

Execution Environment. Actions of type \mintinlinepythoncode get routed to a \mintinlinepythonPython execution environment.222In this work we consider only Python code generation for simplicity but here is no limitation on CodeNav preventing extension to other languages. At the beginning of the episode, the environment is initialized with an empty \mintinlinepythonglobal_variables dictionary. Each code block is executed in scope of these global variables (using \mintinlinepythonexec(code_str, global_vars)) and any changes to the global namespace (i.e., modification or deletion of existing variables, or creation of new variables) are reflected in this dictionary. Prior to execution, the environment optionally performs linting (using \mintinlinepythonflake8), type-checking (using \mintinlinepythonmypy), and formatting (using \mintinlinepythonblack) of the code block. Standard output (\mintinlinepythonstdout), updated variables in \mintinlinepythonglobal_vars (new variables or variables whose string representations have changed), and errors if any (execution, linting, or type-checking) are returned as part of the response.

Done and Code Summary Environments. Additionally, we create a few helper environments. The Done Environment returns a \mintinlinepythonnull response to the \mintinlinepythondone action that marks the end of the episode. The Code Summary Environment treats the content of the \mintinlinepythoncode_summary action as a cleaned up summary of the code solution produced by the agent during the episode and saves it for easy access.

3.3 Agent Actions

The CodeNav agent interacts with the environments using an \mintinlinepythonaction that consists of 3 components: thought, type, and content, see Fig. 1 (right). The LLM underlying the agent is prompted to only produce outputs in a structured XML format and in compliance with a set of rules (e.g., thought and type must always be provided; type must be one of the available action types; etc.). Before routing the \mintinlinepythonaction to the target environment, we check its validity. If the \mintinlinepythonaction is found to be invalid, an \mintinlinepythonInvalidAction response is returned to the agent containing a description of the violated rule. The agent may use this violation description to fix the \mintinlinepythonaction in the next step.

Refer to caption
Figure 2: Running the CodeNav agent with the CodeNav codebase.

3.4 Environment Responses

Each action elicits a response from the target environment or an \mintinlinepythonInvalidAction response if the action is invalid for execution in the target environment. Each response in CodeNav is implemented as a data class with a \mintinlinepythonformat() method that specifies how the response data should be serialized to a text string to be included in the agent’s context when predicting the next action.

Retrieval Response. Given a search query (i.e., the content of a \mintinlinepythonsearch action), the retrieval environment returns a list of documents (containing code blocks from the target codebase along with metadata) from the search index that match the search queries. Implementation of the \mintinlinepythonformat() method of the retrieval response answers the question: what should be shown to the agent from these retrieved code blocks? On one hand, the agent may gain a better understanding of how to use a class or a function by reading its source code (containing function signatures, argument types, outputs, and implementation details) as opposed to reading an imprecise, incomplete, outdated, or entirely absent human written description or \mintinlinepythondocstring of the class or function. On the other hand, showing all implementation details for every retrieved code block results in an explosion in the number of context tokens to be processed by the LLM. To strike a balance, we first retrieve up to M𝑀Mitalic_M (=100absent100=100= 100) matched documents. From these, we show the top-K𝐾Kitalic_K (=3absent3=3= 3) matches with source code and metadata. For large codebases, this is usually insufficient to surface target code blocks unless the exact function or class name is given in the search query. Therefore, we additionally show prototypes (signatures and filenames) for up to P𝑃Pitalic_P classes or functions in the remaining retrieved results. Further, for the top-K matches, we use GPT-4 to generate \mintinlinepythondocstrings for the top-3 retrievals and show whichever is shorter between the source code and the function signature with the generated docstring.

Execution Response. Given a \mintinlinepythoncode action, the Python execution environment executes the content producing a response. The \mintinlinepythonformat() method of a response serializes standard output (\mintinlinepythonstdout), variables changed during execution shown as variable names along with string representation of their values, execution errors if any, and (optionally) linting, type-checking, and formatting errors. The error messages contain reference to the line in the code string that produced the error to help localize the error. The \mintinlinepythonstdout and changed variables allow easy inspection of function calls but can get quite long (e.g., when printing a large array) and are therefore truncated to a maximum number of characters. We show the start and end of \mintinlinepythonstdout and the beginning of variable values.

4 Case Studies

While we quantitatively evaluate CodeNav on tool-use benchmarks in Sec. 5, these benchmarks are not sufficiently complex to highlight the advantages of code-use over tool-use. Therefore, we showcase CodeNav’s impressive capabilities in three case studies using diverse codebases to solve complex queries. For the first case study (Sec. 4.1), Fig. 2 depicts the entire episode. For the other two case studies, we show the inputs and outputs in Fig. 3 and provide the full episodes as part of the supplementary material. For all case studies we provide library descriptions in the appendix (App. C). Please also see App. Table 6 for information about the size and complexity for the codebases used in these case studies and our quantitative experiments (e.g., searching the transformers library requires searching over 50,508 snippets in 3,475 files).

4.1 CodeNav on CodeNav

We imagine a researcher who, possibly after reading this paper, wishes to use CodeNav to answer a query using the transformers library (Wolf et al., 2020). In place of a researcher however, we use a CodeNav agent; i.e., a CodeNav agent uses the CodeNav repository to instantiate another CodeNav agent to answer a given query with transformers. This example serves two goals: (1) it provides a pedagogical example of using our codebase, and (2) it shows CodeNav’s zero-shot abilities as we can guarantee that the underlying LLM (GPT-4) was not trained on our codebase.

For this case-study, our user query consists of 7 steps divided into 2 distinct parts (see User query in Fig. 2). Steps 1-4 specify instructions for creating and running the episode while Steps 5-7 contain instructions to visualize the results of the interaction. In Steps 1-4, the user asks CodeNav to first create an agent using \mintinlinepythonOpenAICodenavAgent and to instantiate various environments using \mintinlinepythonPythonCodeEnv, \mintinlinepythonRetrievalEnv, and \mintinlinepythonDoneEnv with the specified parameters like the Elasticsearch host and index name to use for retrieval. Then the query asks the agent to create an episode for solving another query (within the original query!) using \mintinlinepythontransformers. This “subquery” requires the agent to detect dogs in an image (specified by a file path) using the \mintinlinepythonfacebook/detr-resnet-101 model in the object detection pipeline, add red detection boxes on the image, and store the image in variable \mintinlinepythondetected_dogs. The subquery also asks agent to store the detection coordinates and scores as a pandas dataframe in the variable \mintinlinepythondetection_coords_and_scores. The first part of the full query ends with asking the agent to run the episode for a maximum of 10 steps Steps 5-7 specify how to visualize the interaction. Specifically, we ask the agent to: (i) tabulate the interaction as a dataframe with columns for action type and thought; (ii) save the \mintinlinepythondetected_dogs image as a PNG at a specified file path; and (iii) print \mintinlinepythondetection_coords_and_scores.

The CodeNav episode for the above can be found in Fig. 2. The agent initially searches for information about the \mintinlinepythonOpenAICodenavAgent class and the environments (A1, R1, A2, R2), and then attempts to instantiate them with code (A3). This code results in an error due to a misuse of the \mintinlinepythonRetrievalEnv initializer (R3). The agent then searches for additional information to resolve this error (A4, R4) and eventually succeeds in running a new CodeNav agent using transformers (A9, R9). Finally it prints and saves the requested outputs.

Refer to caption
Figure 3: CodeNav unifies "agentic" applications via code-use. Two case studies: (top) a visual reasoning and image editing agent; and (bottom) an information gathering agent. These applications are enabled simply by changing the target codebase search index and the high-level library description.

4.2 Multimodal Processing and Reasoning

In our proposed code-use formulation, multimodal tasks are identical to text-only tasks so long as multimodal processing functionality is available in the target codebase. We demonstrate this with an image editing task requiring visual reasoning to localize the region to edit. We use the m&m’s codebase as it contains computer vision tools for detection, segmentation, QA, etc. In this case-study, see Fig. 3 (top), the agent is required to find and highlight the person wearing glasses who is talking on the phone. Our query first specifies steps to localize the region to edit. In particular, the agent is instructed to first segment the image and select the person segments. Then for each person, the agent should zoom in on the face by taking a crop of the top-third of the person bounding box. For each face, the agent must use visual question answering to verify whether the person is wearing glasses and is talking on cell phone. To visualize these predictions, the agent is instructed to save these face crops along with the predictions as an HTML table. Finally, the agent is instructed to highlight the person for which both attributes are true via a color-pop effect. We observe that CodeNav not only uses multimodal models proficiently but also uses the outputs to perform visual reasoning. Further, this case study highlights CodeNav’s ability to produce human-interpretable intermediate outputs.

4.3 Research assistant

Imagine an agent that curates reading material on a given topic including news articles, blog posts, and research papers and then emails it to you. For CodeNav, this is simply a matter of using a codebase that provides the necessary functionality for querying knowledge sources on the web and sending emails. One such codebase is PhiData (phi, 2024). In this case-study (see Fig. 3, bottom) we query the agent to curate reading material on "Alphafold-3" consisting of definition from Wikipedia, a list of news articles, and a list of papers on "Protein folding with Deep Learning". Further, our query specifies various presentation requirements: e.g., for each news article, we require the agent to show the article source, title, link, and first 140 characters. Finally, we ask the agent to write this information to an HTML and markdown-formatted documents. The text document’s content is then sent to the user’s email address with the specified subject. This case study demonstrates the versatility of CodeNav to create specialized agents simply by providing an appropriate codebase. Here CodeNav serves as a research assistant by using functionality built in PhiData to query Wikipedia, DuckDuckGo News, and arXiv.

5 Experiments

We now present our quantitative results on 3 tool-use benchmarks: (1) m&m’s, which requires multi-step planning with 33 multi-modal, e.g. vision/language, tools (Ma et al., 2024); (2) API-Bank that involves managing user state in sandboxed environment via calls to any of 73 APIs (Li et al., 2023b); and (3) M3ToolEval that contains “82 human-curated tasks” requiring multiple tools, calls, and interactions (Wang et al., 2024). In all experiments we report the mean performance across 3 independent evaluations; the reported ±plus-or-minus\pm± values correspond to 2×2{\times}2 × the standard dev. across evaluations. As these benchmarks were not necessarily designed for evaluation with code-use agent in mind (e.g., API-Bank uses JSON API calls), this has necessitated some changes in how we evaluate CodeNav on these benchmarks. See App. A for these details as well as descriptions of our metrics.

5.1 How does code-use compare to tool-use on tool-use benchmarks?

Tool-use benchmarks are designed to test LLMs’ ability to invoke a small set of pre-registered tools. Since tools in these benchmarks are relatively simple function calls with human written descriptions, code-use is upper bounded by tool-use on these benchmarks. We wish to quantify the gap between code-use (where source code search is necessary) and tool-use (where no search is needed as tool names/descriptions are provided). Tab. 1 shows that on m&m’s and API-Bank, code-use achieves similar or slightly lower tool-f1. On both benchmarks, code-use takes a minor hit on tool-recall. This is intuitive as tool-use is provided tool names while code-use has the harder task of searching to discover available tools. On M3ToolEval, which evaluates final answer correctness, code-use is within two points of tool-use. In all datasets on average, code-use takes similar-to{\sim}2 more interaction steps compared to tool-use to search for required tools. Finally, code-use results in only a minor increase in performance variance despite the added uncertainly due to lack of knowledge of available tools.

Table 1: Code-use is competitive with tool-use even without tool prompts.
m&m’s M3ToolEval API-Bank
method precision recall f1 steps accuracy steps precision recall f1 steps
tool-use 82.9 ± 4.5 81.7 ± 0.4 79.6 ± 2.3 4.9 ± 0.1 83.7 ± 2.8 6.6 ± 0.5 86.6 ± 0.8 93.6 ± 1.1 88.5 ± 0.7 3.4 ± 0.1
code-use 88.0 ± 6.1 78.2 ± 4.5 80.6 ± 5.1 7.2 ± 0.2 81.7 ± 4.9 7.8 ± 0.4 84.0 ± 0.7 89.3 ± 0.6 85.3 ± 0.3 5.3 ± 0.0

5.2 Is a library description sufficient for tool-use?

Table 2: Desc. ablations on m&m’s.
tool description length f1 steps
w/o desc 0 74.1 ± 1.9 7.0 ± 0.1
tool names 694 78.1 ± 3.5 6.7 ± 0.2
   + desc 3680 80.8 ± 0.4 6.9 ± 0.1
    + prototypes 4627 80.7 ± 5.0 6.1 ± 0.1
library desc (CodeNav) 2061 80.6 ± 5.1 7.2 ± 0.2

Instead of meticulously listing each tool or function name along with detailed descriptions of its purpose and input and output arguments, CodeNav allows a user to only provide a high-level description of the codebase or library that implements these tools. In Tab. 2, we compare library description (Fig. 6 in appendix) with CodeNav without any description as well as tool descriptions with three "levels of detail"; the lowest level contains only the tool names while the richest contains tool names, descriptions, and function signatures (Fig. 10 in appendix). As providing tool details in the prompt helps the agent identify the tools needed to solve a query and use them correctly, we find a consistent increase in tool-f1 with increasing tool detail. Spectacularly, library description achieves similar performance as the richest tool description with less than half the description length. The convenience of not requiring description of each tool comes at the cost of minor increase in number of steps since now the agent needs to search and discover the tool as opposed to recalling from context.

5.3 Does seeing the source code help code-use?

Table 3: Search response formatting ablation on m&m’s.
retrieval response f1 steps
prototypes 80.0 ± 5.2 7.8 ± 0.1
code 81.2 ± 2.3 7.3 ± 0.2
code or docstring (CodeNav) 80.6 ± 5.1 7.2 ± 0.2

Well written code is its own documentation. Therefore, for an agent that is proficient in code understanding, seeing the actual implementation details in the source code (which might also include docstrings) alleviates the need to manually register tools and provides strictly more information than just the function signatures or prototypes. We compare various retrieval response formats in Tab. 3 using the m&m’s benchmark. We observe that returning code for the top-3 matches results in higher tool-f1 than showing prototypes only in the retrieval response. We also see a reduction in number of interaction steps needed by the agent to solve the query which has been a consistent indicator of lower uncertainty in deciding which tools to use and how. Finally, real-world code blocks can span 100s of lines which quickly increases the context length to be processed by the agent. To remedy this we generate docstrings for the top-3 retrievals using GPT-4 and between the docstring and the raw code we show whichever is shorter. As expected, this default configuration in CodeNav achieves tool-f1 higher than prototypes only but lower than code.

5.4 What makes a good library description?

Table 4: Comparing two library descriptions on M3ToolEval. To provide a reference for length, desc. length for tool-use for M3ToolEval is 5K.
library description length accuracy steps
file path 253 76.8 ± 6.5 8.3 ± 0.3
file path + file desc 641 81.7 ± 4.9 7.8 ± 0.4

We demonstrate the impact of library description in Tab. 4. M3ToolEval is implemented as a codebase consisting of 5 files, each containing tools dedicated to a problem domain; web browsing, travel planning, dna sequencing, message encryption, and financial calculations. The first library description simply provides relative file paths to these files (e.g., m3eval/travel_planner.py), while the second description also includes a one line summary of what the file contains (e.g., “functions for planning travel including finding flights, making hotel reservation, and budget calculations”). Intuitively, this enables CodeNav to come up with keywords for search and results in superior performance with fewer interactions needed to reach a solution. We provide the following three recommendations to write good library descriptions: (i) provide context for the target domain to enable the LLM to then use its knowledge of the domain to generate useful keywords for search; (ii) describe library structure (e.g., directory structure or key assumptions the library makes); (iii) provide a brief (not necessarily exhaustive) natural language description of the available functionality.

5.5 Impact of LLM choice on performance

Table 5: LLM choice ablation on m&m’s.
LLM precision recall f1 steps
gpt-4-1106-preview 88.0 ± 6.1 78.2 ± 4.5 80.6 ± 5.1 7.2 ± 0.2
gpt-3.5-turbo-0125 54.36 ±  2.3 15.77 ±  1.4 22.96 ± 0.8 9.08 ± 1.07
Mixtral-8x22B-Instruct-v0.1 82.50 ±  2.1 62.31 ±  1.9 67.91 ± 1.3 9.06 ± 0.3
Qwen1.5-110B-Chat 78.15 ±  3.1 38.49 ±  5.5 48.84 ± 5.1 10.00 ± 0.2

We have used GPT-4 (OpenAI, 2023), in particular gpt-4-1106-preview,333https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4 as the LLM underlying CodeNav. As GPT-4 is one of the most performant publicly available LLMs, it is natural to ask how CodeNav performance degrades when using smaller, possibly open source, LLMs. We run m&m’s evaluations using CodeNav when replacing GPT-4 with GPT-3.5, the Mistral 8×\times×22B mixture-of-experts model (Jiang et al., 2023, 2024), and the Qwen1.5 110B model (Bai et al., 2023).444Model completions obtained via the together.ai API, see https://docs.together.ai. We chose to use the Mistral and Qwen models as they represent some of the largest, open-source, LLMs with long context windows (empirically, LLMs with context sizes below 16k tokens regularly fail by exceeding their context limit). Our results are displayed in Table 5. While the GPT-4 powered CodeNav outperforms, the open source models do perform quite well falling behind primarily in their recall (suggesting search failures). Surprisingly the GPT-3.5 variant performs poorly; when inspecting the failed trajectories this appears to often be caused by the agent failing to appropriately summarize its actions at the end of the episode which may ameliorated by additional prompt tuning.

6 Discussion

We have argued that it is time to move from tool-use to code-use; from feeding LLM agents manually curated and meticuluously descibed tool sets, to instead presenting them with existing codebases written by humans for humans. As we have shown in our case-studies and quantitative results, simply remarkable behavior can be obtained by code-use agents when using modern LLMs so long as considerable care is taken to engineer code search and execution environments that provide the agent with significant flexibility and feedback.

References

  • phi [2024] phidata, 2024. URL https://github.com/phidatahq/phidata.
  • Bai et al. [2023] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, B. Hui, L. Ji, M. Li, J. Lin, R. Lin, D. Liu, G. Liu, C. Lu, K. Lu, J. Ma, R. Men, X. Ren, X. Ren, C. Tan, S. Tan, J. Tu, P. Wang, S. Wang, W. Wang, S. Wu, B. Xu, J. Xu, A. Yang, H. Yang, J. Yang, S. Yang, Y. Yao, B. Yu, H. Yuan, Z. Yuan, J. Zhang, X. Zhang, Y. Zhang, Z. Zhang, C. Zhou, J. Zhou, X. Zhou, and T. Zhu. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
  • Chen et al. [2021a] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021a. URL https://arxiv.longhoe.net/abs/2107.03374.
  • Chen et al. [2021b] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. Ponde, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. W. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, I. Babuschkin, S. Balaji, S. Jain, A. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba. Evaluating large language models trained on code. ArXiv, abs/2107.03374, 2021b. URL https://api.semanticscholar.org/CorpusID:235755472.
  • Elastic [2024] Elastic. Elasticsearch, 2024. URL https://www.elastic.co/elasticsearch/. Version 8.12.1.
  • GitHub [2024] GitHub. GitHub Copilot - Your AI pair programmer, 2024. URL https://github.com/features/copilot. Accessed: 2024-05-16.
  • Gupta and Kembhavi [2023] T. Gupta and A. Kembhavi. Visual programming: Compositional visual reasoning without training. CVPR, pages 14953–14962, 2023. URL https://api.semanticscholar.org/CorpusID:253734854.
  • Hsieh et al. [2023] C.-Y. Hsieh, S. Chen, C.-L. Li, Y. Fujii, A. J. Ratner, C.-Y. Lee, R. Krishna, and T. Pfister. Tool documentation enables zero-shot tool-usage with large language models. ArXiv, abs/2308.00675, 2023. URL https://api.semanticscholar.org/CorpusID:260351459.
  • Huang et al. [2023] D. Huang, Q. Bu, J. M. Zhang, M. Luck, and H. Cui. Agentcoder: Multi-agent-based code generation with iterative testing and optimisation. ArXiv, abs/2312.13010, 2023. URL https://api.semanticscholar.org/CorpusID:266374622.
  • Jiang et al. [2023] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de Las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mistral 7b. CoRR, abs/2310.06825, 2023. doi: 10.48550/ARXIV.2310.06825. URL https://doi.org/10.48550/arXiv.2310.06825.
  • Jiang et al. [2024] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de Las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. L. Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mixtral of experts. CoRR, abs/2401.04088, 2024. doi: 10.48550/ARXIV.2401.04088. URL https://doi.org/10.48550/arXiv.2401.04088.
  • Jimenez et al. [2024] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan. SWE-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66.
  • Khot et al. [2022] T. Khot, H. Trivedi, M. Finlayson, Y. Fu, K. Richardson, P. Clark, and A. Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. ArXiv, abs/2210.02406, 2022. URL https://api.semanticscholar.org/CorpusID:252715485.
  • Li et al. [2023a] C. Li, J. Liang, A. Zeng, X. Chen, K. Hausman, D. Sadigh, S. Levine, F.-F. Li, F. Xia, and B. Ichter. Chain of code: Reasoning with a language model-augmented code emulator. ArXiv, abs/2312.04474, 2023a. URL https://api.semanticscholar.org/CorpusID:266051661.
  • Li et al. [2023b] M. Li, Y. Zhao, B. Yu, F. Song, H. Li, H. Yu, Z. Li, F. Huang, and Y. Li. Api-bank: A comprehensive benchmark for tool-augmented llms. In H. Bouamor, J. Pino, and K. Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 3102–3116. Association for Computational Linguistics, 2023b. doi: 10.18653/V1/2023.EMNLP-MAIN.187. URL https://doi.org/10.18653/v1/2023.emnlp-main.187.
  • Ma et al. [2024] Z. Ma, W. Huang, J. Zhang, T. Gupta, and R. Krishna. m&m’s: A benchmark to evaluate tool-use for multi-step multi-modal tasks. ArXiv, abs/2403.11085, 2024. URL https://api.semanticscholar.org/CorpusID:268512938.
  • Madaan et al. [2023] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, S. Welleck, B. P. Majumder, S. Gupta, A. Yazdanbakhsh, and P. Clark. Self-refine: Iterative refinement with self-feedback. ArXiv, abs/2303.17651, 2023. URL https://api.semanticscholar.org/CorpusID:257900871.
  • Ni et al. [2024] A. Ni, M. Allamanis, A. Cohan, Y. Deng, K. Shi, C. Sutton, and P. Yin. Next: Teaching large language models to reason about code execution. 2024. URL https://api.semanticscholar.org/CorpusID:269302914.
  • OpenAI [2023] OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023. doi: 10.48550/ARXIV.2303.08774. URL https://doi.org/10.48550/arXiv.2303.08774.
  • Osika [2023] A. Osika. Gpt-engineer, 2023. URL https://github.com/gpt-engineer-org/gpt-engineer.
  • Paranjape et al. [2023] B. Paranjape, S. M. Lundberg, S. Singh, H. Hajishirzi, L. Zettlemoyer, and M. T. Ribeiro. ART: automatic multi-step reasoning and tool-use for large language models. CoRR, abs/2303.09014, 2023. doi: 10.48550/ARXIV.2303.09014. URL https://doi.org/10.48550/arXiv.2303.09014.
  • Qin et al. [2024] Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y.-T. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, S. Zhao, R. Tian, R. Xie, J. Zhou, M. H. Gerstein, D. Li, Z. Liu, and M. Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis. ICLR, abs/2307.16789, 2024. URL https://api.semanticscholar.org/CorpusID:260334759.
  • Sur’is et al. [2023] D. Sur’is, S. Menon, and C. Vondrick. Vipergpt: Visual inference via python execution for reasoning. ICCV, pages 11854–11864, 2023. URL https://api.semanticscholar.org/CorpusID:257505358.
  • Wang et al. [2023] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. J. Fan, and A. Anandkumar. Voyager: An open-ended embodied agent with large language models. ArXiv, abs/2305.16291, 2023. URL https://api.semanticscholar.org/CorpusID:258887849.
  • Wang et al. [2024] X. Wang, Y. Chen, L. Yuan, Y. Zhang, Y. Li, H. Peng, and H. Ji. Executable code actions elicit better LLM agents. CoRR, abs/2402.01030, 2024. doi: 10.48550/ARXIV.2402.01030. URL https://doi.org/10.48550/arXiv.2402.01030.
  • Wei et al. [2022] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html.
  • Wolf et al. [2020] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, Oct. 2020. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2020.emnlp-demos.6.
  • Wu et al. [2023] Q. Wu, G. Bansal, J. Zhang, Y. Wu, S. Zhang, E. Zhu, B. Li, L. Jiang, X. Zhang, and C. Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. ArXiv, abs/2308.08155, 2023. URL https://api.semanticscholar.org/CorpusID:260925901.
  • Yang et al. [2023] Z. Yang, L. Li, J. Wang, K. Lin, E. Azarnasab, F. Ahmed, Z. Liu, C. Liu, M. Zeng, and L. Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. ArXiv, abs/2303.11381, 2023. URL https://api.semanticscholar.org/CorpusID:257637012.
  • Yao et al. [2023] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=WE_vluYUL-X.
  • Zhang et al. [2023] J. Zhang, R. Krishna, A. H. Awadallah, and C. Wang. Ecoassistant: Using llm assistant more affordably and accurately. ArXiv, abs/2310.03046, 2023. URL https://api.semanticscholar.org/CorpusID:263671677.
  • Zhou et al. [2023] S. Zhou, U. Alon, F. F. Xu, Z. Wang, Z. Jiang, and G. Neubig. Docprompting: Generating code by retrieving the docs. In ICLR, Kigali, Rwanda, May 2023. URL https://arxiv.longhoe.net/abs/2207.05987.

Appendices

This appendices contain the following:

  • Experiment details for m&m’s, M3ToolEval, and API-Bank (App. A)

  • Compute requirements (App. B)

  • Library and tool descriptions (App. C)

  • A full retrieval response that shows code, automatically generated summary docstrings, and prototypes (App. D)

  • Limitations of CodeNav (App. E)

  • Societal impact (App. F)

Please see our project website, https://codenav.allenai.org/, for:

  • Full episode trajectories for the 3 case studies.

  • An example run of CodeNav on m&m’s. This contains an HTML file that can be opened in a browser to view the programs generated by CodeNav along with ground truth programs and links to trajectories.

  • Similarly, an example runs of CodeNav on M3ToolEval with HTML visualizations.

  • A side-by-side comparison of library and tool descriptions for the m&m’s and M3ToolEval codebases.

Table 6: Codebase statistics.
Codebase files snippets lines characters functions classes
m&m’s 2 58 971 34277 39 0
M3ToolEval 8 39 485 14009 21 2
API-Bank 54 163 6144 207656 1 53
CodeNav 36 369 4055 141636 58 46
PhiData 426 2549 49731 2169989 133 398
\mintinlinepythontransformers 3475 50508 1362242 63105978 4354 11424

Appendix A Experiment details

For each of the 3 tool-use benchmarks used in CodeNav evaluation, we now provide details on how the benchmark was adapted for code-use as well as metrics used.

A.1 m&m’s

A.1.1 Adaptation for code-use

All tools in m&m’s are implemented in two files; a single python file with tool functions and a config file. Since functions in m&m’s neither contain type hints nor are annotated with inline comments or docstrings, to provide minimal context, we add the tool description provided in the original m&m’s benchmark as a single-line docstring to each function. For instance, for the function \mintinlinepythoncolor_pop, we add It takes an image and one or multiple objects, and returns an image where only the object is colored and the rest is black and white. Note that we provide no additional information about input arguments or outputs. Further, m&m’s evaluation expects a sequence of function calls as outputs, we enable a ‘code_summary‘ action in the agent to get the code solution in the desired format. To comply with the evaluation we provide the following guidelines:

When solving tasks YOU MUST RESPECT the following guidelines:
1. Do not implement any new functions. Just use the available functions.
2. Generally, try searching for function names. Only if needed, include
function argument names. Do not include argument values.
3. When you have a solution, use the code_summary action to summarize
the solution.
4. When asked to generate text, don’t generate text yourself but rather
see if there is a function to do it.
5. Tasks typically require 1 to 3 function calls.

A.1.2 Metrics

We use the macro-averaged tool-f1 metric from the original m&m’s paper (Ma et al., 2024) which is the harmonic mean of the precision and recall of the function names in reference to the ground truth program. When multiple correct ground truths are available we used the best match. Since many queries in m&m’s do not have a single correct answer, we do not use answer correctness as a metric. Similarly, we find that there are many ways of specifying the arguments while generating free-form code for m&m’s queries and hence find argument name and value based metrics unreliable for evaluating free-form code generation agents like CodeNav.

A.1.3 Data split

m&m’s consists of approximately 800 samples. Since evaluating CodeNav on the full set using GPT-4 as the LLM could cost between $150 to $200 (and we run each experiment thrice to compute error bars), we randomly sample a smaller set of 200 samples for our evaluation.

A.2 M3ToolEval

A.2.1 Adaptation for code-use

To prepare M3ToolEval for code-use, we begin by creating a codebase consisting of just the tool implementations and associated data (web page data used by their web browsing tools as well as flight, hotel, and location information used by the travel planning tools). Particularly for web browsing, while M3ToolEval registers methods of the WebBrowser class as individual tools, for code-use we let the agent use the class directly. Further, since the web browsing task uses unrealistic pages with strong assumptions about how the pages are formatted, we provide some context to the agent for the web browsing task as guidelines. Finally, tools in M3ToolEval are grouped by task (e.g. all tools for DNA sequencing are in the same file), and tasks only use tools from a single file. Therefore, to let the agent make use of this assumption, we provided file paths in library/tool description and we ask the agent to identify and specify the relevant file name in its search queries to zone in on required tools. Here are the guidelines we use:

When solving tasks YOU MUST RESPECT the following guidelines:
1. For browsing tasks use the WebBrowser object and navigate the web pages
using available methods to find what you are looking for. Sometimes the
relevant information may not be visible on the page but if you see
[Viewing page m of n] (where m < n) then you may use the scroll functions
to see more page content. If you see the information you need displayed on
the web page, feel free to use it directly without worrying about parsing
it using code. Do not write complex code. Rather, try to interact with the
browser one action at a time like clicking or scrolling.
2. You may need to identify the relevant python file and specify this target
file in your search queries to get the relevant search results

A.2.2 Metrics

We use the final answer accuracy as used in the original M3ToolEval work (Wang et al., 2024).

A.3 API-Bank

A.3.1 Adaptation for code-use

The API-Bank benchmark with a human-AI “chat” context in mind where an agent and user send messages back and forth with the user asking the agent to, possibly, perform many tasks one after another. The AI agent is then evaluated via a next-step prediction approach where the agent is fed the entire chat context up to time t𝑡titalic_t and required to predict some ground-truth chat message or JSON API call at time t+1𝑡1t+1italic_t + 1. As CodeNav was not designed to be used for back-and-forth chat with a user (neither is it meant to be evaluated on producing natural language responses) we filter all ground-truth interactions in the API-Bank level-1-given-desc set to include only those chats for which the last user message is followed by at least one message from the agent where the agent invokes an API call. After filtering, we are left with 186 (of originally 214) samples. During evaluation, we then give our CodeNav the full chat context up to and including the last user message (and also modify the sandbox to be in the state up to this point) and then evaluate CodeNav’s ability to produce all remaining API calls.

In order for CodeNav to make API calls, we require that it directly instantiate the appropriate API-Bank class and then invoke the call method on that class (e.g., the model might instantiate the AddAgenda class as aa = AddAgenda() and then run aa.call(token, content, time, location) with token, content, time, and location variables it has previously defined. In order to encourage this behavior, we include instructions of the form:

When solving tasks YOU MUST RESPECT the following guidelines:
1. When calling APIs you should instantiate the relevant class and use the
‘call‘ method defined in the class. DO NOT USE INVOKE OTHER METHODS ON
THE CLASS, YOU MUST ONLY CALL THE ‘call‘ METHOD.
2. Everything can be solved by calling APIs, do not define new APIs or
modify the existing ones.

in the library description given to the CodeNav agent.

Note that the above differs substantially from how the agents are traditionally evaluated with API-Bank where they, generally, produce JSON formatted API calls which are routed by API-Bank to the correct class and call method.

A.3.2 Metrics

As noted above, we evaluate CodeNav’s ability to produce the correct remaining API calls given some chat context. As for the m&m’s benchmark evaluation, we only evaluate CodeNav’s ability to call the correct APIs and ignore, for ease of evaluation, whether these APIs were called with the correct arguments or produced the correct results. Supposing that CodeNav called a sequence of APIs A={a1,,an}𝐴subscript𝑎1subscript𝑎𝑛A=\{a_{1},...,a_{n}\}italic_A = { italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT } and that the ground-truth set of APIs’ called was G={g1,,gm}𝐺subscript𝑔1subscript𝑔𝑚G=\{g_{1},...,g_{m}\}italic_G = { italic_g start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_g start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT }, we count the number of matches between A𝐴Aitalic_A and G𝐺Gitalic_G (counting multiplicities) and compute recall as R=(#matches)/|G|𝑅#𝑚𝑎𝑡𝑐𝑒𝑠𝐺R=(\#matches)/|G|italic_R = ( # italic_m italic_a italic_t italic_c italic_h italic_e italic_s ) / | italic_G | and precision as P=(#matches)/|A|𝑃#𝑚𝑎𝑡𝑐𝑒𝑠𝐴P=(\#matches)/|A|italic_P = ( # italic_m italic_a italic_t italic_c italic_h italic_e italic_s ) / | italic_A | (precision is taken to be 0 if |A|=0𝐴0|A|=0| italic_A | = 0). Given this precision and recall, we compute the F1 score as usual as F1=2PR/(R+P)𝐹12𝑃𝑅𝑅𝑃F1=2\cdot P\cdot R/(R+P)italic_F 1 = 2 ⋅ italic_P ⋅ italic_R / ( italic_R + italic_P ) with F1𝐹1F1italic_F 1 being set to 0 if P+R=0𝑃𝑅0P+R=0italic_P + italic_R = 0 as usual.

Appendix B Compute requirements

We run our CodeNav evaluations on Ubuntu servers each with 8 NVIDIA RTX A6000 GPUs. As mentioned previously, we do not train any models and make use of the OpenAI and together.ai APIs to perform inference using LLMs. This means we do not require (local) GPUs for LLM inference but there are many instances when CodeNav may benefit from having access to a GPU (e.g., for image inpainting). While inference time varies per benchmark, running m&m’s evaluations (200 queries) with 24 parallel processes on a single 8 GPU server takes approximately similar-to{\sim}12 minutes in wall clock time (96 minutes of GPU time).

Appendix C Library and Tool Descriptions

Here we provide the library and tool descriptions used in our case studies as well as quantitative evaluation -

  • Figures 567 shows the library descriptions used by CodeNav in the three case studies.

  • Figures 689 show the library descriptions used by CodeNav for quantiative evaluation on m&m’s, M3ToolEval, and API-Bank respectively.

  • Figures 101112 show the tool descriptions used for the tool-use baselines for m&m’s, M3ToolEval, and API-Bank respectively. Note that the tool descriptions are significantly more detailed than the corresponding library descriptions.

Appendix D Retrieval Response Example

Refer to caption
Figure 4: Full retrieval response example. An example of a full retrieval. This corresponds to an expanded version of R5 in Fig. 2.

We show a full example of a retrieval response in Fig. 2. This corresponds to an expanded version of R5 in Fig. 2 where we only showed one response for brevity. Notice: (1) the collection of class and function prototypes/signatures shown at the bottom of the retrieval response and (2) that the 3 expanded retrieved results contain a mix of full code (for the EsCodeRetriever and NumpyJSONEncoder classes) as well as automatically generated summary docstrings (for the RetrievalResult class).

Appendix E Limitations

While CodeNav is capable of some producing impressive results, we highlight three key limitations. (1) Our current implementation of CodeNav assumes that the agent produces Python code (and that the given codebase is a Python codebase). While extending to use other languages is not a significant engineering challenge, it is possible that LLMs’ performance degrades as one moves away from popular languages like Python, for which there is significant data on the web. (2) Our agent has no long-term memory that is available across queries. This means that, given the same query twice, CodeNav will make the same errors and repeat the same searches. (3) Finally, CodeNav performance is strongly dependent on the underlying LLM and performance can sharply degrade when using smaller LLMs. This means that most researchers can only use CodeNav through the use of paid APIs and larger-scale experimentation when using these APIs can be costly.

Appendix F Societal impact

While CodeNav is not unique in this respect, the growing popularity of increasingly competent LLM agents has the potential to automate or augment many skilled tasks. While automation has brought about many societal positives on aggregate, it can clearly has a profoundly negative impact on anyone whose may lose their job. The environmental impact (via the huge energy costs of running these LLMs) may also be significant.

As a more acute potential negative impact: in the code-use paradigm, the underlying agent the CodeNav agent has the ability to run arbitrary code on the user’s machine. Given this, it can be dangerous unintentionally (there is nothing stop** CodeNav from making a logic error, e.g. filtering a file list incorrectly and then accidentally deleting all files on the computer) and intentionally (e.g., a malicious party might upload tainted model weights, or intercept API calls, so as to run a “randomsome ware” scam when executed by a CodeNav agent). These dangers are somewhat easier to mitigate in the tool-use paradigm as, when constrained to use only certain tools, there are fundamentally fewer attack vectors to consider.

The codebase you will use is called ‘codenav‘. It is a library for creating LLM agents that can interact with one of the avilable environments to solve the user queries that require using an external codebase. For example, an agent can interact with "PythonCodeEnv" for code execution, and with "RetrievalEnv" for retrieving code snippets from an ElasticSearch index. It provides an "Episode" class for running this interaction with a specific agent and a set of environments. Here’s the directory structure:
codenav/agents/ - contains subdirectories that store implementations of LLM agents implemented with different LLMs codenav/environments/ - contains environments that the agent can interact with codenav/interaction/ - contains Episode and messages implementations (messages is how agent interacts with environments) codenav/prompts/ - stores various system prompts for the LLM agent codenav/retrieval/ - various files related to creating elastic search index and retrieving items from the index codenav/utils/ - contains various utility python files codenav/constants.py - contains important constants
Figure 5: Library description for CodeNav case study in Sec. 4.1
The codebase you’ll be using to solve user tasks is called ‘mnm‘. It has a single tool_api.py file which contains functions for various image, text, and audio related tasks. Specifically, here’s a high-level summary of available functions:
- text understanding functions: for tasks like text generation, summarization, classification, answering questions based on a text context
- image understanding functions: for tasks like image classification (1000 IMAGENET categories only), image captioning, answering questions about image (use this if you have a question that can’t be answered with IMAGENET classification), detecting objects (producing bounding boxes and labels for COCO categories) and segmenting objects (producing segmentation masks and labels for COCO categories), transcribing alphanumeric characters in an image to text (also known as optical character recognition).
- image editing functions: for generating images give a text description, editing images given the original image and a description of how to modify the image (can handle queries that require replacing or removing certain objects in the scene without detecting the object first), image crop**, or achieving effects like color pop and background blur given the segmented objects which you want to highlight in the image.
- information retrieval functions: for retrieving factual information or interesting facts about dates, years, numbers, movies, weather, geographical coordinates of a city, wikipedia articles, or fun trivia. Also includes a love calculator for checking compatibility given two names.
- object centric functions: these are functions that accept a list of detected or segmented objects for tasks like counting, selecting an object, tagging an image with objects (drawing bounding boxes and labels), or replacing objects with emojis. Note that these functions here do not detect the objects themselves.
- audio understanding functions: for tasks related to audio understanding like speech recognition
Figure 6: Library description for m&m’s as well as the multimodal case study in Sec. 4.2
The codebase you will use is called ‘phi‘ (already installed via pip).
Here are some key file paths: phi/tools/email.py - for sending emails (requires passkey which can be read from environment variable GOOGLE_KEY; use ABC as sender name and [email protected] as sender email id) phi/tools/arxiv_toolkit.py - for search arxiv for research papers phi/tools/tavily.py - an AI driven search engine (requires pass key stored in environment variable TAVILY_KEY) phi/tools/duckduckgo.py - a search engine similar to google phi/tools/newspaper4k.py - for scar** news articles phi/tools/pubmed.py - for searching biomedical and life sciences literature phi/tools/yfinance.py - for fetching financial data like stock prices from Yahoo Finance phi/tools/wikipedia.py - for searching wikipedia (need to use json.loads on the output to extract content)
Figure 7: Library description for PhiData case study in Sec. 4.3
The codebase you will use is called ‘m3eval‘. It has the following directory structure:
m3eval/travel_planner.py - function for planning travel including finding flights, making hotel reservation, and budget calculations m3eval/browser.py - contains the WebBrowser class for navigating web pages m3eval/dna_sequencer.py - various functions related to dna sequencing m3eval/message_decoder.py - functions for encoding and decoding messages including converting hex string to ascii and decoding caesar cipher m3eval/trade_calculator.py - functions for currency conversion, calculating tariffs etc
Figure 8: Library description for M3ToolEval.
The codebase you’ll be using to solve user tasks is called ‘api-bank‘. This codebase implements a collection of different APIs as classes, here is a full list of these APIs:
AddAgenda: The API for adding a agenda item includes content, time and location. AddAlarm: The API for setting an alarm includes a parameter for the alarm time. AddMeeting: This API allows users to make a reservation for a meeting and store the meeting information (e.g., topic, time, location, attendees) in the database. AddReminder: The API for adding a reminder item includes content and time. AddScene: This API adds a scene of smart home system, given the scene name and a list of smart devices AppointmentRegistration: This API registers an appointment of hospital. BookHotel: This API orders a hotel room. Two rooms are ordered if the number of adults is greater than 2. Only one order can be made at same time. Calculator: This API provides basic arithmetic operations: addition, subtraction, multiplication, and division. CancelRegistration: This API cancels the registration of a patient given appointment ID. CancelTimedSwitch: Cancels a timed switch for a smart device. CheckToken: Check the user token. DeleteAccount: Delete an account. DeleteAgenda: The API for deleting a schedule item includes parameters for token, content, time, and location. DeleteAlarm: The API for removing an alarm includes a parameter for the time. DeleteMeeting: This API allows users to delete a reservation for a meeting and remove the meeting information in the database. DeleteReminder: The API for deleting a reminder item includes content and time. DeleteScene: This API deletes a scene by its name. Dictionary: This API searches the dictionary for a given keyword. DocumentQA: This API answers the question from a given document url. EmergencyKnowledge: This API searches for a given symptom for emergency knowledge. ForgotPassword: Sends an email to the user with a link to reset the password. Need call twice, first with ’Forgot Password’ status to get the verification code, then call again with ’Verification Code’ status to change the password. Must pass the name of the parameters when calling the API, like ForgotPassword(status=’Forgot Password’, username=’username’). GetToday: This API gets the current date. GetUserToken: Get the user token by username and password. ImageCaption: This API generates a caption for a given image. ModifyAgenda: The API for modifying a schedule item includes parameters for content, time, and location. ModifyAlarm: The API for modifying an alarm includes a parameter for the from_time to to_time. ModifyMeeting: This API allows users to modify a reservation for a meeting ModifyPassword: The API for modifying the password of the account. ModifyRegistration: This API modifies the registration of a patient given appointment ID. ModifyReminder: The API for deleting a reminder item includes content and time. ModifyScene: This API modifies a scene of smart home system, given the scene name and a list of smart devices OpenBankAccount: This is an API for opening a bank account for a user, given the account, password and name. PlayMusic: This API triggers a music player to play music. QueryAgenda: The API for getting a schedule item includes parameters for token, content, time, and location. QueryAlarm: The API for querying alarm clock, help user to check the alarm clock they have set. QueryBalance: This API queries the balance of a given user. QueryHealthData: This API queries the recorded health data in database of a given user and time span. QueryHistoryToday: This API queries the history of the given date. QueryMeeting: This API allows users to query the information of a meeting. QueryRegistration: This API queries the registration of a patient, given patient ID. QueryReminder: The API for querying a reminder item includes content and time. QueryScene: This API queries a scene of smart home system, given the scene name QueryStock: This API queries the stock price of a given stock code and date. RecordHealthData: This API records the health data of a user. RegisterUser: The API for registering a account, given the username, password and email. SearchEngine: This API searches for a given keyword for search engine. SendEmail: This API for sending email, given the receiver, subject and content. SpeechRecognition: This API recognizes the speech from a given audio url. SymptomSearch: This API searches for a given symptom. TimedSwitch: This API for setting a timed switch for a smart device. ToolSearcher: Searches for relevant tools in library based on the keywords. Translate: Translate the text to the target language. Wiki: This API for searching a keyword in Wikipedia.
Figure 9: Library description for API-Bank.
The codebase you’ll be using to solve user tasks is called ‘mnm‘, this codebase consists of a single file called tool_api.py which contains the following functions:
text_generation(text) -> text: It takes an input text prompt and outputs a text that is most likely to follow the input text. text_summarization(text) -> text: it takes a paragraph of text and summarizes into a few sentences. text_classification(text) -> text: It takes a text and classifies it into a category in the model’s vocaburary (e.g. positive or negative based on its sentiment). question_answering(text, question) -> text: It takes a text and a question, and outputs an answer to that question based on the text. image_generation(text) -> image: It takes a text prompt and generates an image that matches the text description. image_captioning(image) -> text: It takes an image and generates a text caption of the image. optical_character_recognition(image) -> text: It takes an image and outputs recognized texts in the image. image_classification(image) -> text: It takes an image and classifies the subject in the image into a category such as cat or dog. image_editing(image, prompt) -> image: It takes an image and a text prompt and outputs a new image based on the text. object_detection(image) -> image, objects: It takes an image and outputs rectangular bounding boxes of objects detected in the image. image_segmentation(image) -> image, objects: It takes an image, segments it into different parts, and outputs segmentation masks of any shape for the parts. automatic_speech_recognition(audio) -> text: It takes an audio file and produces a transcription of the audio. visual_question_answering(image, question) -> text: It takes an image and a question about the image, and generates an answer to the question. image_crop(image, object) -> image: It takes an image and 4 numbers representing the coordinates of a bounding box and crops the image to the region within the box. image_crop_left(image) -> image: It takes an image, crops and keeps the left part of the image. image_crop_right(image) -> image: It takes an image, crops and keeps the right part of the image. image_crop_top(image) -> image: It takes an image, crops and keeps the top part of the image. image_crop_bottom(image) -> image: It takes an image, crops and keeps the bottom part of the image. background_blur(image, object) -> image: It takes an image and one or multiple objects in the foreground, and returns an image where the backgroud is blurred. color_pop(image, object) -> image: It takes an image and one or multiple objects, and returns an image where only the object is colored and the rest is black and white. count(objects) -> number: It takes a list of objects and returns the count of the objects. tag(image, objects) -> image: It takes an image and a list of objects with their bounding boxes and classes, and tags all the objects. select_object(objects, object_name) -> object: It takes a list of objects, and selects the object based on the input object name. emoji(image, object, emoji) -> image: It takes an image and the bounding box coordinates of one or multiple objects, and replaces the object with an emoji (e.g. angry/flushed/crying/dizzy/sleepy/grimacing/kissing/smiling_face, alien, ghost, goblin etc). get_date_fact(date) -> text: It provides interesting facts about dates. get_year_fact(year) -> text: It provides interesting facts about years. get_math_fact(number) -> text: It provides interesting math facts about numbers. get_trivia_fact(number) -> text: It provides interesting trivia facts about number. love_calculator(first_name, second_name) -> number: Enter your name and the name of your partner/lover/crush to find Love compatibility & chances of successful love relationship. get_location(city) -> lon, lat: Convert a city name or address to geographical coordinates using OpenStreetMap’s Nominatim API. search_movie(movie_title, movie_year) -> text: Retrieve basic movie information, including title, year, genre, and director. get_weather(lon, lat) -> objects: Provides weather forecast data based on specific geographical coordinates. wikipedia_simple_search(text) -> text: Perform a basic search query on Wikipedia to retrieve a summary of the most relevant page.
NOTE - all of the above functions produce a python dictionary as output. The function signatures above show the key present in the dictionary.
For example, get_year_fact(year) -> text means that the fact can be accessed through
output = get_year_fact(year)
fact = output[’text’]
Figure 10: Tool description for m&m’s.
The codebase you will use is called ‘m3eval‘. It has the following files and functions implemented in those files
file: m3eval/message_decoder.py [1] convert_hex_to_ascii: Converts a hexadecimal string to ASCII. Arguments: hex_string (str) Signature: convert_hex_to_ascii(hex_string: str) -> str [2] reverse_string: Reverses a string. Arguments: string (str) Signature: reverse_string(string: str) -> str [3] caesar_decode: Decodes a string using the Caesar cipher. Arguments: message (str), shift (int) Signature: caesar_decode(message: str, shift: int) -> str [4] string_length: Finds the length of a string. Arguments: string (str) Signature: string_length(string: str) -> int [5] minimum_value: Finds the minimum value from given arguments. Arguments: *args (variable number of arguments) Signature: minimum_value(*args) -> int/float [6] maximum_value: Finds the maximum value from given arguments. Arguments: *args (variable number of arguments) Signature: maximum_value(*args) -> int/float
file: m3eval/dna_sequencer.py [7] count_nucleotides: Counts the occurrences of each nucleotide in a DNA sequence. Arguments: dna_sequence (str) Signature: count_nucleotides(dna_sequence: str) -> dict [8] transcribe_dna_to_mrna: Transcribes DNA sequence to mRNA. Arguments: dna_sequence (str) Signature: transcribe_dna_to_mrna(dna_sequence: str) -> str [9] translate_mrna_to_amino_acid: Translates mRNA sequence to a chain of amino acids. Arguments: mrna_sequence (str) Signature: translate_mrna_to_amino_acid(mrna_sequence: str) -> str [10] find_max_nucleotide: Return the nucleotide (str) with the maximum count (int). Arguments: nucleotide_counts in the form of (k1, v1, k2, v2, …, kn, vn) Signature: find_max_nucleotide(*args) -> (str, int) [11] is_valid_dna_sequence: Checks if the DNA sequence is valid. Arguments: dna_sequence (str) Signature: is_valid_dna_sequence(dna_sequence: str) -> bool [12] reverse_transcribe_mrna_to_dna: Reverse transcribes mRNA sequence to DNA. Arguments: mrna_sequence (str) Signature: reverse_transcribe_mrna_to_dna(mrna_sequence: str) -> str
file: m3eval/trade_calculator.py [13] convert_currency: Converts the commodity price to local currency. Arguments: base_price (float), conversion_rate (float) Signature: convert_currency(base_price: float, conversion_rate: float) -> float [14] calculate_tariff: Calculates the trade tariff based on the converted price. Arguments: price (float), tariff_rate (float, in Signature: calculate_tariff(price: float, tariff_rate: float) -> float [15] estimate_final_value: Estimates the final trade value including the tariff. Arguments: price (float), tariff (float) Signature: estimate_final_value(price: float, tariff: float) -> float [16] calculator: Evaluates the given expression and returns the result. Accepts a calculation expression as input. For example, "2 + (3 * 4)" will return 14. Signature: calculator(expression: str) -> float [17] find_minimum: Finds the minimum value among the given arguments. Accepts variable number of float arguments. Signature: find_minimum(*args: float) -> float [18] find_maximum: Finds the maximum value among the given arguments. Accepts variable number of float arguments. Signature: find_maximum(*args: float) -> float
file: m3eval/travel_planner.py [19] find_flights: Finds flights based on source, destination and date. Arguments: from_location (str), to_location (str), date (str) in YYYY-MM-DD format. Returns a list of flights, each represented as a dictionary with keys "from_location", "to_location" (destination), "date", and "price". Example: ["from_location": "A", "to_location": "B", "date": "2023-12-25", "price": 450] Signature: find_flights(destination: str, date: str) -> List[Dict] [20] book_hotel: Books a hotel based on location and preferences. Arguments: location (str), *preferences (variable number of str arguments). Returns a list of hotels, each represented as a dictionary with keys "location", "preferences", "price_per_night", and "rating". Example: ["location": "A", "preferences": ["wifi", "pool"], "price_per_night": 120, "rating": 4] Signature: book_hotel(location: str, *preferences: str) -> List[Dict] [21] budget_calculator: Calculates the total budget for a trip. Arguments: flight_price (float), hotel_price_per_night (float), num_nights (int). Returns the total budget (float). Signature: budget_calculator(flight_price: float, hotel_price_per_night: float, num_nights: int) -> float
file: m3eval/browser.py Note - To use the browser functions first create a browser instance using browser=WebBrowser() [22] browser.click_url: Clicks on a URL. A clickable URL looks like [Clickable ’<url_argument>’] in the webpage. Arguments: url (str). Returns the rendered content of the webpage after clicking the URL showing on the current rendered page. Signature: browser.click_url(url: str) -> str [23] browser.go_to_previous_page(): Goes back to the previous page. It has no arguments. After going back to the previous page, return the rendered content of the webpage. Signature: browser.go_to_previous_page() -> str [24] browser.scroll_down: Scrolls down the view. It has no arguments. Returns the rendered content of the webpage after scrolling down. Signature: browser.scroll_down() -> str [25] browser.scroll_up: Scrolls up the view. It has no arguments. Returns the rendered content of the webpage after scrolling up. Signature: browser.scroll_up() -> str [26] browser.view: Return the current view in string format of the rendered webpage. It has no arguments. Returns the rendered content of the webpage. You should call this when you want to see the rendered content of the current webpage. Signature: browser.view() -> str
Figure 11: Tool description for M3ToolEval.
The codebase you’ll be using to solve user tasks is called ‘api-bank‘. This codebase implements a collection of different APIs as classes, here is a full list of these APIs: AddAgenda().call(token, content, time, location) Import as: from apis import AddAgenda Description: The API for adding a agenda item includes content, time and location. Arguments: - token (str): User’s token. - content (str): The content of the agenda. - time (str): The time for agenda. Format: - location (str): The location of the agenda. Returns: A dictionary whose "output" key has value of type str and description success or failed. AddAlarm().call(token, time) Import as: from apis import AddAlarm Description: The API for setting an alarm includes a parameter for the alarm time. Arguments: - token (str): User’s token. - time (str): The time for alarm. Format: Returns: A dictionary whose "output" key has value of type str and description success or failed. AddMeeting().call(token, meeting_topic, start_time, end_time, location, attendees) Import as: from apis import AddMeeting Description: This API allows users to make a reservation for a meeting and store the meeting information (e.g., topic, time, location, attendees) in the database. Arguments: - token (str): User’s token. - meeting_topic (str): The title of the meeting, no more than 50 characters. - start_time (str): The start time of the meeting, in the pattern of - end_time (str): The end time of the meeting, in the pattern of - location (str): The location where the meeting to be held, no more than 100 characters. - attendees (list(str)): The attendees of the meeting, including names, positions and other information. Returns: A dictionary whose "output" key has value of type str and description success or failed. AddReminder().call(token, content, time) Import as: from apis import AddReminder Description: The API for adding a reminder item includes content and time. Arguments: - token (str): User’s token. - content (str): The content of the conference. - time (str): The time for conference. Format: Returns: A dictionary whose "output" key has value of type str and description success or failed. AddScene().call(name, devices) Import as: from apis import AddScene Description: This API adds a scene of smart home system, given the scene name and a list of smart devices Arguments: - name (str): The name of the scene. - devices (list): The list of smart devices, containing the name and description. Format be like ["name": "light", "description": "Smart light in the kitchen", "name": "oven", "description": "Smart oven in the kitchen", "name": "range hood", "description": "Smart range hood in the kitchen"] Returns: A dictionary whose "output" key has value of type str and description Whether succeed.. AppointmentRegistration().call(patient_name, date, doctor_name) Import as: from apis import AppointmentRegistration Description: This API registers an appointment of hospital. Arguments: - patient_name (str): The name of patient. - date (str): The date of appointment. Format be like - doctor_name (str): The name of appointed doctor. Returns: A dictionary whose "output" key has value of type str and description The ID of appointment.. BookHotel().call(hotel_name, check_in_time, check_out_time, room_count, adult_count, child_count) Import as: from apis import BookHotel Description: This API orders a hotel room. Two rooms are ordered if the number of adults is greater than 2. Only one order can be made at same time. Arguments: - hotel_name (str): The name of the hotel. - check_in_time (str): The time to check in. Format: - check_out_time (str): The time to check out. Format: - room_count (int): The number of rooms to order. - adult_count (int): The number of adults. - child_count (int): The number of children. Returns: A dictionary whose "output" key has value of type str and description The ID of the order.. Calculator().call(formula) Import as: from apis import Calculator Description: This API provides basic arithmetic operations: addition, subtraction, multiplication, and division. Arguments: - formula (str): The formula that needs to be calculated. Only integers are supported. Valid operators are +, -, *, /, and (, ). For example, ’(1 + 2) * 3’. Returns: A dictionary whose "output" key has value of type float and description The result of the formula.. CancelRegistration().call(appointment_id) Import as: from apis import CancelRegistration Description: This API cancels the registration of a patient given appointment ID. Arguments: - appointment_id (str): The ID of appointment. Returns: A dictionary whose "output" key has value of type str and description The status of cancellation..
. . .
SearchEngine().call(keyword) Import as: from apis import SearchEngine Description: This API searches for a given keyword for search engine. Arguments: - keyword (str): The keyword to search. Returns: A dictionary whose "output" key has value of type list and description The list of results.. SendEmail().call(receiver, subject, content) Import as: from apis import SendEmail Description: This API for sending email, given the receiver, subject and content. Arguments: - receiver (str): The receiver address of the email. - subject (str): The subject address of the email. - content (str): The content of the email. Returns: A dictionary whose "output" key has value of type str and description The status of the email.. SpeechRecognition().call(url) Import as: from apis import SpeechRecognition Description: This API recognizes the speech from a given audio url. Arguments: - url (str): The url to download the audio. It should end with .wav. Returns: A dictionary whose "output" key has value of type str and description The transcript of the audio.. SymptomSearch().call(symptom) Import as: from apis import SymptomSearch Description: This API searches for a given symptom. Arguments: - symptom (str): The symptom to search. Returns: A dictionary whose "output" key has value of type list and description The list of results. Format be like ["name":possible disease name, "description": disease details,…]. TimedSwitch().call(name, time, on) Import as: from apis import TimedSwitch Description: This API for setting a timed switch for a smart device. Arguments: - name (str): The name of the smart device. - time (str): The time to switch the device on or off. Format: - on (bool): Whether to switch the device on or off. Returns: A dictionary whose "output" key has value of type str and description Whether the time switch is successful.. ToolSearcher().call(keywords) Import as: from apis import ToolSearcher Description: Searches for relevant tools in library based on the keywords. Arguments: - keywords (str): The keyword to search for. Returns: A dictionary whose "output" key has value of type Union[List[dict], dict] and description The best match tool(s).. Translate().call(src, src_lang, tgt_lang) Import as: from apis import Translate Description: Translate the text to the target language. Arguments: - src (str): The text to be translated. - src_lang (str): [Optional] The source language to translate from. Default is auto. - tgt_lang (str): [Optional] The target language to translate to. Default is english/en. Returns: A dictionary whose "output" key has value of type str and description The translated text.. Wiki().call(keyword) Import as: from apis import Wiki Description: This API for searching a keyword in Wikipedia. Arguments: - keyword (str): The keyword to search. Returns: A dictionary whose "output" key has value of type dict and description The list of results. Format be like "url": "xxx", "summary": "xxx", "content": "xxx".
Figure 12: Tool description for API-Bank. We only show the beginning and end of the full description for brevity.