CodeNav: Beyond tool-use to using real-world codebases with LLM agents

Tanmay Gupta* Luca Weihs* Aniruddha Kembhavi
PRIOR @ Allen Institute for AI
https://codenav.allenai.org
* equal contribution

Abstract

We present CodeNav, an LLM agent that navigates and leverages previously unseen code repositories to solve user queries. In contrast to tool-use LLM agents that require "registration" of all relevant tools via manual descriptions within the LLM context, CodeNav automatically indexes and searches over code blocks in the target codebase, finds relevant code snippets, imports them, and uses them to iteratively generate a solution with execution feedback. To highlight the core-capabilities of CodeNav, we first showcase three case studies where we use CodeNav for solving complex user queries using three diverse codebases. Next, on three benchmarks, we quantitatively compare the effectiveness of code-use (which only has access to the target codebase) to tool-use (which has privileged access to all tool names and descriptions). Finally, we study the effect of varying kinds of tool and library descriptions on code-use performance, as well as investigate the advantage of the agent seeing source code as opposed to natural descriptions of code. All code will be made open source under a permissive license.

1 Introduction

Today, tool-use is the de facto approach for enabling LLMs to interact with external systems or programs to complete domain-specific tasks (Gupta and Kembhavi, 2023; Sur’is et al., 2023; Wu et al., 2023; Yang et al., 2023; Wang et al., 2023). In this tool-use paradigm, a list of functions, or more generally code snippets, are first “registered” with the LLM by adding their descriptions, potentially with examples of their usage, to the LLM’s context. Then, given a user query, the LLM generates invocations of these tools, executed in an external environment, to solve the user’s task.

While tool-use enables new capabilities, it also limits the expressiveness of LLMs by constraining them to invoke only a handful of meticulously described functions or API calls. With LLMs becoming increasingly capable at code understanding and generation (Wang et al., 2024; Chen et al., 2021a; GitHub, 2024; Jimenez et al., 2024), we argue that it is time to move beyond tool-use to code-use, i.e. move from meticulously designed and registered tools to a setting where the LLM agent directly reads, imports, and uses the source code of any given repository¹¹1We use the terms codebase, library, and repository interchangeably. to solve a user’s query.

An effective code-use agent must be able to identify and use the right code snippets (functions, classes, constants, etc.) from the codebase to solve the given query. This free-form search over code is made possible due to a simple observation: well-designed libraries are written by humans for humans. These libraries often make crucial domain-specific assumptions, use meaningful abstractions and variable names, and organize code in files and directories so that it is easy to discover relevant functionality and document concisely. Instead of describing every single function or class in every file as done in tool-use, code-use leverages this structure to search for the required snippets. Code-use may further benefit from a high-level library description that exposes the structure inherent in the codebase. This may be done in various ways: e.g., highlighting important directories, files, or entry points, describing the contents and purpose of complex files, or elucidating the library’s abstractions and assumptions. This library description often already exists in the form of a README file, can be manually provided, or generated automatically via rule based parsing or an LLM.

We propose CodeNav, a single-agent, multi-environment, interaction framework (see Fig. 1) where an agent Navigates through a given Codebase to find the code snippets it needs to solve a users’ query. Given a user query and high-level library description, CodeNav iterates between searching the codebase for useful code snippets and generating part of the solution code that imports, instantiates, or calls the relevant variables, classes, or functions defined in the retrieved results. In this iterative process, the agent inspects execution results, fixes errors (if any), searches to resolve ambiguities, and gradually builds a solution to the query. The multi-environment setup makes it easy to extend the system’s capability by adding a new environment (e.g., a \mintinlinepythonterminal) along with its supported actions (e.g., \mintinlinepythonbash commands) and its responses to those actions (e.g., \mintinlinepythonSTDOUT).

For evaluating CodeNav, we begin by quantifying the gap between tool-use and code-use on three existing tool-use benchmarks: m&m’s (Ma et al., 2024), M³ToolEval (Wang et al., 2024), and API-Bank (Li et al., 2023b). To adapt these benchmarks for code-use, we provide the files containing the tool implementations as the target codebase. Tool-use forms an upper bound for code-use as the tool-use agent is privy to comprehensive, hand-crafted information that is available for these benchmarks. Surprisingly, despite not having access to this privileged information, we find that CodeNav is competitive with tool-use across these evaluations. While these results are impressive, they fall short of demonstrating the full potential of code-use. To this end, we present three qualitative case studies using distinct codebases. In these case studies, we find that CodeNav can follow complex multi-step queries, recover from execution errors, run iterative searches to better understand usage of code snippets, and format results as per user instructions to visualize computations (e.g., as a webpage).

We argue that code-use is more than just a niche application of generating code conditioned on another repository. We envision a future where code-use is how LLMs discover and use domain-specific tools without imposing any constraints on how the tools are developed or written (e.g., as a list of simplistic functions) and without any interventions (e.g., manual description and registration of tools) required for LLMs to use them. Our results are suggestive of a future where a codebase is all you need! We highlight six key contributions. (1) We introduce a novel code-use paradigm for LLM agents to move beyond tool-use to directly using real-world code bases to solve complex user queries. (2) We propose CodeNav that formulates code-use as a multi-step interaction between a single LLM agent and stateful retrieval, and code execution environments. (3) On three tool-use benchmarks (m&m’s, M³ToolEval, and API-Bank), we find a minimal gap from code-use to the tool-use upper bound without requiring arduous tool registration. (4) We study the effect of library or tool description richness on code-use performance. (5) We investigate the advantage of having access to the source code as part of retrieval result as opposed to just function signatures or docstrings. (6) We present three case studies to demonstrate the promise of code-use agents on solving complex queries using real-world codebases. Our code will be made open source under a permissive license.

Refer to caption — Figure 1: CodeNav’s single-agent, multi-environment interaction protocol. Given a user query, a brief description of the codebase (*library description*), and the interaction history, the LLM agent produces an \mintinlinepythonaction comprising of a *thought*, action *type*, and action *content*. The action gets executed in the target environment (identified by action *type*) to produce a \mintinlinepythonresponse. The interaction at the current step consisting of the \mintinlinepythonaction-\mintinlinepythonresponse pair is appended to the interaction history as context for the LLM to produce the next action.

2 Related Work

Tool-use with LLMs. Tool-use refers to an LLM identifying and invoking appropriate tools or functions to solve a task described by the user. Training-free methods for tool-use require first "registering" tools with the LLM. Tool registration techniques include descriptive registration by describing the tools (e.g., signature and docstrings of the class or function that implements a tool) (Hsieh et al., 2023; Sur’is et al., 2023) or prescriptive registration by providing in-context examples of similar task descriptions and corresponding tool invocations (Gupta and Kembhavi, 2023). Training-based methods involve finetuning an LLM on instruction and tool invocation pairs (Qin et al., 2024). In this work, we focus on training-free methods for tool-use and code-use.

Retrieval for tool-use. Scaling training-free methods to a large number of tools is challenging due to limited (albeit steadily increasing) context windows of LLMs. An approach to circumvent this is retrieving only the relevant tool descriptions or usage examples from a library. For instance, Hsieh et al. (2023) employ TF-IDF search to retrieve relevant tool documentation. DocPrompting (Zhou et al., 2023) explores sparse and dense documentation retrieval for code generation. ART (Paranjape et al., 2023) use cross-task demonstration retrieval for multi-step reasoning and tool-use. EcoAssistant (Zhang et al., 2023) saves past successful solutions to user queries in a database and for new queries retrieves solutions to similar queries in the database as demonstrations. Unlike previous works that retrieve documentation, our CodeNav agent directly retrieves source code using boolean Elasticsearch queries (e.g. (type: CLASS) AND (text: ObjectDetection)).

Prompting strategies for tool-use. For user queries that require multi-step tool-use solutions, the agent has to simultaneously exercise the ability to plan (which tools to use and when) as well as invoke the tools correctly. To improve planning, prompting strategies like Chain of Thought (Wei et al., 2022) prompt the LLM to precede the actual solution with a thought that describes the step-by-step plan. ReAct (Yao et al., 2023) interleaves thought, action (tool invocation), and observation (result of executing the action or feedback). CodeAct (Wang et al., 2024) generalizes ReAct by allowing actions to be free-form code. In contrast to CodeAct, CodeNav uses a single-agent, multi-environment framework in which the agent can both search for and execute code. Further, CodeAct operates in the tool-use regime where the exact tools that are needed for the query are registered ahead of time.

Code generation. While not all tool-use methods require the LLM to generate executable code (Khot et al., 2022) (instead generating function names and arguments and using a custom interpreter), tool-use may be implemented as free-form code generation (Wang et al., 2024; Ma et al., 2024). Code generation with LLMs has been explored for a wide range of applications including code-completion in code editors (GitHub, 2024), generating functions from docstrings (Chen et al., 2021b), editing files in a repository to fix Github issues (Jimenez et al., 2024), and even generating an entire repository consisting of multiple files from a single natural language instruction (Osika, 2023).

Feedback and correction. As tasks get more complex, a single LLM call may only produce a partial or partially correct solution. For such tasks, agentic workflows enable a closed-loop system (Wu et al., 2023; Wang et al., 2023) where the LLM agent iteratively produces an intermediate solution, receives feedback on the solution, and proceeds to either fix any errors or generate the next step. In context of code generation and tool-use, the feedback may consist of execution output such as variable values and exceptions raised, output of test cases (Huang et al., 2023), output of static analysis tools like linting and type-checking, and human or LLM feedback (Madaan et al., 2023). Recent works have also explored emulating execution feedback using an LLM (Li et al., 2023a; Ni et al., 2024).

3 CodeNav

3.1 Overview

We formulate code-use with LLMs in a single-agent, multi-environment interaction framework consisting of stateful code retrieval and code execution environments (see Fig. 1). The agent is given a brief high-level library description (e.g. important directory or file paths, description of content and purpose of complex files, crucial abstractions and assumptions made by the library, etc.) and the user query to be solved. The agent then proceeds to interact with these environments over multiple rounds. Each interaction consists of an \mintinlinepythonaction from the agent and a \mintinlinepythonresponse from an environment. Each \mintinlinepythonaction consists of a (i) thought used for chain-of-thought reasoning (Wei et al., 2022; Yao et al., 2023), (ii) an action type specifying the environment the agent wishes to act upon (e.g. code, search, etc.), and (iii) the action content which is executed in the selected environment. The action content gets routed to the appropriate environment based on the action type and the environment executes the content updating its state and producing a \mintinlinepythonresponse. The history of past interactions is provided to the agent as context to generate the next action. The interactions continue up to a maximum number of interactions specified by the user or until the agent takes the \mintinlinepythondone action. We now describe details of environments, actions, and responses.

3.2 Environments

For code-use, the agent must be able to perform two essential functions: (i) search for or discover relevant code snippets in the target codebase; (ii) generate the next portion of the code solution that imports, instantiates, or calls the needed classes or functions. In CodeNav, the agent performs these functions by taking actions in one of the following environments.

Retrieval Environment. This environment serves as an interface to a search index for fetching code snippets from the target codebase with rule-based re-ranking and a persistent memory to avoid resurfacing past retrievals. We parse the entire codebase and index all functions, classes, import statements, assignments, etc. as individual documents. We implement the index using \mintinlinepythonElasticsearch (\mintinlinepythonES) with fields for code string, code type (function, class, assignment, import etc.), file path, and line numbers (Elastic, 2024). The agent can use a \mintinlinepythonsearch action type to issue search queries into this index. Given the action, the environment first uses the action content as the search query to the \mintinlinepythonES index, discards any matches that have already been retrieved by past searches during the episode, and then re-ranks the results based on heuristic rules (e.g., to prioritize functions and classes and de-prioritize assignments and import statements). Finally, the top- $k$ results are added to the environments persistent memory and returned as the environment reponse. Issuing the same action again surfaces the $k$ next-best matches.

Execution Environment. Actions of type \mintinlinepythoncode get routed to a \mintinlinepythonPython execution environment.²²2In this work we consider only Python code generation for simplicity but here is no limitation on CodeNav preventing extension to other languages. At the beginning of the episode, the environment is initialized with an empty \mintinlinepythonglobal_variables dictionary. Each code block is executed in scope of these global variables (using \mintinlinepythonexec(code_str, global_vars)) and any changes to the global namespace (i.e., modification or deletion of existing variables, or creation of new variables) are reflected in this dictionary. Prior to execution, the environment optionally performs linting (using \mintinlinepythonflake8), type-checking (using \mintinlinepythonmypy), and formatting (using \mintinlinepythonblack) of the code block. Standard output (\mintinlinepythonstdout), updated variables in \mintinlinepythonglobal_vars (new variables or variables whose string representations have changed), and errors if any (execution, linting, or type-checking) are returned as part of the response.

Done and Code Summary Environments. Additionally, we create a few helper environments. The Done Environment returns a \mintinlinepythonnull response to the \mintinlinepythondone action that marks the end of the episode. The Code Summary Environment treats the content of the \mintinlinepythoncode_summary action as a cleaned up summary of the code solution produced by the agent during the episode and saves it for easy access.

3.3 Agent Actions

The CodeNav agent interacts with the environments using an \mintinlinepythonaction that consists of 3 components: thought, type, and content, see Fig. 1 (right). The LLM underlying the agent is prompted to only produce outputs in a structured XML format and in compliance with a set of rules (e.g., thought and type must always be provided; type must be one of the available action types; etc.). Before routing the \mintinlinepythonaction to the target environment, we check its validity. If the \mintinlinepythonaction is found to be invalid, an \mintinlinepythonInvalidAction response is returned to the agent containing a description of the violated rule. The agent may use this violation description to fix the \mintinlinepythonaction in the next step.

3.4 Environment Responses

Each action elicits a response from the target environment or an \mintinlinepythonInvalidAction response if the action is invalid for execution in the target environment. Each response in CodeNav is implemented as a data class with a \mintinlinepythonformat() method that specifies how the response data should be serialized to a text string to be included in the agent’s context when predicting the next action.

Retrieval Response. Given a search query (i.e., the content of a \mintinlinepythonsearch action), the retrieval environment returns a list of documents (containing code blocks from the target codebase along with metadata) from the search index that match the search queries. Implementation of the \mintinlinepythonformat() method of the retrieval response answers the question: what should be shown to the agent from these retrieved code blocks? On one hand, the agent may gain a better understanding of how to use a class or a function by reading its source code (containing function signatures, argument types, outputs, and implementation details) as opposed to reading an imprecise, incomplete, outdated, or entirely absent human written description or \mintinlinepythondocstring of the class or function. On the other hand, showing all implementation details for every retrieved code block results in an explosion in the number of context tokens to be processed by the LLM. To strike a balance, we first retrieve up to $M$ ( $=100$ ) matched documents. From these, we show the top- $K$ ( $=3$ ) matches with source code and metadata. For large codebases, this is usually insufficient to surface target code blocks unless the exact function or class name is given in the search query. Therefore, we additionally show prototypes (signatures and filenames) for up to $P$ classes or functions in the remaining retrieved results. Further, for the top-K matches, we use GPT-4 to generate \mintinlinepythondocstrings for the top-3 retrievals and show whichever is shorter between the source code and the function signature with the generated docstring.

Execution Response. Given a \mintinlinepythoncode action, the Python execution environment executes the content producing a response. The \mintinlinepythonformat() method of a response serializes standard output (\mintinlinepythonstdout), variables changed during execution shown as variable names along with string representation of their values, execution errors if any, and (optionally) linting, type-checking, and formatting errors. The error messages contain reference to the line in the code string that produced the error to help localize the error. The \mintinlinepythonstdout and changed variables allow easy inspection of function calls but can get quite long (e.g., when printing a large array) and are therefore truncated to a maximum number of characters. We show the start and end of \mintinlinepythonstdout and the beginning of variable values.

4 Case Studies

While we quantitatively evaluate CodeNav on tool-use benchmarks in Sec. 5, these benchmarks are not sufficiently complex to highlight the advantages of code-use over tool-use. Therefore, we showcase CodeNav’s impressive capabilities in three case studies using diverse codebases to solve complex queries. For the first case study (Sec. 4.1), Fig. 2 depicts the entire episode. For the other two case studies, we show the inputs and outputs in Fig. 3 and provide the full episodes as part of the supplementary material. For all case studies we provide library descriptions in the appendix (App. C). Please also see App. Table 6 for information about the size and complexity for the codebases used in these case studies and our quantitative experiments (e.g., searching the transformers library requires searching over 50,508 snippets in 3,475 files).

4.1 CodeNav on CodeNav

We imagine a researcher who, possibly after reading this paper, wishes to use CodeNav to answer a query using the transformers library (Wolf et al., 2020). In place of a researcher however, we use a CodeNav agent; i.e., a CodeNav agent uses the CodeNav repository to instantiate another CodeNav agent to answer a given query with transformers. This example serves two goals: (1) it provides a pedagogical example of using our codebase, and (2) it shows CodeNav’s zero-shot abilities as we can guarantee that the underlying LLM (GPT-4) was not trained on our codebase.

For this case-study, our user query consists of 7 steps divided into 2 distinct parts (see User query in Fig. 2). Steps 1-4 specify instructions for creating and running the episode while Steps 5-7 contain instructions to visualize the results of the interaction. In Steps 1-4, the user asks CodeNav to first create an agent using \mintinlinepythonOpenAICodenavAgent and to instantiate various environments using \mintinlinepythonPythonCodeEnv, \mintinlinepythonRetrievalEnv, and \mintinlinepythonDoneEnv with the specified parameters like the Elasticsearch host and index name to use for retrieval. Then the query asks the agent to create an episode for solving another query (within the original query!) using \mintinlinepythontransformers. This “subquery” requires the agent to detect dogs in an image (specified by a file path) using the \mintinlinepythonfacebook/detr-resnet-101 model in the object detection pipeline, add red detection boxes on the image, and store the image in variable \mintinlinepythondetected_dogs. The subquery also asks agent to store the detection coordinates and scores as a pandas dataframe in the variable \mintinlinepythondetection_coords_and_scores. The first part of the full query ends with asking the agent to run the episode for a maximum of 10 steps Steps 5-7 specify how to visualize the interaction. Specifically, we ask the agent to: (i) tabulate the interaction as a dataframe with columns for action type and thought; (ii) save the \mintinlinepythondetected_dogs image as a PNG at a specified file path; and (iii) print \mintinlinepythondetection_coords_and_scores.

The CodeNav episode for the above can be found in Fig. 2. The agent initially searches for information about the \mintinlinepythonOpenAICodenavAgent class and the environments (A1, R1, A2, R2), and then attempts to instantiate them with code (A3). This code results in an error due to a misuse of the \mintinlinepythonRetrievalEnv initializer (R3). The agent then searches for additional information to resolve this error (A4, R4) and eventually succeeds in running a new CodeNav agent using transformers (A9, R9). Finally it prints and saves the requested outputs.

4.2 Multimodal Processing and Reasoning

In our proposed code-use formulation, multimodal tasks are identical to text-only tasks so long as multimodal processing functionality is available in the target codebase. We demonstrate this with an image editing task requiring visual reasoning to localize the region to edit. We use the m&m’s codebase as it contains computer vision tools for detection, segmentation, QA, etc. In this case-study, see Fig. 3 (top), the agent is required to find and highlight the person wearing glasses who is talking on the phone. Our query first specifies steps to localize the region to edit. In particular, the agent is instructed to first segment the image and select the person segments. Then for each person, the agent should zoom in on the face by taking a crop of the top-third of the person bounding box. For each face, the agent must use visual question answering to verify whether the person is wearing glasses and is talking on cell phone. To visualize these predictions, the agent is instructed to save these face crops along with the predictions as an HTML table. Finally, the agent is instructed to highlight the person for which both attributes are true via a color-pop effect. We observe that CodeNav not only uses multimodal models proficiently but also uses the outputs to perform visual reasoning. Further, this case study highlights CodeNav’s ability to produce human-interpretable intermediate outputs.

4.3 Research assistant

Imagine an agent that curates reading material on a given topic including news articles, blog posts, and research papers and then emails it to you. For CodeNav, this is simply a matter of using a codebase that provides the necessary functionality for querying knowledge sources on the web and sending emails. One such codebase is PhiData (phi, 2024). In this case-study (see Fig. 3, bottom) we query the agent to curate reading material on "Alphafold-3" consisting of definition from Wikipedia, a list of news articles, and a list of papers on "Protein folding with Deep Learning". Further, our query specifies various presentation requirements: e.g., for each news article, we require the agent to show the article source, title, link, and first 140 characters. Finally, we ask the agent to write this information to an HTML and markdown-formatted documents. The text document’s content is then sent to the user’s email address with the specified subject. This case study demonstrates the versatility of CodeNav to create specialized agents simply by providing an appropriate codebase. Here CodeNav serves as a research assistant by using functionality built in PhiData to query Wikipedia, DuckDuckGo News, and arXiv.

5 Experiments

We now present our quantitative results on 3 tool-use benchmarks: (1) m&m’s, which requires multi-step planning with 33 multi-modal, e.g. vision/language, tools (Ma et al., 2024); (2) API-Bank that involves managing user state in sandboxed environment via calls to any of 73 APIs (Li et al., 2023b); and (3) M³ToolEval that contains “82 human-curated tasks” requiring multiple tools, calls, and interactions (Wang et al., 2024). In all experiments we report the mean performance across 3 independent evaluations; the reported $\pm$ values correspond to $2{\times}$ the standard dev. across evaluations. As these benchmarks were not necessarily designed for evaluation with code-use agent in mind (e.g., API-Bank uses JSON API calls), this has necessitated some changes in how we evaluate CodeNav on these benchmarks. See App. A for these details as well as descriptions of our metrics.

5.1 How does code-use compare to tool-use on tool-use benchmarks?

Tool-use benchmarks are designed to test LLMs’ ability to invoke a small set of pre-registered tools. Since tools in these benchmarks are relatively simple function calls with human written descriptions, code-use is upper bounded by tool-use on these benchmarks. We wish to quantify the gap between code-use (where source code search is necessary) and tool-use (where no search is needed as tool names/descriptions are provided). Tab. 1 shows that on m&m’s and API-Bank, code-use achieves similar or slightly lower tool-f1. On both benchmarks, code-use takes a minor hit on tool-recall. This is intuitive as tool-use is provided tool names while code-use has the harder task of searching to discover available tools. On M³ToolEval, which evaluates final answer correctness, code-use is within two points of tool-use. In all datasets on average, code-use takes ${\sim}$ 2 more interaction steps compared to tool-use to search for required tools. Finally, code-use results in only a minor increase in performance variance despite the added uncertainly due to lack of knowledge of available tools.

Table 1: Code-use is competitive with tool-use even without tool prompts.

m&m’s

M³ToolEval

API-Bank

method

precision

recall

steps

accuracy

steps

precision

recall

steps

tool-use

82.9 ± 4.5

81.7 ± 0.4

79.6 ± 2.3

4.9 ± 0.1

83.7 ± 2.8

6.6 ± 0.5

86.6 ± 0.8

93.6 ± 1.1

88.5 ± 0.7

3.4 ± 0.1

code-use

88.0 ± 6.1

78.2 ± 4.5

80.6 ± 5.1

7.2 ± 0.2

81.7 ± 4.9

7.8 ± 0.4

84.0 ± 0.7

89.3 ± 0.6

85.3 ± 0.3

5.3 ± 0.0

5.2 Is a library description sufficient for tool-use?

Table 2: Desc. ablations on m&m’s.

tool description	length	f1	steps
w/o desc	0	74.1 ± 1.9	7.0 ± 0.1
tool names	694	78.1 ± 3.5	6.7 ± 0.2
+ desc	3680	80.8 ± 0.4	6.9 ± 0.1
+ prototypes	4627	80.7 ± 5.0	6.1 ± 0.1
library desc (CodeNav)	2061	80.6 ± 5.1	7.2 ± 0.2

Instead of meticulously listing each tool or function name along with detailed descriptions of its purpose and input and output arguments, CodeNav allows a user to only provide a high-level description of the codebase or library that implements these tools. In Tab. 2, we compare library description (Fig. 6 in appendix) with CodeNav without any description as well as tool descriptions with three "levels of detail"; the lowest level contains only the tool names while the richest contains tool names, descriptions, and function signatures (Fig. 10 in appendix). As providing tool details in the prompt helps the agent identify the tools needed to solve a query and use them correctly, we find a consistent increase in tool-f1 with increasing tool detail. Spectacularly, library description achieves similar performance as the richest tool description with less than half the description length. The convenience of not requiring description of each tool comes at the cost of minor increase in number of steps since now the agent needs to search and discover the tool as opposed to recalling from context.

5.3 Does seeing the source code help code-use?

Table 3: Search response formatting ablation on m&m’s.

retrieval response	f1	steps
prototypes	80.0 ± 5.2	7.8 ± 0.1
code	81.2 ± 2.3	7.3 ± 0.2
code or docstring (CodeNav)	80.6 ± 5.1	7.2 ± 0.2

Well written code is its own documentation. Therefore, for an agent that is proficient in code understanding, seeing the actual implementation details in the source code (which might also include docstrings) alleviates the need to manually register tools and provides strictly more information than just the function signatures or prototypes. We compare various retrieval response formats in Tab. 3 using the m&m’s benchmark. We observe that returning code for the top-3 matches results in higher tool-f1 than showing prototypes only in the retrieval response. We also see a reduction in number of interaction steps needed by the agent to solve the query which has been a consistent indicator of lower uncertainty in deciding which tools to use and how. Finally, real-world code blocks can span 100s of lines which quickly increases the context length to be processed by the agent. To remedy this we generate docstrings for the top-3 retrievals using GPT-4 and between the docstring and the raw code we show whichever is shorter. As expected, this default configuration in CodeNav achieves tool-f1 higher than prototypes only but lower than code.

5.4 What makes a good library description?

Table 4: Comparing two library descriptions on M³ToolEval. To provide a reference for length, desc. length for tool-use for M³ToolEval is 5K.

library description	length	accuracy	steps
file path	253	76.8 ± 6.5	8.3 ± 0.3
file path + file desc	641	81.7 ± 4.9	7.8 ± 0.4

We demonstrate the impact of library description in Tab. 4. M³ToolEval is implemented as a codebase consisting of 5 files, each containing tools dedicated to a problem domain; web browsing, travel planning, dna sequencing, message encryption, and financial calculations. The first library description simply provides relative file paths to these files (e.g., m3eval/travel_planner.py), while the second description also includes a one line summary of what the file contains (e.g., “functions for planning travel including finding flights, making hotel reservation, and budget calculations”). Intuitively, this enables CodeNav to come up with keywords for search and results in superior performance with fewer interactions needed to reach a solution. We provide the following three recommendations to write good library descriptions: (i) provide context for the target domain to enable the LLM to then use its knowledge of the domain to generate useful keywords for search; (ii) describe library structure (e.g., directory structure or key assumptions the library makes); (iii) provide a brief (not necessarily exhaustive) natural language description of the available functionality.

5.5 Impact of LLM choice on performance

Table 5: LLM choice ablation on m&m’s.

LLM	precision	recall	f1	steps
gpt-4-1106-preview	88.0 ± 6.1	78.2 ± 4.5	80.6 ± 5.1	7.2 ± 0.2
gpt-3.5-turbo-0125	54.36 ± 2.3	15.77 ± 1.4	22.96 ± 0.8	9.08 ± 1.07
Mixtral-8x22B-Instruct-v0.1	82.50 ± 2.1	62.31 ± 1.9	67.91 ± 1.3	9.06 ± 0.3
Qwen1.5-110B-Chat	78.15 ± 3.1	38.49 ± 5.5	48.84 ± 5.1	10.00 ± 0.2

We have used GPT-4 (OpenAI, 2023), in particular gpt-4-1106-preview,³³3https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4 as the LLM underlying CodeNav. As GPT-4 is one of the most performant publicly available LLMs, it is natural to ask how CodeNav performance degrades when using smaller, possibly open source, LLMs. We run m&m’s evaluations using CodeNav when replacing GPT-4 with GPT-3.5, the Mistral 8 $\times$ 22B mixture-of-experts model (Jiang et al., 2023, 2024), and the Qwen1.5 110B model (Bai et al., 2023).⁴⁴4Model completions obtained via the together.ai API, see https://docs.together.ai. We chose to use the Mistral and Qwen models as they represent some of the largest, open-source, LLMs with long context windows (empirically, LLMs with context sizes below 16k tokens regularly fail by exceeding their context limit). Our results are displayed in Table 5. While the GPT-4 powered CodeNav outperforms, the open source models do perform quite well falling behind primarily in their recall (suggesting search failures). Surprisingly the GPT-3.5 variant performs poorly; when inspecting the failed trajectories this appears to often be caused by the agent failing to appropriately summarize its actions at the end of the episode which may ameliorated by additional prompt tuning.

6 Discussion

We have argued that it is time to move from tool-use to code-use; from feeding LLM agents manually curated and meticuluously descibed tool sets, to instead presenting them with existing codebases written by humans for humans. As we have shown in our case-studies and quantitative results, simply remarkable behavior can be obtained by code-use agents when using modern LLMs so long as considerable care is taken to engineer code search and execution environments that provide the agent with significant flexibility and feedback.

References

phi [2024] phidata, 2024. URL https://github.com/phidatahq/phidata.
Bai et al. [2023] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, B. Hui, L. Ji, M. Li, J. Lin, R. Lin, D. Liu, G. Liu, C. Lu, K. Lu, J. Ma, R. Men, X. Ren, X. Ren, C. Tan, S. Tan, J. Tu, P. Wang, S. Wang, W. Wang, S. Wu, B. Xu, J. Xu, A. Yang, H. Yang, J. Yang, S. Yang, Y. Yao, B. Yu, H. Yuan, Z. Yuan, J. Zhang, X. Zhang, Y. Zhang, Z. Zhang, C. Zhou, J. Zhou, X. Zhou, and T. Zhu. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
Chen et al. [2021a] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021a. URL https://arxiv.longhoe.net/abs/2107.03374.
Chen et al. [2021b] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. Ponde, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. W. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, I. Babuschkin, S. Balaji, S. Jain, A. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba. Evaluating large language models trained on code. ArXiv, abs/2107.03374, 2021b. URL https://api.semanticscholar.org/CorpusID:235755472.
Elastic [2024] Elastic. Elasticsearch, 2024. URL https://www.elastic.co/elasticsearch/. Version 8.12.1.
GitHub [2024] GitHub. GitHub Copilot - Your AI pair programmer, 2024. URL https://github.com/features/copilot. Accessed: 2024-05-16.
Gupta and Kembhavi [2023] T. Gupta and A. Kembhavi. Visual programming: Compositional visual reasoning without training. CVPR, pages 14953–14962, 2023. URL https://api.semanticscholar.org/CorpusID:253734854.
Hsieh et al. [2023] C.-Y. Hsieh, S. Chen, C.-L. Li, Y. Fujii, A. J. Ratner, C.-Y. Lee, R. Krishna, and T. Pfister. Tool documentation enables zero-shot tool-usage with large language models. ArXiv, abs/2308.00675, 2023. URL https://api.semanticscholar.org/CorpusID:260351459.
Huang et al. [2023] D. Huang, Q. Bu, J. M. Zhang, M. Luck, and H. Cui. Agentcoder: Multi-agent-based code generation with iterative testing and optimisation. ArXiv, abs/2312.13010, 2023. URL https://api.semanticscholar.org/CorpusID:266374622.
Jiang et al. [2023] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de Las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mistral 7b. CoRR, abs/2310.06825, 2023. doi: 10.48550/ARXIV.2310.06825. URL https://doi.org/10.48550/arXiv.2310.06825.
Jiang et al. [2024] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de Las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. L. Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mixtral of experts. CoRR, abs/2401.04088, 2024. doi: 10.48550/ARXIV.2401.04088. URL https://doi.org/10.48550/arXiv.2401.04088.
Jimenez et al. [2024] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan. SWE-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66.
Khot et al. [2022] T. Khot, H. Trivedi, M. Finlayson, Y. Fu, K. Richardson, P. Clark, and A. Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. ArXiv, abs/2210.02406, 2022. URL https://api.semanticscholar.org/CorpusID:252715485.
Li et al. [2023a] C. Li, J. Liang, A. Zeng, X. Chen, K. Hausman, D. Sadigh, S. Levine, F.-F. Li, F. Xia, and B. Ichter. Chain of code: Reasoning with a language model-augmented code emulator. ArXiv, abs/2312.04474, 2023a. URL https://api.semanticscholar.org/CorpusID:266051661.
Li et al. [2023b] M. Li, Y. Zhao, B. Yu, F. Song, H. Li, H. Yu, Z. Li, F. Huang, and Y. Li. Api-bank: A comprehensive benchmark for tool-augmented llms. In H. Bouamor, J. Pino, and K. Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 3102–3116. Association for Computational Linguistics, 2023b. doi: 10.18653/V1/2023.EMNLP-MAIN.187. URL https://doi.org/10.18653/v1/2023.emnlp-main.187.
Ma et al. [2024] Z. Ma, W. Huang, J. Zhang, T. Gupta, and R. Krishna. m&m’s: A benchmark to evaluate tool-use for multi-step multi-modal tasks. ArXiv, abs/2403.11085, 2024. URL https://api.semanticscholar.org/CorpusID:268512938.
Madaan et al. [2023] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, S. Welleck, B. P. Majumder, S. Gupta, A. Yazdanbakhsh, and P. Clark. Self-refine: Iterative refinement with self-feedback. ArXiv, abs/2303.17651, 2023. URL https://api.semanticscholar.org/CorpusID:257900871.
Ni et al. [2024] A. Ni, M. Allamanis, A. Cohan, Y. Deng, K. Shi, C. Sutton, and P. Yin. Next: Teaching large language models to reason about code execution. 2024. URL https://api.semanticscholar.org/CorpusID:269302914.
OpenAI [2023] OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023. doi: 10.48550/ARXIV.2303.08774. URL https://doi.org/10.48550/arXiv.2303.08774.
Osika [2023] A. Osika. Gpt-engineer, 2023. URL https://github.com/gpt-engineer-org/gpt-engineer.
Paranjape et al. [2023] B. Paranjape, S. M. Lundberg, S. Singh, H. Hajishirzi, L. Zettlemoyer, and M. T. Ribeiro. ART: automatic multi-step reasoning and tool-use for large language models. CoRR, abs/2303.09014, 2023. doi: 10.48550/ARXIV.2303.09014. URL https://doi.org/10.48550/arXiv.2303.09014.
Qin et al. [2024] Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y.-T. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, S. Zhao, R. Tian, R. Xie, J. Zhou, M. H. Gerstein, D. Li, Z. Liu, and M. Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis. ICLR, abs/2307.16789, 2024. URL https://api.semanticscholar.org/CorpusID:260334759.
Sur’is et al. [2023] D. Sur’is, S. Menon, and C. Vondrick. Vipergpt: Visual inference via python execution for reasoning. ICCV, pages 11854–11864, 2023. URL https://api.semanticscholar.org/CorpusID:257505358.
Wang et al. [2023] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. J. Fan, and A. Anandkumar. Voyager: An open-ended embodied agent with large language models. ArXiv, abs/2305.16291, 2023. URL https://api.semanticscholar.org/CorpusID:258887849.
Wang et al. [2024] X. Wang, Y. Chen, L. Yuan, Y. Zhang, Y. Li, H. Peng, and H. Ji. Executable code actions elicit better LLM agents. CoRR, abs/2402.01030, 2024. doi: 10.48550/ARXIV.2402.01030. URL https://doi.org/10.48550/arXiv.2402.01030.
Wei et al. [2022] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html.
Wolf et al. [2020] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, Oct. 2020. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2020.emnlp-demos.6.
Wu et al. [2023] Q. Wu, G. Bansal, J. Zhang, Y. Wu, S. Zhang, E. Zhu, B. Li, L. Jiang, X. Zhang, and C. Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. ArXiv, abs/2308.08155, 2023. URL https://api.semanticscholar.org/CorpusID:260925901.
Yang et al. [2023] Z. Yang, L. Li, J. Wang, K. Lin, E. Azarnasab, F. Ahmed, Z. Liu, C. Liu, M. Zeng, and L. Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. ArXiv, abs/2303.11381, 2023. URL https://api.semanticscholar.org/CorpusID:257637012.
Yao et al. [2023] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=WE_vluYUL-X.
Zhang et al. [2023] J. Zhang, R. Krishna, A. H. Awadallah, and C. Wang. Ecoassistant: Using llm assistant more affordably and accurately. ArXiv, abs/2310.03046, 2023. URL https://api.semanticscholar.org/CorpusID:263671677.
Zhou et al. [2023] S. Zhou, U. Alon, F. F. Xu, Z. Wang, Z. Jiang, and G. Neubig. Docprompting: Generating code by retrieving the docs. In ICLR, Kigali, Rwanda, May 2023. URL https://arxiv.longhoe.net/abs/2207.05987.

Appendices

This appendices contain the following:

•

Experiment details for m&m’s, M³ToolEval, and API-Bank (App. A)
•

Compute requirements (App. B)
•

Library and tool descriptions (App. C)
•

A full retrieval response that shows code, automatically generated summary docstrings, and prototypes (App. D)
•

Limitations of CodeNav (App. E)
•

Societal impact (App. F)

Please see our project website, https://codenav.allenai.org/, for:

•

Full episode trajectories for the 3 case studies.
•

An example run of CodeNav on m&m’s. This contains an HTML file that can be opened in a browser to view the programs generated by CodeNav along with ground truth programs and links to trajectories.
•

Similarly, an example runs of CodeNav on M³ToolEval with HTML visualizations.
•

A side-by-side comparison of library and tool descriptions for the m&m’s and M³ToolEval codebases.

Table 6: Codebase statistics.

Codebase	files	snippets	lines	characters	functions	classes
m&m’s	2	58	971	34277	39	0
M³ToolEval	8	39	485	14009	21	2
API-Bank	54	163	6144	207656	1	53
CodeNav	36	369	4055	141636	58	46
PhiData	426	2549	49731	2169989	133	398
\mintinlinepythontransformers	3475	50508	1362242	63105978	4354	11424

Appendix A Experiment details

For each of the 3 tool-use benchmarks used in CodeNav evaluation, we now provide details on how the benchmark was adapted for code-use as well as metrics used.

A.1 m&m’s

A.1.1 Adaptation for code-use

All tools in m&m’s are implemented in two files; a single python file with tool functions and a config file. Since functions in m&m’s neither contain type hints nor are annotated with inline comments or docstrings, to provide minimal context, we add the tool description provided in the original m&m’s benchmark as a single-line docstring to each function. For instance, for the function \mintinlinepythoncolor_pop, we add It takes an image and one or multiple objects, and returns an image where only the object is colored and the rest is black and white. Note that we provide no additional information about input arguments or outputs. Further, m&m’s evaluation expects a sequence of function calls as outputs, we enable a ‘code_summary‘ action in the agent to get the code solution in the desired format. To comply with the evaluation we provide the following guidelines:

When solving tasks YOU MUST RESPECT the following guidelines:
1. Do not implement any new functions. Just use the available functions.
2. Generally, try searching for function names. Only if needed, include
function argument names. Do not include argument values.
3. When you have a solution, use the code_summary action to summarize
the solution.
4. When asked to generate text, don’t generate text yourself but rather
see if there is a function to do it.
5. Tasks typically require 1 to 3 function calls.

A.1.2 Metrics

We use the macro-averaged tool-f1 metric from the original m&m’s paper (Ma et al., 2024) which is the harmonic mean of the precision and recall of the function names in reference to the ground truth program. When multiple correct ground truths are available we used the best match. Since many queries in m&m’s do not have a single correct answer, we do not use answer correctness as a metric. Similarly, we find that there are many ways of specifying the arguments while generating free-form code for m&m’s queries and hence find argument name and value based metrics unreliable for evaluating free-form code generation agents like CodeNav.

A.1.3 Data split

m&m’s consists of approximately 800 samples. Since evaluating CodeNav on the full set using GPT-4 as the LLM could cost between $150 to $200 (and we run each experiment thrice to compute error bars), we randomly sample a smaller set of 200 samples for our evaluation.

A.2 M³ToolEval

A.2.1 Adaptation for code-use

To prepare M³ToolEval for code-use, we begin by creating a codebase consisting of just the tool implementations and associated data (web page data used by their web browsing tools as well as flight, hotel, and location information used by the travel planning tools). Particularly for web browsing, while M³ToolEval registers methods of the WebBrowser class as individual tools, for code-use we let the agent use the class directly. Further, since the web browsing task uses unrealistic pages with strong assumptions about how the pages are formatted, we provide some context to the agent for the web browsing task as guidelines. Finally, tools in M³ToolEval are grouped by task (e.g. all tools for DNA sequencing are in the same file), and tasks only use tools from a single file. Therefore, to let the agent make use of this assumption, we provided file paths in library/tool description and we ask the agent to identify and specify the relevant file name in its search queries to zone in on required tools. Here are the guidelines we use:

When solving tasks YOU MUST RESPECT the following guidelines:
1. For browsing tasks use the WebBrowser object and navigate the web pages
using available methods to find what you are looking for. Sometimes the
relevant information may not be visible on the page but if you see
[Viewing page m of n] (where m < n) then you may use the scroll functions
to see more page content. If you see the information you need displayed on
the web page, feel free to use it directly without worrying about parsing
it using code. Do not write complex code. Rather, try to interact with the
browser one action at a time like clicking or scrolling.
2. You may need to identify the relevant python file and specify this target
file in your search queries to get the relevant search results

A.2.2 Metrics

We use the final answer accuracy as used in the original M³ToolEval work (Wang et al., 2024).

A.3 API-Bank

A.3.1 Adaptation for code-use

The API-Bank benchmark with a human-AI “chat” context in mind where an agent and user send messages back and forth with the user asking the agent to, possibly, perform many tasks one after another. The AI agent is then evaluated via a next-step prediction approach where the agent is fed the entire chat context up to time $t$ and required to predict some ground-truth chat message or JSON API call at time $t+1$ . As CodeNav was not designed to be used for back-and-forth chat with a user (neither is it meant to be evaluated on producing natural language responses) we filter all ground-truth interactions in the API-Bank level-1-given-desc set to include only those chats for which the last user message is followed by at least one message from the agent where the agent invokes an API call. After filtering, we are left with 186 (of originally 214) samples. During evaluation, we then give our CodeNav the full chat context up to and including the last user message (and also modify the sandbox to be in the state up to this point) and then evaluate CodeNav’s ability to produce all remaining API calls.

In order for CodeNav to make API calls, we require that it directly instantiate the appropriate API-Bank class and then invoke the call method on that class (e.g., the model might instantiate the AddAgenda class as aa = AddAgenda() and then run aa.call(token, content, time, location) with token, content, time, and location variables it has previously defined. In order to encourage this behavior, we include instructions of the form:

When solving tasks YOU MUST RESPECT the following guidelines:
1. When calling APIs you should instantiate the relevant class and use the
‘call‘ method defined in the class. DO NOT USE INVOKE OTHER METHODS ON
THE CLASS, YOU MUST ONLY CALL THE ‘call‘ METHOD.
2. Everything can be solved by calling APIs, do not define new APIs or
modify the existing ones.

in the library description given to the CodeNav agent.

Note that the above differs substantially from how the agents are traditionally evaluated with API-Bank where they, generally, produce JSON formatted API calls which are routed by API-Bank to the correct class and call method.

A.3.2 Metrics

As noted above, we evaluate CodeNav’s ability to produce the correct remaining API calls given some chat context. As for the m&m’s benchmark evaluation, we only evaluate CodeNav’s ability to call the correct APIs and ignore, for ease of evaluation, whether these APIs were called with the correct arguments or produced the correct results. Supposing that CodeNav called a sequence of APIs $A=\{a_{1},...,a_{n}\}$ and that the ground-truth set of APIs’ called was $G=\{g_{1},...,g_{m}\}$ , we count the number of matches between $A$ and $G$ (counting multiplicities) and compute recall as $R=(\#matches)/|G|$ and precision as $P=(\#matches)/|A|$ (precision is taken to be 0 if $|A|=0$ ). Given this precision and recall, we compute the F1 score as usual as $F1=2\cdot P\cdot R/(R+P)$ with $F1$ being set to 0 if $P+R=0$ as usual.

Appendix B Compute requirements

We run our CodeNav evaluations on Ubuntu servers each with 8 NVIDIA RTX A6000 GPUs. As mentioned previously, we do not train any models and make use of the OpenAI and together.ai APIs to perform inference using LLMs. This means we do not require (local) GPUs for LLM inference but there are many instances when CodeNav may benefit from having access to a GPU (e.g., for image inpainting). While inference time varies per benchmark, running m&m’s evaluations (200 queries) with 24 parallel processes on a single 8 GPU server takes approximately ${\sim}$ 12 minutes in wall clock time (96 minutes of GPU time).

Appendix C Library and Tool Descriptions

Here we provide the library and tool descriptions used in our case studies as well as quantitative evaluation -

•

Figures 5, 6, 7 shows the library descriptions used by CodeNav in the three case studies.
•

Figures 6, 8, 9 show the library descriptions used by CodeNav for quantiative evaluation on m&m’s, M³ToolEval, and API-Bank respectively.
•

Figures 10, 11, 12 show the tool descriptions used for the tool-use baselines for m&m’s, M³ToolEval, and API-Bank respectively. Note that the tool descriptions are significantly more detailed than the corresponding library descriptions.

Appendix D Retrieval Response Example

We show a full example of a retrieval response in Fig. 2. This corresponds to an expanded version of R5 in Fig. 2 where we only showed one response for brevity. Notice: (1) the collection of class and function prototypes/signatures shown at the bottom of the retrieval response and (2) that the 3 expanded retrieved results contain a mix of full code (for the EsCodeRetriever and NumpyJSONEncoder classes) as well as automatically generated summary docstrings (for the RetrievalResult class).

Appendix E Limitations

While CodeNav is capable of some producing impressive results, we highlight three key limitations. (1) Our current implementation of CodeNav assumes that the agent produces Python code (and that the given codebase is a Python codebase). While extending to use other languages is not a significant engineering challenge, it is possible that LLMs’ performance degrades as one moves away from popular languages like Python, for which there is significant data on the web. (2) Our agent has no long-term memory that is available across queries. This means that, given the same query twice, CodeNav will make the same errors and repeat the same searches. (3) Finally, CodeNav performance is strongly dependent on the underlying LLM and performance can sharply degrade when using smaller LLMs. This means that most researchers can only use CodeNav through the use of paid APIs and larger-scale experimentation when using these APIs can be costly.

Appendix F Societal impact

While CodeNav is not unique in this respect, the growing popularity of increasingly competent LLM agents has the potential to automate or augment many skilled tasks. While automation has brought about many societal positives on aggregate, it can clearly has a profoundly negative impact on anyone whose may lose their job. The environmental impact (via the huge energy costs of running these LLMs) may also be significant.

As a more acute potential negative impact: in the code-use paradigm, the underlying agent the CodeNav agent has the ability to run arbitrary code on the user’s machine. Given this, it can be dangerous unintentionally (there is nothing stop** CodeNav from making a logic error, e.g. filtering a file list incorrectly and then accidentally deleting all files on the computer) and intentionally (e.g., a malicious party might upload tainted model weights, or intercept API calls, so as to run a “randomsome ware” scam when executed by a CodeNav agent). These dangers are somewhat easier to mitigate in the tool-use paradigm as, when constrained to use only certain tools, there are fundamentally fewer attack vectors to consider.

Figure 5: Library description for CodeNav case study in Sec. 4.1

Figure 6: Library description for m&m’s as well as the multimodal case study in Sec. 4.2

Figure 7: Library description for PhiData case study in Sec. 4.3

Figure 8: Library description for M³ToolEval.

Figure 9: Library description for API-Bank.

Figure 10: Tool description for m&m’s.

Figure 11: Tool description for M³ToolEval.

Figure 12: Tool description for API-Bank. We only show the beginning and end of the full description for brevity.

CodeNav: Beyond tool-use to using real-world codebases with LLM agents

Abstract

1 Introduction

2 Related Work

3 CodeNav

3.1 Overview

3.2 Environments

3.3 Agent Actions

3.4 Environment Responses

4 Case Studies

4.1 CodeNav on CodeNav

4.2 Multimodal Processing and Reasoning

4.3 Research assistant

5 Experiments

5.1 How does code-use compare to tool-use on tool-use benchmarks?

5.2 Is a library description sufficient for tool-use?

5.3 Does seeing the source code help code-use?

5.4 What makes a good library description?

5.5 Impact of LLM choice on performance

6 Discussion

References

Appendices

Appendix A Experiment details

A.1 m&m’s

A.1.1 Adaptation for code-use

A.1.2 Metrics

A.1.3 Data split

A.2 M3ToolEval

A.2.1 Adaptation for code-use

A.2.2 Metrics

A.3 API-Bank

A.3.1 Adaptation for code-use

A.3.2 Metrics

Appendix B Compute requirements

Appendix C Library and Tool Descriptions

Appendix D Retrieval Response Example

Appendix E Limitations

Appendix F Societal impact

A.2 M³ToolEval