LLM-Assist: Enhancing Closed-Loop Planning with Language-Based Reasoning

S P Sharan$^{1}$, Francesco Pittaluga$^{2}$, Vijay Kumar B G$^{2}$, Manmohan Chandraker$^{2,3}$

$^{1}$UT Austin, $^{2}$NEC Labs America, $^{3}$UC San Diego

Abstract

Although planning is a crucial component of the autonomous driving stack, researchers have yet to develop robust planning algorithms that are capable of safely handling the diverse range of possible driving scenarios. Learning-based planners suffer from overfitting and poor long-tail performance [37]. On the other hand, rule-based planners generalize well, but might fail to handle scenarios that require complex driving maneuvers [10]. To address these limitations, we investigate the possibility of leveraging the common-sense reasoning capabilities of Large Language Models (LLMs) such as GPT4 [19] and Llama2 [28] to generate plans for self-driving vehicles. In particular, we develop a novel hybrid planner that leverages a conventional rule-based planner in conjunction with an LLM-based planner. Guided by the common-sense reasoning abilities of LLMs, our approach navigates complex scenarios that existing planners struggle with and produces well-reasoned outputs, while remaining grounded by working alongside the rule-based planner. Through extensive evaluation on the nuPlan benchmark, we achieve state-of-the-art performance, outperforming all existing pure learning- and rule-based methods across most metrics. Our code will be available at https://llmassist.github.io.

1 Introduction

In recent years, aided by advances in deep learning, novel sensing technologies, and low-cost graphics processing units (GPUs), self-driving vehicles have taken major leaps forward. We’ve even witnessed the deployment of fully self-driving taxi services in limited areas of certain cities. That said, developing planning algorithms for self-driving vehicles capable of handling all the complexities of driving in fully unconstrained environments remains a significant challenge.

While deep learning has had major impacts on the perception and prediction components of the self-driving stack, it has yet to have a major impact on closed-loop planning. This is evidenced by the fact that a rule-based planning algorithm [10] won the nuPlan benchmark competition at CVPR 2023 [5] for both the closed-loop non-reactive and reactive settings. A possible challenge for learning-based planners might be that their training in an open-loop setting fails to generalize to a closed-loop setting, while training in a closed-loop setting fails to converge to reasonable solutions. On the other hand, while rule-based planners can succeed in most settings, they are not scalable, as it’s not possible to enumerate all possible driving scenarios.

The question we seek to answer in this paper is: Can we leverage the common-sense reasoning of LLMs to overcome the limitations of existing learning- and rule-based planners? We answer this in the affirmative, through the key insight that judicious use of an LLM can supplement an existing base planner to perform well in conditions where it might otherwise suffer. Our base planner, PDM-Closed, achieved the previous SOTA on the nuPlan benchmark through an intelligent driver model-based approach that controls centerline offsets and target speeds. First, we define conditions, based on the scored proposals of the base planner, under which the LLM is automatically invoked. Next, we consider an unconstrained LLM planner that must directly return a safe future trajectory, which works surprisingly well but falls short of the base planner in closed-loop evaluation. Finally, we allow the LLM to define the planner parameters to safely navigate a scenario, which we find can significantly overcome deficiencies in the base planner. Distinct from prior works that directly use LLMs for planning, we believe this is the first demonstration of an LLM controlling an existing planner to supplement it.

Figure 1: Architecture of LLM-Assist. We propose a novel hybrid planning approach that leverages a SoTA rule-based planner, PDM-Closed, for common scenarios and a novel LLM-based planner for challenging, high-uncertainty scenarios. When the PDM-generated scores are deemed insufficient (falling below predetermined thresholds for various metrics such as collision risk and passenger comfort), we invoke the LLM-Planner.

In extensive experiments on the nuPlan benchmark, our LLM-assisted planner achieves SOTA performance in both the challenging closed-loop reactive and non-reactive settings. We demonstrate several qualitative examples where the LLM-assisted planner can perform non-trivial maneuvers to navigate complex scenarios where the base planner does not succeed or achieves suboptimal safety, efficiency or comfort outcomes. Importantly, the use of an LLM also allows obtaining reasoned outputs for the planner behavior, where our approach of controlling physically meaningful planning variables leads to well-grounded reasoning.

Our contributions can be summarized as follows:

• A strategy to invoke an LLM based on the scores of a base planner, which allows us to judiciously exploit the strengths of both the rule-based planner and the LLM.

• An unconstrained hybrid rule- and LLM-based planner that processes text-based scene descriptions as input to directly generate a navigational plan for the ego vehicle as 2D coordinates.

• A hybrid planner that processes text-based scene descriptions to provide parameters for a base planner, PDM-Closed, to plan a safe trajectory for the ego car.

2 Related Work

2.1 Planning for Autonomous Driving

Rule-based planning involves the use of explicit rules to guide the decision-making process in autonomous vehicles [26, 2, 16, 30, 7, 11]. It provides a structured and easily understandable framework for determining how the vehicle should behave in different situations. One well-known example of rule-based planning is the Intelligent Driver Model (IDM) [29], which is designed to help vehicles follow a leading vehicle while maintaining a safe following distance. PDM-Closed [10], an extension of IDM, which executes multiple policies with different settings and evaluates them to choose the best course of action, recently won the nuPlan benchmark challenge [5].

In addition to pure rule-based planning, previous research has explored hybrid approaches that combine rule-based decision-making with machine learning components [10, 15, 13, 8, 22, 25, 12, 32, 23, 7, 9, 36, 6]. These hybrid planners often involve forecasting future environmental conditions, which allows for informed and adaptable driving decisions. This forecasting can take different forms, such as agent-centric, where trajectories are predicted for each vehicle in the environment, or environment-centric, which involves occupancy or cost maps. Furthermore, the forecasting can be influenced by the ego vehicle’s plan, taking into account how the ego vehicle’s actions affect the future of the scene.

2.2 Large-Language Models

Large Language Models (LLMs) like GPT [4], its successors GPT-3 and GPT4 [19], and their open-source counterparts Llama [27] and Llama2 [28] are artificial intelligence models designed to understand, generate, and manipulate human language. Built on deep neural networks, these models are trained on very large text datasets. This training enables them to grasp the intricacies of language, encompassing grammar, syntax, and semantics. InstructGPT [20], a specialized version of OpenAI’s GPT models, is fine-tuned for superior adherence to user instructions, offering more precise and context-relevant responses across various applications like content creation and information retrieval. “Chain-of-Thought” reasoning [33] introduces a prompting technique for large language models that enhances their complex reasoning capabilities through a series of intermediate reasoning steps, significantly improving performance on tasks involving arithmetic, commonsense, and symbolic reasoning. “ReAct” [34] introduces a paradigm that enhances the capabilities of large language models in complex tasks requiring both reasoning and decision-making, by prompting these models to generate verbal reasoning traces and actions in an interleaved manner, enabling dynamic reasoning and interaction with external environments.

2.3 Planning with Large-Language Models

Given their common-sense reasoning capabilities, there have been some efforts to leverage LLMs for planning tasks in robotics. A recent approach [1] allows robots to understand and execute high-level textual instructions for physically grounded tasks, merging pre-trained skills with language model insights to ensure feasible and contextually relevant actions. Another work [14] proposed the use of environment feedback to form an inner monologue, enhancing planning and interaction in robotics by integrating perception models and pre-trained skills for improved completion of complex, long-horizon tasks. Similarly, [24] showed that enhancing LLMs with physical grounding allows them to generate and update plans that are contextually relevant to the current environment.

Figure 2: System Prompt for $\textsc{LLM-Assist}_{\textsc{unc}}$.

Figure 3: System Prompt for $\textsc{LLM-Assist}_{\textsc{par}}$.

3 Method

We propose a novel hybrid planning approach that leverages a SoTA rule-based planner, PDM-Closed, for common scenarios and a novel LLM-based planner, for challenging high uncertainty scenarios. The PDM algorithm is integral to our method, generating 15 trajectory proposals at each planning stage, each characterized by varying velocities and center-lane offsets. These proposals are subsequently assessed using an internal simulator, which applies metrics analogous to those used in the nuPlan Challenge. Our methodology builds upon this framework. When the PDM-generated scores are deemed insufficient—falling below predetermined thresholds for various metrics such as collision risk and passenger comfort—we invoke the LLM-Planner. The LLM-Planner processes the current scenario’s observations, including vehicular positioning, traffic light statuses, and lane information, along with the PDM-generated trajectories and their corresponding scores. We propose two LLM-based planners. The first, $\textsc{LLM-Assist}_{\textsc{unc}}$ considers the most unconstrained version of the planning problem, in which the LLM must directly return a safe future trajectory for the ego car. The second, $\textsc{LLM-Assist}_{\textsc{par}}$ considers a parameterized version of the planning problem, in which the LLM must only return a set of parameters for a rule-based planner, PDM-Closed [10].

3.1 Base Planner

The strong performances of IDM [29] and PDM-Closed [10] on the closed-loop nuPlan evaluations demonstrate that rule-based planners are capable of successfully maneuvering the vast majority of driving scenarios. As such, we propose using a rule-based planner as a base planner and only invoking the LLM-based planner for challenging scenarios that the rule-based planner cannot solve. The challenge, however, is how to identify which scenarios the rule-based planner cannot solve. For this, we leverage a simple constant-velocity real-time simulator to score the proposals from our rule-based planner and invoke the LLM only when the score falls below a predetermined threshold.
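The invocation rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `ScoredProposal` type, the `should_invoke_llm` helper, and the use of a single aggregate score in [0, 1] are all assumptions made for the example, and the threshold value is hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ScoredProposal:
    """One base-planner proposal with its simulated score (assumed in [0, 1])."""
    proposal_id: int
    score: float


def should_invoke_llm(proposals: Sequence[ScoredProposal], threshold: float) -> bool:
    """Return True when even the best-scoring proposal falls below the threshold."""
    if not proposals:
        # No usable proposal at all: treat the time step as a hard case.
        return True
    best = max(p.score for p in proposals)
    return best < threshold


# Hypothetical scores: the first scene has a good proposal, the second does not.
easy = [ScoredProposal(0, 0.95), ScoredProposal(1, 0.40)]
hard = [ScoredProposal(0, 0.55), ScoredProposal(1, 0.40)]
print(should_invoke_llm(easy, threshold=0.8), should_invoke_llm(hard, threshold=0.8))
```

Gating on the best proposal (rather than the average) reflects the idea that the base planner only needs one acceptable option to proceed without the LLM.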

# Proposals	Score	Collisions	TTC	Drivable	Comfort	Progress	Speed Limit	Direction
15	92.51	98.05	93.11	99.55	95.19	91.75	99.83	99.95
8505	77.78	91.92	62.89	98.64	78.68	95.60	99.78	99.36

Table 1: Ablation of Number of PDMClosed Proposals. PDMClosed evaluated on the nuPlan Closed-Loop Non-Reactive Challenge on the Val14 split. PDMClosed fails to select the best proposal when presented with too many options, as it relies on a constant-velocity simulator.

Challenge	Method	Score	Collisions	TTC	Drivable	Comfort	Progress	Speed Limit	Direction
Closed-Loop Non-Reactive	PDMClosed	92.51	98.05	93.11	99.55	95.19	91.75	99.83	99.95
	$\textsc{GPT-3-Assist}_{\textsc{unc}}$	90.11	96.19	92.55	98.91	93.37	91.05	99.83	99.91
	$\textsc{GPT-3-Assist}_{\textsc{par}}$	93.05	98.31	93.69	99.54	95.61	92.16	99.83	99.95
Closed-Loop Reactive	PDMClosed	91.79	97.91	93.29	99.37	94.65	89.92	99.83	99.95
	$\textsc{GPT-3-Assist}_{\textsc{unc}}$	90.32	96.82	93.10	98.73	92.92	89.01	99.83	99.86
	$\textsc{GPT-3-Assist}_{\textsc{par}}$	92.20	98.18	93.62	99.64	94.72	90.07	99.83	99.95

Table 2: LLM-Assist evaluated on nuPlan Closed-Loop Challenges on Val14 split. $\textsc{GPT-3-Assist}_{\textsc{par}}$ achieves SoTA performance on almost all metrics on both closed-loop challenges, reducing the number of dangerous driving scenarios by 11%.

For both the base planner and the real-time simulator, we build on PDM-Closed [10]. PDM-Closed [10] is a rule-based planner that generates 15 trajectory proposals for the ego vehicle at each time step. For each proposal, a constant velocity simulation that considers all agents within a 50-meter radius is carried out and the top scoring proposal, according to the nuPlan Challenge’s metrics [5], is selected. If the top scoring proposal is set to collide within 2 seconds, an emergency brake function is triggered. The 8 metrics considered in the nuPlan Challenge are no_ego_at_fault_collisions, drivable_area_compliance, ego_is_making_progress, driving_direction_compliance, time_to_collision_within_bound, speed_limit_compliance, ego_progress_along_expert_route, ego_is_comfortable.

The 15 trajectories are generated by leveraging the intelligent driver model (IDM) and considering all possible combinations of 2 hyperparameters: centerline offset $o=\{-1,0,1\}$ meters and target speed $v_{0}=\{20\%,40\%,60\%,80\%,100\%\}$ of the designated speed limit. Specifically, given a centerline offset $o$ and a current velocity $v$ , the longitudinal acceleration for a proposal is generated via the intelligent driver model (IDM)

$$\frac{dv}{dt}=a\left(1-\left(\frac{v}{v_{0}}\right)^{\delta}-\left(\frac{s^{*}}{s}\right)^{2}\right), \tag{1}$$

where $a$ denotes the acceleration limit, $v_{0}$ the target speed, $s$ the distance to the lead vehicle, $s^{*}$ the safety margin, and $\delta$ the exponent (or jerk).
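As a rough sketch of the two ingredients just described, the snippet below enumerates the 3 x 5 grid of (offset, target speed) pairs and evaluates Eq. (1) for a single proposal. The function names are hypothetical, and the default $\delta = 4$ is only a common IDM choice assumed for illustration; the values used by PDM-Closed are not stated here.

```python
from itertools import product
from typing import List, Tuple

OFFSETS_M = (-1.0, 0.0, 1.0)                 # centerline offsets o, in meters
SPEED_FRACTIONS = (0.2, 0.4, 0.6, 0.8, 1.0)  # target speed as a fraction of the limit


def proposal_grid(speed_limit: float) -> List[Tuple[float, float]]:
    """All (offset, target speed v0) pairs: 3 offsets x 5 speeds = 15 proposals."""
    return [(o, f * speed_limit) for o, f in product(OFFSETS_M, SPEED_FRACTIONS)]


def idm_acceleration(v: float, v0: float, s: float, s_star: float,
                     a: float, delta: float = 4.0) -> float:
    """Eq. (1): dv/dt = a * (1 - (v / v0)**delta - (s_star / s)**2).

    delta=4.0 is an assumed default, not a value taken from the paper.
    """
    if v0 <= 0.0 or s <= 0.0:
        raise ValueError("v0 and s must be positive")
    return a * (1.0 - (v / v0) ** delta - (s_star / s) ** 2)


grid = proposal_grid(speed_limit=15.0)
print(len(grid))
# Ego at half its target speed, lead vehicle exactly at the safety margin.
print(idm_acceleration(v=5.0, v0=10.0, s=2.0, s_star=2.0, a=1.5))
```

Note how the two penalty terms act independently: the first drives the acceleration to zero as $v$ approaches $v_{0}$ on a free road, while the second produces braking once the gap $s$ shrinks toward $s^{*}$.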

3.2 $\textsc{LLM-Assist}_{\textsc{unc}}$

$\textsc{LLM-Assist}_{\textsc{unc}}$ considers the most unconstrained version of the planning problem, in which the LLM must directly return a safe future trajectory for the ego car. For this task, we provide the system prompt shown in Figure 2. The system prompt consists of a task overview, a description of the state of the scene and task requirements composed of generating a trajectory and a rationale. We access the assets provided by the nuPlan API to input the lane geometry and states of all agents, including the ego-vehicle.

3.3 $\textsc{LLM-Assist}_{\textsc{par}}$

$\textsc{LLM-Assist}_{\textsc{par}}$ considers a parameterized version of the planning problem. Rather than directly returning a future trajectory for the ego car, the LLM instead returns a set of parameters to be used by the base planner, PDM-Closed, to plan a safe trajectory for the ego car. For this task, we provide the system prompt shown in Figure 3 at each time step. As for $\textsc{LLM-Assist}_{\textsc{unc}}$, at each time step we provide scene information including the history of vehicle, pedestrian, and object positions, headings, and speeds, and their current lane ID. The task of the LLM is to return valid values for the following parameters:

1. lateral_offsets: Ego offset relative to lane center.

2. speed_limit_fraction: Speed-limit fraction in free traffic.

3. fallback_target_velocity: Fallback speed in free traffic.

4. min_gap_to_lead_agent: Min distance to lead car.

5. headway_time: Min time to the lead car.

6. accel_max: Max acceleration.

7. decel_max: Max deceleration.
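One way to consume such a reply is to parse it into a typed record and reject incomplete or out-of-range answers before they reach the planner. The sketch below assumes the LLM replies with a flat JSON object keyed by the seven parameter names, and that each parameter is a single number; both are assumptions for illustration, as is the range check on `speed_limit_fraction`. `PlannerParams` and `parse_planner_params` are hypothetical names.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PlannerParams:
    """The seven parameters listed above (one float each: an assumption)."""
    lateral_offsets: float
    speed_limit_fraction: float
    fallback_target_velocity: float
    min_gap_to_lead_agent: float
    headway_time: float
    accel_max: float
    decel_max: float


def parse_planner_params(reply: str) -> PlannerParams:
    """Parse a JSON reply; raise ValueError if it is malformed or incomplete."""
    data = json.loads(reply)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    names = [f.name for f in fields(PlannerParams)]
    missing = [n for n in names if n not in data]
    if missing:
        raise ValueError(f"missing parameters: {missing}")
    # Extra keys (for example a free-text explanation) are ignored.
    params = PlannerParams(**{n: float(data[n]) for n in names})
    if not 0.0 < params.speed_limit_fraction <= 1.0:
        raise ValueError("speed_limit_fraction must lie in (0, 1]")
    return params


reply = json.dumps({
    "lateral_offsets": 0.5, "speed_limit_fraction": 0.6,
    "fallback_target_velocity": 10, "min_gap_to_lead_agent": 2.0,
    "headway_time": 1.5, "accel_max": 1.5, "decel_max": 3.0,
    "explanation": "Slow down and shift right to pass the stopped vehicle.",
})
print(parse_planner_params(reply))
```

Rejecting a malformed reply early lets the caller simply issue another query instead of handing invalid values to the base planner.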

Additionally, the LLM should provide a one sentence explanation for why the specific trajectory was chosen. As illustrated in Fig. 1, LLM-Assist queries the LLM planner multiple times at a given time step until a trajectory is proposed that has a predicted score that meets a predefined threshold or the number of queries per time step exceeds a predefined threshold. If the query threshold is exceeded, the trajectory with the highest predicted score is selected.
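The query loop just described can be sketched as follows. The helper name `query_until_acceptable` and the two callables are hypothetical stand-ins: `propose` represents one LLM query that yields a candidate trajectory, and `predict_score` represents the internal simulator. The sketch also assumes that only LLM-generated candidates compete in the fallback selection.

```python
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


def query_until_acceptable(
    propose: Callable[[], T],
    predict_score: Callable[[T], float],
    score_threshold: float,
    max_queries: int,
) -> Tuple[T, float, int]:
    """Query until a candidate meets the threshold or the query budget runs out.

    Returns (candidate, predicted score, number of queries used). If no
    candidate meets the threshold, the highest-scoring one seen is returned.
    """
    if max_queries < 1:
        raise ValueError("max_queries must be at least 1")
    best: Optional[T] = None
    best_score = float("-inf")
    for n in range(1, max_queries + 1):
        candidate = propose()
        score = predict_score(candidate)
        if score > best_score:
            best, best_score = candidate, score
        if score >= score_threshold:
            return candidate, score, n
    assert best is not None  # guaranteed because max_queries >= 1
    return best, best_score, max_queries


# Toy stand-ins: each "trajectory" is just its own predicted score.
scripted = iter([0.3, 0.7, 0.5])
print(query_until_acceptable(lambda: next(scripted), lambda t: t, 0.9, 3))
```

In the toy run no candidate reaches the threshold of 0.9, so the loop spends its full budget of three queries and falls back to the best candidate seen.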

Challenge	Method	Score	Collisions	TTC	Drivable	Comfort	Progress	Speed Limit	Direction
Closed-Loop Non-Reactive	GPT-3	18.08	63.04	60.14	78.26	57.97	31.30	99.93	98.91
Closed-Loop Non-Reactive	$\textsc{GPT-3-Assist}_{\textsc{par}}$	94.80	100.00	94.89	100.00	97.81	90.18	99.86	99.64
Closed-Loop Reactive	GPT-3	22.33	75.18	73.72	81.02	56.93	31.77	99.96	100.00
Closed-Loop Reactive	$\textsc{GPT-3-Assist}_{\textsc{par}}$	92.82	98.55	97.10	99.28	94.20	89.15	99.86	99.28

Table 3: GPT-3 vs LLM-Assist. Comparison between GPT-3 planner and LLM-Assist on nuPlan Closed-Loop Challenges on a subset of Val14 split consisting of 140 samples. Without fine-tuning, GPT-3 on its own is incapable of directly generating successful plans. This shows the importance of LLM-Assist’s hybrid architecture.

Brake	Score	Collisions	TTC	Drivable	Comfort	Progress	Speed Limit	Direction
$\times$	91.85	97.81	93.99	99.36	92.99	90.40	99.83	99.91
✓	92.16	97.96	94.10	99.46	92.82	90.34	99.83	99.91

Table 4: Ablation Study of Emergency Brake. $\textsc{GPT-3-Assist}_{\textsc{par}}$ evaluated on nuPlan Closed-Loop Reactive Challenge on Val14 split. Enabling the LLM to invoke an emergency brake leads to improved performance, as it can avoid potential collisions.

Temp	Score	Collisions	TTC	Drivable	Comfort	Progress	Speed Limit	Direction
0.2	91.91	98.00	94.28	99.27	92.29	90.31	99.83	99.95
0.6	92.01	98.00	93.90	99.54	93.35	90.16	99.83	99.82
1.0	92.16	97.96	94.10	99.46	92.82	90.34	99.83	99.91
1.2	92.21	98.05	93.82	99.64	93.45	90.20	99.83	99.91
1.4	92.24	98.02	94.01	99.35	93.82	90.71	99.81	99.91
1.6	92.14	98.04	93.72	99.36	94.90	90.11	99.83	99.95
2.0	92.05	98.00	93.63	99.27	94.54	90.19	99.83	99.95

Table 5: Ablation Study of GPT-3 Temperature. $\textsc{GPT-3-Assist}_{\textsc{par}}$ evaluated on nuPlan Closed-Loop Reactive Challenge on Val14 split. LLM-Assist achieves the best performance with a GPT-3 temperature of 1.4, showing that greater flexibility may allow for better planning.

Architecture	Score	Collisions	TTC	Drivable	Comfort	Progress	Speed Limit	Direction
$\textsc{GPT-3-Assist}_{\textsc{unc}}$	90.32	96.82	93.10	98.73	92.92	89.01	99.83	99.86
$\textsc{GPT-4-Assist}_{\textsc{unc}}$	90.46	96.68	93.19	98.73	94.55	89.32	99.83	99.86
$\textsc{GPT-3-Assist}_{\textsc{par}}$	92.16	97.96	94.10	99.46	92.82	90.34	99.83	99.91
$\textsc{GPT-4-Assist}_{\textsc{par}}$	91.11	97.91	93.74	99.36	95.55	88.94	99.83	99.77

Table 6: Ablation Study of LLM Architecture. LLM-Assist evaluated on nuPlan Closed-Loop Reactive Challenge on Val14 split. GPT-3 and GPT-4 achieve comparable performance when used as the LLM for LLM-Assist.

4 Experimental Setup

nuPlan Benchmark.

The nuPlan benchmark [5] is the world’s first large-scale planning benchmark for autonomous driving. In addition to releasing 1200 hours of annotated human driving data from 3 cities across the US and Asia, the benchmark outlines three planning challenges: open-loop, closed-loop non-reactive, and closed-loop reactive. In this paper, we focus on the two closed-loop challenges, as they more accurately reflect real-world driving [10]. We follow [10] and consider the val14 subset of the nuPlan benchmark [5]. It consists of up to $100$ scenarios for each of $14$ scenario types, totaling 1,114 scenarios.

Metrics.

Each column of Table 2, except for the first, shows a different binary metric for some aspect of driving. The scores represent the percentage of scenarios that the vehicle navigated without violating the given constraint. Please see the supplementary material for the full details of each metric.
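Read this way, each table entry is a pass rate over scenarios. A minimal sketch, using a hypothetical helper name and made-up outcomes rather than benchmark data:

```python
from typing import Sequence


def metric_percentage(passed: Sequence[bool]) -> float:
    """Percentage of scenarios in which the binary constraint was satisfied."""
    if not passed:
        raise ValueError("need at least one scenario")
    return 100.0 * sum(passed) / len(passed)


# Four made-up scenarios, one of which violates the constraint.
print(metric_percentage([True, True, False, True]))
```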

Baseline Method.

The score achieved by the base planner, i.e., PDM-Closed, is shown in Table 2. As we can observe from Table 2, PDM-Closed is capable of successfully navigating the vast majority of scenarios in the nuPlan dataset. Our goal in this paper is to investigate how to integrate an LLM-Planner into the mix to handle the safety-critical scenarios that PDM-Closed fails at.

5 Results

5.1 Hyperparameter Search

The base planner, i.e., PDM-Closed [10], creates 15 trajectory proposals per time step for the ego vehicle, comprising combinations of 5 speed limit fractions and 3 lateral offsets. These trajectories are then evaluated using a constant velocity model. The choice of 15 proposals was to reduce the computational overhead for subsequent stages. However, to test whether a broader range of hyperparameters could enhance the planner’s performance, we conducted an experiment using varied hyperparameters: 7 lateral offsets, 3 fallback target velocities, 5 speed limit fractions, 3 minimum gaps to the lead agent, 3 headway times, 3 maximum accelerations, and 3 maximum decelerations, which resulted in 8505 trajectory proposals at each time step. The results in Table 1 indicate that this extensive search for optimal hyperparameters diminished the effectiveness of the PDM planner. We hypothesize that this decline in performance could be linked to the reliance on constant velocity assumptions in the trajectory evaluation process. This suggests that a larger hyperparameter search space does not lead to performance improvement for the base planner. In fact, it harms performance while also worsening planning speed.

5.2 Prediction

We evaluate the efficacy of the base planner’s internal simulator at predicting whether a given trajectory will score highly according to the nuPlan benchmark, which we use as a proxy for successfully navigating a scene. We validate the accuracy of these predicted scores on the val14 subset [10] of nuPlan closed-loop reactive benchmark [5] by comparing the true average score achieved by PDM-Closed for each scenario to the respective predicted scores. Note, however, since we are interested in identifying the exact time step where the LLM should be invoked, we compare the minimum predicted score over all 150 time steps in a scenario to the ground truth score for the whole scenario. In Figure 4, we show ROC curves for predicting whether the ground-truth score is less than a given threshold. The results show that PDM-Closed’s internal simulation can relatively reliably predict when it will score poorly on a given scenario. This motivates the design of our combined rule-based and LLM-based planner. If the LLM-based planner can succeed when PDM-Closed predicts that it will score poorly, the result will be a superior planner capable of handling a much wider range of driving scenarios.
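The evaluation above can be sketched as a small ROC computation. The assumptions here are ours: a scenario counts as a positive when its ground-truth score is below `gt_threshold`, and it is flagged when its minimum predicted score is at or below a sweeping cutoff. The function names and toy numbers are hypothetical and are not results from the paper.

```python
from typing import List, Sequence, Tuple


def min_predicted_score(per_step_scores: Sequence[float]) -> float:
    """Worst predicted score over all time steps of one scenario."""
    return min(per_step_scores)


def roc_points(min_pred: Sequence[float], true_scores: Sequence[float],
               gt_threshold: float) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) pairs.

    Positive class: ground-truth score below gt_threshold.
    A scenario is flagged when its minimum predicted score <= cutoff.
    """
    labels = [t < gt_threshold for t in true_scores]
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative scenario")
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(min_pred)):
        flagged = [p <= cutoff for p in min_pred]
        tp = sum(f and l for f, l in zip(flagged, labels))
        fp = sum(f and not l for f, l in zip(flagged, labels))
        points.append((fp / n_neg, tp / n_pos))
    return points


# Toy data: four scenarios, the first of which truly scores poorly.
per_scenario_steps = [[0.9, 0.2, 0.8], [0.95, 0.9], [0.4, 0.7], [0.8, 0.85]]
min_pred = [min_predicted_score(s) for s in per_scenario_steps]
true_scores = [0.3, 0.95, 0.9, 0.85]
print(roc_points(min_pred, true_scores, gt_threshold=0.5))
```

Taking the minimum over time steps makes the predictor sensitive to a single bad moment in a scenario, which matches the goal of finding the time step at which the LLM should be invoked.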

5.3 LLM-Assist Evaluation

As shown in Table 2, $\textsc{GPT-3-Assist}_{\textsc{par}}$ achieves SoTA performance on almost all metrics across both nuPlan Closed-Loop Challenges [5] on the Val14 Split [10]. Regarding safety, $\textsc{GPT-3-Assist}_{\textsc{par}}$ reduces the number of dangerous driving events by 11% relative to PDMClosed, the current SoTA. We define a dangerous driving event as any scenario in which the ego vehicle is involved in an actual or near collision, drives off the road, exceeds a safe limit for acceleration or jerk, or drives in the wrong direction. Comparing $\textsc{GPT-3-Assist}_{\textsc{unc}}$ to $\textsc{GPT-3-Assist}_{\textsc{par}}$ highlights the importance of our proposed approach: having the LLM select parameters for a base planner as opposed to directly generating a trajectory for the ego vehicle. For $\textsc{GPT-3-Assist}_{\textsc{par}}$, the LLM was invoked a maximum of four times per time step.

In Figure 5, we show qualitative results where $\textsc{GPT-3-Assist}_{\textsc{par}}$ performs complex maneuvers to successfully handle safety-critical scenarios which PDMClosed fails at. In Figure 6, we show interesting reasoning outputs from $\textsc{GPT-3-Assist}_{\textsc{par}}$. Note how the reasoning results are grounded by the base planner’s parameters. Additional reasoning results are shown in Figures 7, 8, and 9 of the supplementary material.

5.4 Ablations

For all ablation studies, the LLM was invoked a maximum of one time per time step.

Importance of Base Planner.

We evaluate a purely LLM-based planner (GPT-3) by providing the system prompt in Figure 2 at each time step in addition to scene information such as the history of vehicle, pedestrian, and object positions, headings, and speeds, and their current lane ID. The LLM needs to provide an 8-second trajectory for the ego car by generating 4 waypoints at 2-second intervals. It’s important to note the distinction between this GPT-3 planner and the $\textsc{LLM-Assist}_{\textsc{unc}}$ approach outlined in Section 3.2: unlike $\textsc{LLM-Assist}_{\textsc{unc}}$, which is invoked only at selected time steps, this planner is invoked at every time step. From Table 3, we see that the LLM-based planner (GPT-3) achieves reasonable scores of roughly $75\%$, $73\%$, and $81\%$ on the Collisions, TTC, and Drivable metrics, respectively, in the closed-loop reactive setting. Notably, GPT-3 achieved these scores without explicit training on trajectory data, highlighting its ability to generalize and adapt to driving scenarios. That said, given the low scores relative to $\textsc{GPT-3-Assist}_{\textsc{par}}$, it’s clear why our proposed approach of having the LLM supplement a base planner is a better way to leverage the reasoning capabilities of LLMs for planning.
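To make the output format concrete, the sketch below turns 4 waypoints spaced 2 seconds apart into a denser 8-second trajectory. Linear interpolation, the 0.5-second sampling step, and the ego-frame start at the origin are assumptions chosen for illustration; how the sparse waypoints are actually converted for the simulator is not specified here. `densify` is a hypothetical helper.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def densify(waypoints: Sequence[Point], start: Point = (0.0, 0.0),
            interval_s: float = 2.0, dt: float = 0.5) -> List[Tuple[float, float, float]]:
    """Linearly interpolate sparse waypoints into (t, x, y) samples every dt seconds."""
    steps = round(interval_s / dt)
    if steps < 1 or abs(steps * dt - interval_s) > 1e-9:
        raise ValueError("interval_s must be a positive multiple of dt")
    trajectory = []
    prev = start
    for k, wp in enumerate(waypoints):
        for i in range(1, steps + 1):
            alpha = i / steps  # fraction of the way from prev to wp
            t = k * interval_s + i * dt
            x = prev[0] + alpha * (wp[0] - prev[0])
            y = prev[1] + alpha * (wp[1] - prev[1])
            trajectory.append((t, x, y))
        prev = wp
    return trajectory


# Hypothetical LLM output: drive forward, then drift one lane-half to the left.
llm_waypoints = [(4.0, 0.0), (8.0, 0.0), (12.0, 1.0), (16.0, 2.0)]
dense = densify(llm_waypoints)
print(len(dense), dense[0], dense[-1])
```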

Control over Emergency Brake.

We explored the impact of granting control over the “emergency brake” function of PDM-Closed to the LLM. This function, typically activated in anticipation of collisions by PDM’s internal simulator, was managed by the LLM to test its efficacy in critical scenarios. As evidenced by the results in Table 4, this integration improved the overall score from 91.85 to 92.16. The gains come from the Collisions, TTC, and Drivable metrics, while Comfort and Progress decrease slightly, suggesting that the LLM can judiciously decide when to activate such crucial safety functionalities.

Temperature of LLMs.

We vary the temperature of GPT-3 for $\textsc{LLM-Assist}_{\textsc{par}}$, run the nuPlan Closed-Loop Reactive Challenge, and report the results in Table 5. GPT-3 achieves its best overall score at a temperature of 1.4, indicating that allowing the LLM greater freedom may lead to better planning.

LLM Architecture and Timing Analysis.

We perform a comparative analysis of the performance of $\textsc{LLM-Assist}_{\textsc{unc}}$ and $\textsc{LLM-Assist}_{\textsc{par}}$ , utilizing both GPT-3 and GPT-4, with detailed results presented in Table 6. The findings reveal that while both GPT-3 and GPT-4 exhibit comparable effectiveness overall, subtle differences were observed: GPT-4 demonstrates a marginal edge in unconstrained settings, whereas GPT-3 shows slightly better performance in constrained ones. Additionally, we investigate the possibility of using an open-source LLM, Llama2-7B [28], as the LLM for LLM-Assist. The result of our investigation was that, without fine-tuning, Llama2-7B is not a suitable LLM for LLM-Assist, as we were unable to have it reliably produce outputs of any particular format, even with a few in-context examples. Nonetheless, we can still use Llama 2 to get a rough estimate of the run-time for our main results with GPT-3. For a request for a single set of parameters for the planner, Llama 2 takes 3 seconds.

6 Discussions

In the realm of autonomous driving, the potential of language models, particularly in robotic tasks, has been increasingly recognized. The introduction of Large Language Models (LLMs) for complex reasoning tasks, as highlighted in works like those by Brohan et al. [3] and Ahn et al. [1], and their application in agentic control as demonstrated by Wang et al. [31], marks a significant advancement in this field. This trend is expected to amplify as foundation models grow larger, enhancing the emergent abilities of LLMs for more effective reasoning and grounding in real-world scenarios.

However, there are notable limitations to our current approach. We rely on a text-only model that processes a parsed state, which, while abstract, cannot fully replace perception-based systems in terms of information richness and context. Additionally, the speed at which decisions must be made in autonomous driving poses a challenge. Current LLMs operate slower than required for time-sensitive decision-making. While our model does not directly address these speed constraints, we anticipate future advancements in LLMs to bring improvements in both capability and processing speed. Another critical concern is the tendency of LLMs to produce hallucinated outputs [17, 18]. Current directions of research such as certified reasoning [21] and retrieval feedback [35] seek to mitigate this, but significant work remains, especially in high-risk domains like autonomous driving where accuracy is paramount and human lives are at stake.

Looking forward, our research underscores the promising role of LLMs in enhancing the performance of rule-based planners in scenarios where they fall short. We posit that future research should focus on rectifying the existing flaws of LLMs. This includes improving their grounding, incorporating multiple modalities for richer contextual understanding, and enhancing their scalability and speed. Such advancements will not only bolster the efficacy of LLM-assisted planning systems but also expand the applicability of LLMs in various facets of autonomous navigation and beyond.

7 Conclusion

In this paper, we introduce LLM-Assist, a novel approach in closed-loop planning for autonomous driving that synergizes the advanced capabilities of Large Language Models (LLMs) with traditional rule-based methods. Harnessing the emergent commonsense reasoning and cognitive abilities of recent LLMs, particularly GPT-3/4, LLM-Assist excels in navigating intricate driving scenarios where conventional methods are often insufficient. Our comprehensive evaluations on the nuPlan benchmark confirm LLM-Assist’s state-of-the-art performance in both reactive and non-reactive settings across numerous metrics of driving. Our work not only demonstrates the efficacy of integrating language models into autonomous driving solutions but also underscores their significance and ability in refining and enhancing complex decision-making processes, setting a new benchmark for future developments in this field.

References

Ahn et al. [2022] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.
Bacha et al. [2008] Andrew Bacha, Cheryl Bauman, Ruel Faruque, Michael Fleming, Chris Terwelp, Charles Reinholtz, Dennis Hong, Al Wicks, Thomas Alberi, David Anderson, et al. Odin: Team VictorTango’s entry in the DARPA Urban Challenge. Journal of Field Robotics, 25(8):467–492, 2008.
Brohan et al. [2023] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023.
Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
Caesar et al. [2021] Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles. arXiv preprint arXiv:2106.11810, 2021.
Chekroun et al. [2023] Raphael Chekroun, Thomas Gilles, Marin Toromanoff, Sascha Hornauer, and Fabien Moutarde. Mbappe: Mcts-built-around prediction for planning explicitly. arXiv preprint arXiv:2309.08452, 2023.
Chen et al. [2015] Chenyi Chen, Ari Seff, Alain Kornhauser, and Jianxiong Xiao. Deepdriving: Learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE international conference on computer vision, pages 2722–2730, 2015.
Chen et al. [2023] Yuxiao Chen, Peter Karkus, Boris Ivanovic, Xinshuo Weng, and Marco Pavone. Tree-structured policy planning with learned behavior models. arXiv preprint arXiv:2301.11902, 2023.
Cui et al. [2021] Alexander Cui, Sergio Casas, Abbas Sadat, Renjie Liao, and Raquel Urtasun. Lookout: Diverse multi-future prediction and planning for self-driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16107–16116, 2021.
Dauner et al. [2023] Daniel Dauner, Marcel Hallgarten, Andreas Geiger, and Kashyap Chitta. Parting with misconceptions about learning-based vehicle motion planning. In Conference on Robot Learning (CoRL), 2023.
Fan et al. [2018] Haoyang Fan, Fan Zhu, Changchun Liu, Liangliang Zhang, Li Zhuang, Dong Li, Weicheng Zhu, Jiangtao Hu, Hongye Li, and Qi Kong. Baidu apollo em motion planner. arXiv preprint arXiv:1807.08048, 2018.
Hu et al. [2022] Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. In European Conference on Computer Vision, pages 533–549. Springer, 2022.
Hu et al. [2023] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17853–17862, 2023.
Huang et al. [2022] Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608, 2022.
Huang et al. [2023] Zhiyu Huang, Haochen Liu, and Chen Lv. Gameformer: Game-theoretic modeling and learning of transformer-based interactive prediction and planning for autonomous driving. arXiv preprint arXiv:2303.05760, 2023.
Leonard et al. [2008] John Leonard, Jonathan How, Seth Teller, Mitch Berger, Stefan Campbell, Gaston Fiore, Luke Fletcher, Emilio Frazzoli, Albert Huang, Sertac Karaman, et al. A perception-driven autonomous urban vehicle. Journal of Field Robotics, 25(10):727–774, 2008.
McKenna et al. [2023] Nick McKenna, Tianyi Li, Liang Cheng, Mohammad Javad Hosseini, Mark Johnson, and Mark Steedman. Sources of hallucination by large language models on inference tasks. arXiv preprint arXiv:2305.14552, 2023.
Mündler et al. [2023] Niels Mündler, Jingxuan He, Slobodan Jenko, and Martin Vechev. Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation. arXiv preprint arXiv:2305.15852, 2023.
OpenAI [2023] OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Ouyang et al. [2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
Poesia et al. [2023] Gabriel Poesia, Kanishk Gandhi, Eric Zelikman, and Noah D Goodman. Certified reasoning with language models. arXiv preprint arXiv:2306.04031, 2023.
Rhinehart et al. [2021] Nicholas Rhinehart, Jeff He, Charles Packer, Matthew A Wright, Rowan McAllister, Joseph E Gonzalez, and Sergey Levine. Contingencies from observations: Tractable contingency planning with learned behavior models. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 13663–13669. IEEE, 2021.
Sadat et al. [2020] Abbas Sadat, Sergio Casas, Mengye Ren, Xinyu Wu, Pranaab Dhawan, and Raquel Urtasun. Perceive, predict, and plan: Safe motion planning through interpretable semantic representations. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIII 16, pages 414–430. Springer, 2020.
Song et al. [2023] Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M Sadler, Wei-Lun Chao, and Yu Su. Llm-planner: Few-shot grounded planning for embodied agents with large language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2998–3009, 2023.
Song et al. [2020] Haoran Song, Wenchao Ding, Yuxuan Chen, Shaojie Shen, Michael Yu Wang, and Qifeng Chen. Pip: Planning-informed trajectory prediction for autonomous driving. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16, pages 598–614. Springer, 2020.
Thrun et al. [2006] Sebastian Thrun, Mike Montemerlo, Hendrik Dahlkamp, David Stavens, Andrei Aron, James Diebel, Philip Fong, John Gale, Morgan Halpenny, Gabriel Hoffmann, et al. Stanley: The robot that won the darpa grand challenge. Journal of field Robotics, 23(9):661–692, 2006.
Touvron et al. [2023a] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
Touvron et al. [2023b] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
Treiber et al. [2000] Martin Treiber, Ansgar Hennecke, and Dirk Helbing. Congested traffic states in empirical observations and microscopic simulations. Physical review E, 62(2):1805, 2000.
Urmson et al. [2008] Chris Urmson, Joshua Anhalt, Drew Bagnell, Christopher Baker, Robert Bittner, MN Clark, John Dolan, Dave Duggins, Tugrul Galatali, Chris Geyer, et al. Autonomous driving in urban environments: Boss and the urban challenge. Journal of field Robotics, 25(8):425–466, 2008.
Wang et al. [2023] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.
Wei et al. [2021] Bob Wei, Mengye Ren, Wenyuan Zeng, Ming Liang, Bin Yang, and Raquel Urtasun. Perceive, attend, and drive: Learning spatial attention for safe self-driving. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 4875–4881. IEEE, 2021.
Wei et al. [2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
Yao et al. [2022] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
Yu et al. [2023] Wenhao Yu, Zhihan Zhang, Zhenwen Liang, Meng Jiang, and Ashish Sabharwal. Improving language models via plug-and-play retrieval feedback. arXiv preprint arXiv:2305.14002, 2023.
Zeng et al. [2019] Wenyuan Zeng, Wenjie Luo, Simon Suo, Abbas Sadat, Bin Yang, Sergio Casas, and Raquel Urtasun. End-to-end interpretable neural motion planner. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8660–8669, 2019.
Zhai et al. [2023] Jiang-Tian Zhai, Ze Feng, Jihao Du, Yongqiang Mao, Jiang-Jiang Liu, Zichang Tan, Yifu Zhang, Xiaoqing Ye, and Jingdong Wang. Rethinking the open-loop evaluation of end-to-end autonomous driving in nuscenes. arXiv preprint arXiv:2305.10430, 2023.


Supplementary Material

Challenge	Queries	Score	Collisions	TTC	Drivable	Comfort	Progress	Speed Limit	Direction
Closed-Loop Non-Reactive	0	92.51	98.05	93.11	99.55	95.19	91.75	99.83	99.95
Closed-Loop Non-Reactive	1	92.52	98.32	92.92	99.55	95.74	91.16	99.83	99.95
Closed-Loop Non-Reactive	2	92.82	98.41	93.47	99.27	95.10	92.12	99.83	99.95
Closed-Loop Non-Reactive	4	93.05	98.31	93.69	99.54	95.61	92.16	99.83	99.95
Closed-Loop Reactive	0	91.79	97.91	93.29	99.37	94.65	89.92	99.83	99.95
Closed-Loop Reactive	1	92.16	97.96	94.10	99.46	92.82	90.34	99.83	99.91
Closed-Loop Reactive	2	92.18	98.00	93.81	99.45	95.36	90.10	99.83	99.95
Closed-Loop Reactive	4	92.20	98.18	93.62	99.64	94.72	90.07	99.83	99.95

Table 7: Ablation Study of Number of LLM Queries per Iteration. $\textsc{GPT-3-Assist}_{\textsc{par}}$ evaluated on the nuPlan Closed-Loop Challenges on the Val14 split. Note that the rows with 0 queries denote PDMClosed [10], the current SoTA.

8 Ablation Study of Number of LLM Queries

As illustrated in Fig. 1, LLM-Assist queries the LLM planner repeatedly at a given time step until it proposes a trajectory whose predicted score meets a predefined threshold or the number of queries at that time step reaches the allowed maximum. If the maximum is reached, the trajectory with the highest predicted score is selected. In Tab. 7, we vary the number of allowed queries per time step and report the results. Note that since LLM-Assist uses PDMClosed [10], the current SoTA, as the base planner, the rows with 0 queries denote PDMClosed. The results show a clear trend: as the number of LLM queries increases, the overall score of LLM-Assist improves, from 92.51 to 93.05 in the non-reactive setting and from 91.79 to 92.20 in the reactive setting, although individual sub-metrics such as Comfort and Progress do not improve monotonically. We also show some qualitative results of multiple queries in practice in Figure 9 and Figure 10.
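The query loop described above can be sketched as follows. This is a minimal illustration rather than the released implementation: `propose` (one LLM planner query) and `predict_score` (the predicted trajectory score) are hypothetical callables, and the sketch assumes the best candidate is chosen among the LLM proposals only.

```python
from typing import Callable, Optional, Tuple, TypeVar

Trajectory = TypeVar("Trajectory")


def select_trajectory(
    propose: Callable[[], Trajectory],                # hypothetical: one LLM planner query
    predict_score: Callable[[Trajectory], float],     # hypothetical: predicted score of a proposal
    score_threshold: float,
    max_queries: int,
) -> Tuple[Optional[Trajectory], float]:
    """Query until a proposal meets the threshold or the query budget is spent."""
    best, best_score = None, float("-inf")
    for _ in range(max_queries):
        candidate = propose()
        score = predict_score(candidate)
        # Track the highest-scoring proposal seen so far.
        if score > best_score:
            best, best_score = candidate, score
        # Stop early once a proposal is good enough.
        if score >= score_threshold:
            break
    return best, best_score
```

With `max_queries=0` the function returns `(None, -inf)`, which corresponds to the 0-query rows of Tab. 7 where only the base planner is used.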

9 Qualitative Results

In this section, we present additional qualitative results, similar to those illustrated in Figure 5, to further demonstrate the efficacy of our method. These include a selection of results from multiple rounds of queries, which are introduced in Section 8 of this supplementary material. Specifically, through Figure 7, Figure 9, Figure 8, and Figure 10, we showcase a diverse array of scenarios by selecting examples from each combination of 1 versus 4 queries and Reactive versus Non-Reactive settings. This provides a comprehensive view of the versatility and robustness of our method across different driving conditions.

10 Metric Definitions

For all metrics, we use the official nuPlan challenge [5] definitions. We paraphrase these definitions below.

Score.

For each scenario, a combined score for the driven trajectory is calculated using a hybrid hierarchical-weighted average of individual metric scores. The planner receives a zero score in scenarios where (a) an at-fault collision with a vehicle, pedestrian, or bicyclist occurs, (b) multiple at-fault collisions with objects (like cones) happen, (c) there’s a drivable area violation, (d) the ego vehicle enters oncoming traffic by more than 6 meters, or (e) insufficient progress is made by the ego. If there’s a single at-fault collision with an object, or the ego drives into oncoming traffic for more than 2 meters but less than 6 meters, the weighted average score of other metrics is halved. In all other cases, the score is simply the weighted average of other metrics.
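The hierarchy above can be sketched as a small function. This is an illustrative reading of the paraphrased rules, not the nuPlan devkit code: all parameter names are hypothetical, and the treatment of an incursion of exactly 6 meters (halved rather than zeroed) is an assumption, since the text leaves that boundary unspecified.

```python
def scenario_score(
    weighted_average: float,          # weighted average of the other metric scores
    at_fault_agent_collision: bool,   # collision with a vehicle, pedestrian, or bicyclist
    at_fault_object_collisions: int,  # number of collisions with objects such as cones
    drivable_area_violation: bool,
    max_oncoming_traffic_m: float,    # largest incursion into oncoming traffic, in meters
    insufficient_progress: bool,
) -> float:
    # Conditions (a) to (e): the scenario receives a zero score.
    if (
        at_fault_agent_collision
        or at_fault_object_collisions > 1
        or drivable_area_violation
        or max_oncoming_traffic_m > 6.0
        or insufficient_progress
    ):
        return 0.0
    # A single object collision or a moderate incursion halves the score.
    if at_fault_object_collisions == 1 or max_oncoming_traffic_m > 2.0:
        return 0.5 * weighted_average
    return weighted_average
```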

Collisions.

A collision occurs when the bounding box of the ego vehicle intersects with the bounding box of another agent. Regardless of the duration of the collision, it is counted as one event, and the initial frame is used to determine the kinetic energy transferred during the collision. Following the collision, any tracks involved are excluded from metric assessments in subsequent frames.

Time-to-Collision.

TTC, or Time-to-Collision, estimates the time until the ego vehicle potentially collides with another track, based on their current trajectories. It’s calculated for tracks ahead, in cross traffic, or at the sides, particularly when the ego is changing lanes or in an intersection. TTC is determined by projecting the bounding boxes of the ego and other tracks forward at 0.1-second intervals, up to 3 seconds. The TTC is the earliest intersection time of these projections; if there’s no intersection, the TTC is deemed infinite.
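A simplified sketch of this projection is given below. It assumes axis-aligned bounding boxes moving at constant velocity, whereas the official metric works with oriented boxes; the `Box` type and function names are hypothetical, and whether the 0-second step is included is also an assumption.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    cx: float      # center x (m)
    cy: float      # center y (m)
    length: float  # extent along x (m)
    width: float   # extent along y (m)
    vx: float      # velocity along x (m/s)
    vy: float      # velocity along y (m/s)


def boxes_overlap(a: Box, b: Box, t: float) -> bool:
    """Check axis-aligned overlap after projecting both boxes forward by t seconds."""
    dx = abs((a.cx + a.vx * t) - (b.cx + b.vx * t))
    dy = abs((a.cy + a.vy * t) - (b.cy + b.vy * t))
    return dx <= (a.length + b.length) / 2 and dy <= (a.width + b.width) / 2


def time_to_collision(ego: Box, other: Box, dt: float = 0.1, horizon: float = 3.0) -> float:
    """Return the earliest projected intersection time, or infinity if there is none."""
    steps = round(horizon / dt)
    for k in range(steps + 1):
        t = k * dt
        if boxes_overlap(ego, other, t):
            return t
    return math.inf
```

For example, an ego box approaching a stationary box 20.5 m ahead at 10 m/s (both 4 m long) first overlaps at the 1.7-second step, while a box in a lane 10 m to the side never intersects and yields an infinite TTC.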

Drivable.

The drivable area compliance metric tracks instances where the ego veers outside this area. A small deviation outside the drivable area is permissible due to the overestimation of the ego’s bounding box, with a maximum violation threshold of 0.3 meters. If any frame shows the ego’s bounding box corners exceeding this threshold distance from the nearest drivable area, the compliance score is reduced to 0; otherwise, it remains at 1.

Comfort.

The comfort of the ego vehicle’s trajectory is assessed by comparing key variables (minimum and maximum longitudinal accelerations, maximum lateral acceleration, yaw rate, yaw acceleration, longitudinal jerk, and overall jerk magnitude) to predefined thresholds. These thresholds are empirically set based on expert trajectory data (e.g., max longitudinal acceleration at 2.40 m/s$^2$, max lateral acceleration at 4.89 m/s$^2$).

Progress.

To measure the ego vehicle’s progress in a scenario, it is compared to an expert driver’s route progress. This involves calculating the ego’s progress per frame along the same lanes and lane connectors used by the expert, summing this progress over the scenario. The ego-to-expert progress ratio is then derived by comparing the ego’s total progress to that of the expert. If the ego’s progress falls below a certain negative threshold (-0.1m) due to data noise, the ratio is set to 0. In cases where no expert route is defined, the ratio defaults to 1. Otherwise, the ratio is the minimum of 1 and the adjusted ego-to-expert progress comparison.
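The ratio can be sketched as below. The function and parameter names are hypothetical, and the "adjusted" comparison is interpreted here as flooring both progress values at the 0.1 m threshold before dividing; that interpretation is an assumption made for illustration, not something stated in the text above.

```python
from typing import Optional


def progress_ratio(
    ego_progress_m: float,
    expert_progress_m: Optional[float],  # None when no expert route is defined
    threshold_m: float = 0.1,
) -> float:
    # Ego progress below the negative threshold gives a ratio of 0.
    if ego_progress_m < -threshold_m:
        return 0.0
    # Without an expert route there is nothing to compare against.
    if expert_progress_m is None:
        return 1.0
    # Assumption: floor both values at the threshold to avoid dividing by ~0.
    ego = max(ego_progress_m, threshold_m)
    expert = max(expert_progress_m, threshold_m)
    return min(1.0, ego / expert)
```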

Speed Limit.

This metric checks if the ego vehicle’s speed surpasses the speed limit, which is determined from the lane or, for a lane connector, from the higher speed limit of connecting lanes. A speed limit violation is noted whenever the ego’s speed exceeds this limit.

Direction.

This metric is designed to penalize the ego vehicle if it enters oncoming traffic lanes. It calculates the movement of the ego’s center over a 1-second period in relation to the designated driving direction, based on the baselines of the lanes or lane-connectors associated with the ego. The score is assigned a value of 1 if the ego does not travel against the traffic flow by more than 2 meters. If the ego moves against the traffic flow exceeding 6 meters, the score is set to 0. For movements between these two thresholds, the score is adjusted to 0.5.
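The thresholding above can be sketched as follows. The input format (per-frame displacement against the driving direction, in meters) and both function names are hypothetical, and the handling of exactly 2 or 6 meters follows one plausible reading of the text.

```python
from typing import Sequence


def max_wrong_way_distance(
    against_flow_steps_m: Sequence[float],  # per-frame movement against traffic flow (m)
    dt: float,                              # time between frames (s)
    window_s: float = 1.0,
) -> float:
    """Largest against-flow movement accumulated over any 1-second window."""
    frames_per_window = max(1, round(window_s / dt))
    worst = 0.0
    for end in range(len(against_flow_steps_m)):
        start = max(0, end - frames_per_window + 1)
        worst = max(worst, sum(against_flow_steps_m[start:end + 1]))
    return worst


def direction_score(wrong_way_m: float) -> float:
    if wrong_way_m > 6.0:
        return 0.0  # clearly driving into oncoming traffic
    if wrong_way_m > 2.0:
        return 0.5  # between the two thresholds
    return 1.0      # compliant
```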