ActionReasoningBench: Reasoning about Actions with and without Ramification Constraints

Divij Handa ^∗
Arizona State University
Tempe, AZ
[email protected]

Pavel Dolin ^∗
Arizona State University
Tempe, AZ
[email protected]

Shrinidhi Kumbhar ^∗
Arizona State University
Tempe, AZ
[email protected]

Chitta Baral
Arizona State University
Tempe, AZ
[email protected]

Tran Cao Son
New Mexico State University
Las Cruces, NM
[email protected]

Abstract

Reasoning about actions and change (RAC) has historically been central to many early AI challenges, such as the frame problem, and has driven the development of many AI disciplines, including non-monotonic and commonsense reasoning. The role of RAC remains important even now, particularly for tasks involving dynamic environments, interactive scenarios, and commonsense reasoning. Despite the progress of Large Language Models (LLMs) in various AI domains, their performance on RAC is underexplored. To address this gap, we introduce a new benchmark, ActionReasoningBench, encompassing 13 domains and rigorously evaluating LLMs across eight different areas of RAC: Object Tracking, Fluent Tracking, State Tracking, Action Executability, Effects of Actions, Numerical RAC, Hallucination Detection, and Composite Questions. Furthermore, we investigate the indirect effects of actions due to ramification constraints for every domain. Finally, we evaluate our benchmark using open-source and commercial state-of-the-art LLMs, including GPT-4o, Gemini-1.0-Pro, Llama2-7b-chat, Llama2-13b-chat, Llama3-8b-instruct, Gemma-2b-instruct, and Gemma-7b-instruct. Our findings indicate that these models face significant challenges across all categories included in our benchmark.

^∗ These authors contributed equally to this work.

1 Introduction

Reasoning about actions and change (RAC) is a cornerstone in artificial intelligence, tracing back to foundational work in the early 1960s (McCarthy et al., 1963). In the early days, the primary focus was on developing logical systems to reason about actions and changes in the world. One of the significant challenges was to succinctly express the effects of an action on changeable properties of the world, known as fluents. For instance, consider the statement: “Moving an object from location X to location Y will result in the object being at location Y”. While direct effects on the affected fluents could be expressed, formalizing the impact on unaffected fluents posed a significant challenge, referred to as the “frame problem”. For example, “moving a block from the table to the chair does not affect other objects on the table”. This challenge became even more complex when the descriptions involved relationships between fluents, leading to indirect effects or ramifications of actions. An example of this is the constraint “a block can’t be at two places at the same time”, which implies that moving a block from position X to position Y will result in the block no longer being at X.

It took multiple decades of research to create a comprehensive logical formalization that adequately addressed these issues. This resulted in the labor-intensive development of numerous handcrafted rules and logics that detail the effects and preconditions of actions (Reiter, 2001). Considering the importance of RAC in several reasoning tasks, it is no surprise that in recent years, the natural language processing (NLP) community has shown an interest in this area (He et al., 2023) (Spiliopoulou et al., 2022) (Banerjee et al., 2020). Moreover, LLM-based agents that perform complex decision tasks also involve actions (Kohli and Sun, 2024) (Zhou et al., 2024) (Zhao et al., 2024).

Translating rules into natural language reduces the manual effort of writing every rule in logic programming. Despite the benefits, the inherent complexities of language and the requirement to follow long reasoning chains pose substantial challenges for LLMs (Wang et al., 2024) (Mu et al., 2023). LLMs have been tested extensively on several reasoning domains, including planning (Valmeekam et al., 2024a) (Valmeekam et al., 2024b), commonsense reasoning (Sakaguchi et al., 2021) (Talmor et al., 2019) (Zhang et al., 2018), mathematical reasoning (Imani et al., 2023) (Ahn et al., 2024), temporal reasoning (Tan et al., 2023) (Aghzal et al., 2023), and logical reasoning (Parmar et al., 2024a) (Luo et al., 2024). However, despite the crucial role of RAC, the performance of LLMs on RAC remains heavily under-explored. This study seeks to fill this gap by introducing a new benchmark, ActionReasoningBench, that pinpoints where LLMs struggle in RAC.

In our work, we utilize 13 domains from the International Planning Competition (IPC) (International Conference on Automated Planning and Scheduling, 1998) in creating ActionReasoningBench, with action-sequences varying in length from 1 to 19. ActionReasoningBench is organized into 8 distinct categories: Object Tracking, Fluent Tracking, State Tracking, Action Executability, Effects of Actions, Numerical RAC, Hallucination Detection, and Composite Questions (category details in Section 3.1). Each category evaluates a specific aspect of RAC. Additionally, we formulate ramification constraints for each domain, which introduce indirect effects of actions, thereby increasing the complexity (McIlraith, 2000) and enhancing the evaluation of LLMs' reasoning capabilities. Lastly, we create an obfuscated variant of ActionReasoningBench that replaces the English names of objects and properties with strings of randomly generated characters. This obfuscated variant assesses whether the LLMs understand the rules described in the context or simply rely on their pre-training parametric knowledge.

Highlights of our benchmark, ActionReasoningBench, along with a comparison to previous benchmarks on RAC, are presented in Table 1. We evaluate five open-source LLMs, encompassing three different LLM families, as well as two leading proprietary LLMs, GPT-4o (Achiam et al., 2023) and Gemini-1.0-Pro (Team et al., 2023). We conduct a comprehensive analysis of the different categories in RAC, along with the impact of increasing model size and of various few-shot settings on performance. This multifaceted approach not only highlights the capacities and limitations of LLMs in handling complex RAC tasks but also sheds light on their potential utility (Sharma, 2019) (Sap et al., 2019) and applicability in practical LLM-based agents (Yao et al., 2022) (Aksitov et al., 2023). Our evaluation reveals that LLMs struggle with State Tracking, Effects of Actions, Numerical RAC, and Composite Questions, with their performance decreasing even further as the length of the action sequence increases. Furthermore, LLMs exhibit difficulty in reasoning when queried about negative fluents. We also investigate the effect of ramification on these categories, analyzing the varied responses to ramification within each category. The detailed results of our findings are discussed in Section 5.

2 Related Works

| | PlanBench | TRAC | (Ours) |
|---|---|---|---|
| # of domains | 2 | 1 | 13 |
| Obfuscation of domains | ✓ | × | ✓ |
| # of queries | 26k | 15k | 123k |
| Binary Questions (T/F) | × | ✓ | ✓ |
| Free Answer | ✓ | × | ✓ |
| Max Action Sequence Length | 48 | 3 | 19 |
| Max # of objects | 24 | 5 | 28 |
| Plan Verification | ✓ | ✓ | ✓ |
| Executability of Actions | ✓ | ✓ | ✓ |
| Effects Checking (on state) | ✓ | ✓ | ✓ |
| Ramifications (all 8 question types) | × | × | ✓ |
| Composite Questions | × | × | ✓ |
| Effects Checking (on Fluents) | × | × | ✓ |
| Effect Checking (on Objects) | × | × | ✓ |
| Numerical Reasoning | × | × | ✓ |
| Hallucination Identification | × | × | ✓ |
| Subcategories of Fluents | × | × | ✓ |

Table 1: Differences between ActionReasoningBench (Ours) and previous benchmarks on reasoning about actions and change. PlanBench (Valmeekam et al., 2024a) ; TRAC (He et al., 2023)

Reasoning about Actions and Change

Our work builds on an existing body of RAC literature. Works by He et al. (2023) and Banerjee et al. (2020) focus on creating RAC datasets from IPC and evaluating LLMs. Spiliopoulou et al. (2022) explore RAC within the scope of real-world physical attributes, intersecting with commonsense reasoning. RAC has also been investigated using multi-modal systems (Sampat et al., 2022a) (Sampat et al., 2022b).

Planning

Our study is inspired by several benchmarks developed using IPC to evaluate LLMs in planning (Valmeekam et al., 2024a) (Valmeekam et al., 2024b).

Data Creation from Logic Programs

To create a robust dataset, we use deterministic logical solvers, following approaches similar to those used by Stein and Koller (2023) and Valmeekam et al. (2024a). Additional related literature on reasoning adjacent to RAC is discussed in Appendix A.

3 About ActionReasoningBench

3.1 Question Categories

The questions are classified into eight different categories as follows. Refer to Appendix I for examples of questions and prompts for each question category.

1. Object Tracking - This category contains questions on the status of a particular object that is present in the domain. For example, in the Blocksworld domain, an object-tracking question might be “Is block b3 on the table and not clear?”

2. Effects of Actions - This category contains questions that inquire about the effect of taking a given action. For example, in the Mystery domain, an effects-of-action question can be “From the current state, the vehicle v0 moves from location l1 to l0, and has fuel-level f6 and f5, which properties of the state will be true now?”

3. Fluent Tracking - This category contains questions about the fluents, i.e., properties of the domain, that are true or false for a given object. For instance, in the NPuzzle domain, a fluent-tracking question might be “List all the neighbors of tile t_3.”

4. State Tracking - This category encompasses and extends the Fluent Tracking category. It involves querying about all the fluents present in the domain that are true or false at a given moment. For instance, in the Goldminer domain, a state-tracking question can be “What are all the valid properties in this state?”

5. Action Executability - This category contains questions regarding the executability of a particular action in the given state. For instance, in the Miconic domain, an action-executability question can be “List all executable actions present in the current state.”

6. Numerical RAC - Questions requiring a numerical response fall under this category. The questions can be from any of the 5 categories listed above. For example, in the Spanner domain, a Numerical-RAC question can be “What are the number of executable actions in the current state?”

7. Hallucination Detection - This category contains questions that require identifying objects, actions, or fluents (i.e., properties of the state) that are not mentioned in the domain description and are entirely fabricated. For example, in the Depots domain, a hallucination-detection question can be “Given the following properties of the state, which one is not defined. Write None if all are defined. Crate2 is not on pallet4, crate1 is in truck2, crate3 is next to truck3.”

8. Composite Question - This category contains questions combining the above-mentioned categories. These questions require multiple reasoning steps as they combine up to 3 different categories. For example, in the Satellite domain, a composite question can be “What are the derived properties of the state for satellite0 before the first infeasible action in the sequence? Write None if there are none.”
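To make the Action Executability and State Tracking categories concrete, the following is a minimal sketch of how ground-truth answers for such questions could be obtained by simulating an action sequence. It is a toy Python illustration, not the benchmark's actual pipeline (which relies on PDDL and ASP solvers, see Section 3.4); the `Action` class, the helper names, and the Blocksworld-like fluent strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical STRIPS-style action: preconditions plus add/delete effects."""
    name: str
    preconditions: frozenset  # fluents that must hold before execution
    add_effects: frozenset    # fluents that become true
    del_effects: frozenset    # fluents that become false


def is_executable(state, action):
    # An action is executable when all of its preconditions hold in the state.
    return action.preconditions <= state


def apply(state, action):
    # Unaffected fluents persist; only the listed effects change the state.
    return (state - action.del_effects) | action.add_effects


def first_inexecutable(state, plan):
    """Return (index, state_before) of the first inexecutable action,
    or (None, final_state) if the whole plan is executable."""
    for i, action in enumerate(plan):
        if not is_executable(state, action):
            return i, state
        state = apply(state, action)
    return None, state


# Toy Blocksworld-like instance (names are illustrative only).
pickup_b1 = Action(
    "pickup(b1)",
    frozenset({"ontable(b1)", "clear(b1)", "handempty"}),
    frozenset({"holding(b1)"}),
    frozenset({"ontable(b1)", "clear(b1)", "handempty"}),
)
stack_b1_b2 = Action(
    "stack(b1,b2)",
    frozenset({"holding(b1)", "clear(b2)"}),
    frozenset({"on(b1,b2)", "clear(b1)", "handempty"}),
    frozenset({"holding(b1)", "clear(b2)"}),
)

initial = frozenset({"ontable(b1)", "ontable(b2)", "clear(b1)", "clear(b2)", "handempty"})
plan = [pickup_b1, stack_b1_b2, pickup_b1]
index, state = first_inexecutable(initial, plan)
```

Here the third action is the first inexecutable one, and `state` holds the fluents that are true just before it, which is the kind of information a composite question about "the first infeasible action" asks for.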

3.2 Questions Subcategories: Fluents and Static Properties

We further divide the question categories, Object Tracking, Effects of Actions, and Fluent Tracking into 4 subcategories depending on the fluent type as these categories fundamentally pertain to questions regarding fluents.

1. Base Fluents - Fluents that are not dependent on any other fluent and can change due to an action are known as base fluents. For instance, bomb_at is a base fluent in the Goldminer domain, which defines whether a bomb is at a specific location or not.

2. Base Fluents with Constraints - These are base fluents governed by a ramification constraint that involves only the fluent itself rather than other fluents. For example, the fluent at in the Depots domain depends only on itself, i.e., if a truck is at location l0, it can’t be at location l1. We include these constraints in the domain description, for example, "A truck can be only at one location at a time."

3. Derived Fluents - Fluents that depend on other fluents, reflecting a level of dependency and interaction, are known as derived fluents. They also constitute a part of the ramification constraints. For example, the fluent free in the Grippers domain is a derived fluent as it directly depends on the fluent carry. The relationship is a consequence of the following constraint: A robot’s gripper is said to be free if the robot is not carrying any object with its gripper.

4. Static Properties - Properties that do not change throughout, irrespective of any action taken, are known as static properties. For example, the property connected in the Visitall domain represents whether two locations are connected or not. The actions can move the robot from one location to another, but the locations will always be connected regardless of any action.

For examples of questions highlighting the above-mentioned fluents, refer to Appendix J. Finally, we create questions with negative fluents for every fluent type, which allows us to evaluate the understanding of LLMs for negation present in RAC.
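The four fluent types can be illustrated with a small sketch. The toy domain below (a robot with a gripper moving between connected rooms) is hypothetical and is written in Python purely for illustration; the predicate names `at`, `carry`, `connected`, and the helper functions are assumptions, not definitions taken from the benchmark's domain files.

```python
# Static property: never changed by any action.
STATIC = frozenset({("connected", "room1", "room2"), ("connected", "room2", "room1")})


def is_free(fluents, gripper):
    # Derived fluent: a gripper is free iff it carries nothing.
    return not any(f[0] == "carry" and f[1] == gripper for f in fluents)


def move(fluents, robot, src, dst):
    """Move a robot between connected rooms."""
    if ("at", robot, src) not in fluents or ("connected", src, dst) not in STATIC:
        raise ValueError("move is not executable in this state")
    updated = set(fluents)
    updated.add(("at", robot, dst))  # direct effect
    # Base fluent with a constraint: a robot can be in only one room at a time,
    # so every other 'at' fluent for this robot becomes false (indirect effect).
    updated = {
        f for f in updated
        if not (f[0] == "at" and f[1] == robot and f[2] != dst)
    }
    return frozenset(updated)


def pick(fluents, gripper, ball):
    """Pick up a ball; 'carry' is a base fluent changed directly by the action."""
    if not is_free(fluents, gripper):
        raise ValueError("pick is not executable in this state")
    return frozenset(fluents | {("carry", gripper, ball)})


state = frozenset({("at", "robot1", "room1")})
state = move(state, "robot1", "room1", "room2")
free_before = is_free(state, "gripper1")
state = pick(state, "gripper1", "ball1")
free_after = is_free(state, "gripper1")
```

In this sketch the derived fluent is never stored; it is recomputed from the base fluents, which is one common way to keep a derived fluent consistent with the constraint that defines it.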

3.3 Dataset Structure and Variations

| | test | train |
|---|---|---|
| Action Executability | 359 | 4841 |
| Effects | 1065 | 10085 |
| Fluent Tracking | 1730 | 21671 |
| Hallucination | 370 | 21086 |
| Numerical Reasoning | 360 | 19140 |
| Object Tracking | 1718 | 9876 |
| State Tracking | 254 | 2996 |
| Composite Questions | 360 | 15199 |
| Base Fluents | 1048 | 6847 |
| Base Fluents + Const. | 1162 | 12843 |
| Derived Fluents | 935 | 8760 |
| Static Properties | 1140 | 10160 |
| Total Unique Questions | 6216 | 104894 |

| Property | Value |
|---|---|
| Action-Sequence Lengths | 1, 10, 19 (test); 1, 5, 10, 15, 19 (train) |
| Answer Types | True/False and Free |
| Ramifications | With and Without |
| Obfuscations | With and Without |
| Selected Domains | 13 |
| Domain Types | Transport and Non Transp. |

Table 2: Statistics of training and testing splits of ActionReasoningBench

Selected Domains

ActionReasoningBench (code and data are available at https://github.com/izuminka/reasoning_about_actions) comprises 13 domains, handpicked from IPC, which offers benchmarks for state-of-the-art planning systems with the intent of facilitating research in planning. We focused on collecting deterministic domains, where every action has deterministic preconditions and effects. Since IPC contains many domains involving transportation, we divided the 13 domains into 7 transportation and 6 non-transportation domains. More details about the domains are present in Appendix F.17.

Domain Descriptions with and without Ramification Constraints

A ramification constraint describes an indirect effect of an action. Effective reasoning about these constraints is essential for a robust AI system capable of predicting and reasoning about the outcomes of actions. For instance, in the Goldminer domain, the robot’s arm is said to be empty if it is not holding a laser, stone, or gold. Examples of domains with and without ramification constraints are in Appendices H.1 and H.2, respectively.

Action-Sequence Lengths

We generate questions with action-sequences of length 1, 5, 10, 15, and 19 to verify the action-following capabilities of LLMs.

Obfuscations

Since the IPC data is publicly available and widely discussed online, the pre-training data of LLMs might overlap with these domains. To assess the reasoning capabilities of the models, we obfuscate the objects, fluents, and actions present in the domain with randomly generated strings. This forces the LLM to rely on the user’s description rather than its parametric knowledge. Examples of obfuscated prompts are presented in Appendix K.4.
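The obfuscation step can be sketched as a consistent name-to-token substitution. The Python snippet below is only an illustration under stated assumptions: the token length, the seeding, and the function names `build_obfuscation_map` and `obfuscate` are hypothetical and are not taken from the benchmark's code.

```python
import random
import re
import string


def build_obfuscation_map(names, seed=0, length=8):
    """Map every object, fluent, or action name to a unique random string."""
    rng = random.Random(seed)  # seeded so the mapping is reproducible
    mapping, used = {}, set()
    for name in names:
        while True:
            token = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
            if token not in used:
                break
        used.add(token)
        mapping[name] = token
    return mapping


def obfuscate(text, mapping):
    """Replace whole-word occurrences of each name in a single pass."""
    # Longer names first, so a name is never shadowed by one of its prefixes.
    ordered = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in ordered) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)


mapping = build_obfuscation_map(["truck1", "at", "drive"])
obfuscated = obfuscate("drive truck1; truck1 is at depot0", mapping)
```

Because the same mapping is applied to the domain description, the initial state, and the question, the logical structure of the problem is preserved while the surface names no longer carry any meaning.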

Answer Types

For all the categories mentioned in the previous section, we formulate two different types of questions based on the nature of the answer. First, a simple binary question, where the answer is either True or False. Second, we ask a subjective question, where the answer can be multiple objects, actions, or properties of the state. Examples of answer types are presented in Appendix I.

3.4 Data Creation

Figure 1: Question generation pipeline for ActionReasoningBench. Stage 1: Computation and verification of PDDL instances and plans. Stage 2: Conversion of PDDL instances and plans to ASP. Stage 3: Computation of the action-state space via ASP and Python. Stage 4: Generation of natural language questions based on state-action space.

The question creation pipeline comprises four stages, as illustrated in Figure 1. Each selected IPC domain is represented in PDDL (refer to Appendix F.19 for examples). This PDDL is used to generate 10 initial and goal conditions. Subsequently, we utilize a PDDL solver and validator to obtain and validate the action-sequences necessary to reach the goal state. Next, we convert the PDDL domains and instances into ASP descriptions via templates. In the third stage, we employ ASP solvers to generate the action-state space and extract the fluents for each state along with all executable and inexecutable actions. Finally, the state-action data is converted into questions using Python-based templates. We introduce up to five natural language variants for every object, action, and fluent to enhance the linguistic diversity of the dataset. Additionally, we manually translate the domain descriptions from PDDL to natural language and incorporate ramification constraints into these domains.
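The final, template-based stage can be illustrated with a short sketch. The Python code below is a hypothetical example of turning a state into a binary question with a known answer; the templates, the question wording, and the function names are assumptions made for illustration and do not reproduce the benchmark's actual templates.

```python
import random

# Hypothetical natural language variants per fluent; "{neg}" marks where a
# negation is inserted when a negative fluent is queried.
FLUENT_TEMPLATES = {
    "on": [
        "block {0} is{neg} on block {1}",
        "block {0} is{neg} placed on top of block {1}",
    ],
    "clear": [
        "block {0} is{neg} clear",
        "block {0} is{neg} free of any block on top of it",
    ],
}


def render_fluent(fluent, negated, rng):
    """Render a fluent tuple such as ("on", "b1", "b2") as English text."""
    predicate, *args = fluent
    template = rng.choice(FLUENT_TEMPLATES[predicate])
    return template.format(*args, neg=" not" if negated else "")


def make_binary_question(state, fluent, negated, rng):
    """Build a True/False question and derive its answer from the state."""
    statement = render_fluent(fluent, negated, rng)
    # The statement is true when the fluent holds and is asked positively,
    # or when it does not hold and is asked in negated form.
    answer = (fluent in state) != negated
    return {"question": f"In this state, is it True or False that {statement}?",
            "answer": answer}


rng = random.Random(0)
state = {("on", "b1", "b2"), ("clear", "b1")}
q_neg_true_fluent = make_binary_question(state, ("on", "b1", "b2"), True, rng)
q_neg_false_fluent = make_binary_question(state, ("clear", "b2"), True, rng)
```

Since the answer is computed from the solver-generated state rather than written by hand, every linguistic variant of a question shares the same verified ground truth.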

3.5 Data Validation

The question-generation process utilizes traditional deterministic planners to accurately create the state space, ensuring the correctness of the action-sequences and their effects. This state space is then transformed by a natural language converter, followed by manual validation by three independent annotators on a small subset of the data from every domain to ensure data quality. Each question is scored on a scale from 1 to 5, assessing the soundness and comprehensibility of the natural language, resulting in an average score of 4.215. Additional details on data validation can be found in Appendix B. Furthermore, the domain description is reviewed by two domain experts to verify its accurate conversion from PDDL.

3.6 Data Splits

We divided the benchmark into two splits, one for training and the other for testing the models. Table 2 presents the breakdown of categories in the training and testing splits. Since multiple subcategories defined in section 3.2 are present in categories effects, fluent tracking, and object tracking, these categories dominate the training split of the dataset. For the test split, we ensured a balance in terms of action-sequence length, question categories, fluent subcategories, answer type (with equal proportions of true, false, and free-answer), and across all domains.

4 Experiments and Evaluation

We evaluate our benchmark, ActionReasoningBench, using 7 different LLMs with 4 different prompt settings. These evaluations encompass all combinations of varying action-sequence lengths, with and without ramification constraints and with and without obfuscation. Our evaluation uses 2 proprietary state-of-the-art LLMs, GPT-4o (Achiam et al., 2023) and Gemini-1.0-Pro (Team et al., 2023), alongside 5 open-source LLMs, Gemma-2b-instruct, Gemma-7b-instruct (Team et al., 2024), Llama2-7b-chat, Llama2-13b-chat (Touvron et al., 2023), and Llama3-8b-instruct (AI, 2024). Each LLM is tested with zero-shot, few-shot-1, few-shot-3, and few-shot-5 prompting techniques. The results are presented for binary (true/false) questions and free-form answers. We further fine-tuned Llama3-8b-instruct using the training data split described in Section 3.6. Given the limited context length of the open-source LLMs, we excluded examples exceeding the context length of 4096 tokens. Fine-tuning was performed separately for free-answer and binary questions. Fine-tuning details are provided in Appendix D.

The evaluation of binary and free-form answers is conducted independently. For binary answers, we extracted the "true" and "false" keywords from the response and compared them against the ground truth. Since free-form answer evaluation can’t rely on exact string matching, human evaluation was employed for GPT-4o and Gemini-1.0-Pro. We further fine-tune RoBERTa (Liu et al., 2019) to evaluate all models. Further details regarding the evaluation process are provided in Appendices C.2 and E.1. We additionally report the standard error of the mean (SEM) for all conducted experiments, computed as $\mathrm{SEM} = \frac{\sigma}{\sqrt{n}}$, where $\sigma$ is the standard deviation and $n$ is the number of samples. All the experiments are done on 8 A100 GPUs.
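A minimal sketch of the binary evaluation and the SEM computation is given below. It is an assumption-laden Python illustration rather than the evaluation code used for the reported numbers: in particular, treating responses that contain both or neither keyword as incorrect, and using the population standard deviation for $\sigma$, are choices made here for concreteness.

```python
import math
import re


def extract_binary_answer(response):
    """Return True/False if exactly one of the keywords occurs, else None."""
    found = set(re.findall(r"\b(true|false)\b", response.lower()))
    if found == {"true"}:
        return True
    if found == {"false"}:
        return False
    return None  # ambiguous or missing answer (assumed to count as incorrect)


def accuracy_and_sem(predictions, gold):
    """Return (accuracy, SEM), both in percent, with SEM = sigma / sqrt(n)."""
    scores = [1.0 if p == g else 0.0 for p, g in zip(predictions, gold)]
    n = len(scores)
    mean = sum(scores) / n
    # Population variance of the 0/1 correctness scores (an assumption; the
    # sample variance with n - 1 would give a slightly larger value).
    variance = sum((s - mean) ** 2 for s in scores) / n
    return 100 * mean, 100 * math.sqrt(variance) / math.sqrt(n)


responses = ["The answer is True.", "False, because b1 is held.", "true or false?", "TRUE"]
gold = [True, True, False, True]
predictions = [extract_binary_answer(r) for r in responses]
accuracy, sem = accuracy_and_sem(predictions, gold)
```

For 0/1 correctness scores with accuracy $p$, the population standard deviation is $\sqrt{p(1-p)}$, so the SEM reduces to $\sqrt{p(1-p)/n}$.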

5 Results and Discussion

In this section, we highlight all the results and analysis performed using our benchmark, ActionReasoningBench, using the experimental setup defined in Section 4. Table 3 summarizes the performance of all LLMs on the binary questions (AVG denotes the average calculated across all categories for a given action-sequence length and model). Table 4 summarizes the results of the free-answer human evaluation. Zero-shot results, free-answer evaluation with the RoBERTa classifier, and results for the obfuscation-ramification combinations are presented in Appendix C. We present additional observations in Appendix C.3.

| Act. Seq. | Question Categories | GPT-4o (few shot 1) | Gemini Pro (few shot 1) | LLaMa2 13b (few shot 1) | LLaMa3 8b (few shot 1) | LLaMa2 7b (few shot 1) | Gemma 7b (few shot 1) | LLaMa3 8b (fine-tuned) |
|---|---|---|---|---|---|---|---|---|
| 1 | Object Trk. | ${79.62}_{1.76}$ | ${69.33}_{2.01}$ | ${55.15}_{2.17}$ | ${43.62}_{2.16}$ | ${47.14}_{2.18}$ | ${49.14}_{2.18}$ | ${61.33}_{2.13}$ |
| 1 | Fluent Trk. | ${83.39}_{1.6}$ | ${68.75}_{1.99}$ | ${45.67}_{2.14}$ | ${40.81}_{2.11}$ | ${42.91}_{2.12}$ | ${55.51}_{2.13}$ | ${58.64}_{2.11}$ |
| 1 | State Trk. | ${70.91}_{6.12}$ | ${65.45}_{6.41}$ | ${52.73}_{6.73}$ | ${50.91}_{6.74}$ | ${47.27}_{6.73}$ | ${61.82}_{6.55}$ | ${49.09}_{6.74}$ |
| 1 | Action Exec. | ${79.75}_{4.52}$ | ${70.89}_{5.11}$ | ${55.0}_{5.56}$ | ${55.0}_{5.56}$ | ${61.25}_{5.45}$ | ${50.0}_{5.59}$ | ${62.5}_{5.41}$ |
| 1 | Effects | ${59.75}_{2.75}$ | ${59.69}_{2.74}$ | ${49.06}_{2.79}$ | ${41.56}_{2.76}$ | ${54.37}_{2.78}$ | ${46.52}_{2.81}$ | ${45.62}_{2.78}$ |
| 1 | Num. Reas. | ${55.13}_{5.63}$ | ${52.5}_{5.58}$ | ${48.75}_{5.59}$ | ${43.75}_{5.55}$ | ${48.75}_{5.59}$ | ${45.0}_{5.56}$ | ${43.75}_{5.55}$ |
| 1 | Hallucination | ${93.67}_{2.74}$ | ${82.5}_{4.25}$ | ${56.25}_{5.55}$ | ${70.0}_{5.12}$ | ${58.75}_{5.5}$ | ${58.23}_{5.55}$ | ${56.25}_{5.55}$ |
| 1 | Composite | ${76.71}_{4.95}$ | ${68.49}_{5.44}$ | ${52.54}_{6.5}$ | ${56.94}_{5.84}$ | ${54.24}_{6.49}$ | ${54.17}_{5.87}$ | - |
| 1 | AVG | ${76.33}_{1.02}$ | ${67.14}_{1.12}$ | ${50.66}_{1.2}$ | ${44.87}_{1.19}$ | ${48.65}_{1.2}$ | ${51.51}_{1.19}$ | ${56.06}_{1.21}$ |
| 10 | Object Trk. | ${77.26}_{1.8}$ | ${66.79}_{2.02}$ | ${55.62}_{2.13}$ | ${47.89}_{2.14}$ | ${47.88}_{2.14}$ | ${50.09}_{2.14}$ | ${64.04}_{2.06}$ |
| 10 | Fluent Trk. | ${81.78}_{1.66}$ | ${62.36}_{2.08}$ | ${46.49}_{2.14}$ | ${38.19}_{2.09}$ | ${39.48}_{2.1}$ | ${50.75}_{2.16}$ | ${58.49}_{2.12}$ |
| 10 | State Trk. | ${57.41}_{6.73}$ | ${62.96}_{6.57}$ | ${48.15}_{6.8}$ | ${61.11}_{6.63}$ | ${48.15}_{6.8}$ | ${62.96}_{6.57}$ | ${50.0}_{6.8}$ |
| 10 | Action Exec. | ${55.13}_{5.63}$ | ${53.75}_{5.57}$ | ${52.5}_{5.58}$ | ${48.75}_{5.59}$ | ${55.0}_{5.56}$ | ${60.0}_{5.48}$ | ${35.0}_{5.33}$ |
| 10 | Effects | ${56.96}_{2.79}$ | ${61.56}_{2.72}$ | ${48.12}_{2.79}$ | ${43.75}_{2.77}$ | ${50.94}_{2.79}$ | ${45.62}_{2.78}$ | ${48.12}_{2.79}$ |
| 10 | Num. Reas. | ${57.69}_{5.59}$ | ${47.5}_{5.58}$ | ${47.5}_{5.58}$ | ${48.75}_{5.59}$ | ${56.25}_{5.55}$ | ${44.87}_{5.63}$ | ${50.0}_{5.59}$ |
| 10 | Hallucination | ${83.33}_{4.22}$ | ${75.0}_{4.84}$ | ${50.0}_{5.59}$ | ${62.5}_{5.41}$ | ${57.5}_{5.53}$ | ${57.5}_{5.53}$ | ${50.0}_{5.59}$ |
| 10 | Composite | ${70.13}_{5.22}$ | ${57.14}_{5.64}$ | ${47.46}_{6.5}$ | ${36.0}_{5.54}$ | ${61.02}_{6.35}$ | ${54.05}_{5.79}$ | - |
| 10 | AVG | ${72.5}_{1.06}$ | ${62.88}_{1.15}$ | ${50.17}_{1.19}$ | ${44.82}_{1.18}$ | ${47.44}_{1.19}$ | ${50.59}_{1.19}$ | ${56.14}_{1.2}$ |
| 19 | Object Trk. | ${74.29}_{1.91}$ | ${67.42}_{2.04}$ | ${55.49}_{2.16}$ | ${50.57}_{2.18}$ | ${48.48}_{2.17}$ | ${47.16}_{2.17}$ | ${66.29}_{2.06}$ |
| 19 | Fluent Trk. | ${73.63}_{1.95}$ | ${58.75}_{2.17}$ | ${45.33}_{2.2}$ | ${43.39}_{2.19}$ | ${42.02}_{2.18}$ | ${50.88}_{2.21}$ | ${60.12}_{2.16}$ |
| 19 | State Trk. | ${59.09}_{7.41}$ | ${59.09}_{7.41}$ | ${47.73}_{7.53}$ | ${50.0}_{7.54}$ | ${54.55}_{7.51}$ | ${59.09}_{7.41}$ | ${52.27}_{7.53}$ |
| 19 | Action Exec. | ${55.13}_{5.63}$ | ${52.5}_{5.58}$ | ${50.0}_{5.59}$ | ${53.75}_{5.57}$ | ${46.25}_{5.57}$ | ${51.28}_{5.66}$ | ${42.5}_{5.53}$ |
| 19 | Effects | ${53.97}_{2.81}$ | ${54.89}_{2.79}$ | ${51.1}_{2.81}$ | ${40.06}_{2.75}$ | ${53.31}_{2.8}$ | ${46.3}_{2.83}$ | ${51.42}_{2.81}$ |
| 19 | Num. Reas. | ${51.28}_{5.66}$ | ${48.75}_{5.59}$ | ${50.0}_{5.59}$ | ${52.5}_{5.58}$ | ${51.25}_{5.59}$ | ${38.75}_{5.45}$ | ${53.75}_{5.57}$ |
| 19 | Hallucination | ${88.61}_{3.57}$ | ${78.75}_{4.57}$ | ${58.75}_{5.5}$ | ${67.5}_{5.24}$ | ${61.25}_{5.45}$ | ${56.96}_{5.57}$ | ${47.5}_{5.58}$ |
| 19 | Composite | ${62.82}_{5.47}$ | ${57.69}_{5.59}$ | ${51.61}_{6.35}$ | ${35.06}_{5.44}$ | ${59.68}_{6.23}$ | ${44.16}_{5.66}$ | - |
| 19 | AVG | ${68.17}_{1.13}$ | ${60.84}_{1.18}$ | ${50.91}_{1.21}$ | ${46.8}_{1.2}$ | ${48.62}_{1.21}$ | ${48.54}_{1.21}$ | ${58.43}_{1.22}$ |

Table 3: Performance of LLMs on the binary questions (T/F) without obfuscations and ramifications. The results are split up by action-sequence lengths and question categories.

Performance of Smaller Open-Source LLMs Approaches Random

Table 3 reveals that smaller models, specifically Gemma-2b and Llama2-7b, consistently exhibit poor performance across all action-sequence lengths, with results often approaching random chance levels. This trend highlights the challenges these LLMs encounter in comprehending and responding to the complexities of the tasks. Additionally, further analysis reveals that Llama2-13b’s performance is similarly close to random (50%) in the few-shot-1 setting. There is a slight improvement with few-shot-5, as seen in Table 5; however, it remains only marginally above random. This incremental improvement indicates some level of adaptation to the in-context examples, but overall, Llama2-13b struggles to achieve meaningful accuracy. These findings suggest that smaller open-source models are incapable of effectively performing RAC.

Fine-Tuning Improves Some Categories but Deteriorates Others

After fine-tuning Llama3-8b, we observed improvements in categories object tracking and fluent tracking. Conversely, there was a decline in performance in the categories effects of actions and hallucination detection. These results indicate that merely fine-tuning LLMs with more data might not address all the challenges in RAC.

| Plan Len. | Question Categories | GPT-4o (human annotated, few shot 1) | Gemini Pro (human annotated, few shot 1) | GPT-4o (RoBERTa, few shot 1) | Gemini Pro (RoBERTa, few shot 1) | LLaMa2 13b (RoBERTa, few shot 1) | LLaMa3 8b (RoBERTa, few shot 1) | LLaMa2 7b (RoBERTa, few shot 1) | Gemma 7b (RoBERTa, few shot 1) |
|---|---|---|---|---|---|---|---|---|---|
| 19 | Object Trk. | ${85.0}_{7.98}$ | ${80.0}_{8.94}$ | ${79.49}_{6.47}$ | ${43.59}_{7.94}$ | ${52.5}_{7.9}$ | ${57.5}_{7.82}$ | ${40.0}_{7.75}$ | ${37.5}_{7.65}$ |
| 19 | Fluent Trk. | ${30.0}_{10.25}$ | ${20.0}_{8.94}$ | ${9.09}_{4.33}$ | ${11.11}_{4.68}$ | ${4.26}_{2.94}$ | ${4.26}_{2.94}$ | ${10.64}_{4.5}$ | ${12.77}_{4.87}$ |
| 19 | State Trk. | ${45.0}_{11.12}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ |
| 19 | Action Exec. | ${45.0}_{11.12}$ | ${50.0}_{11.18}$ | ${54.29}_{8.42}$ | ${59.38}_{8.68}$ | ${7.69}_{4.27}$ | ${33.33}_{7.55}$ | ${17.95}_{6.15}$ | ${17.95}_{6.15}$ |
| 19 | Effects | ${30.0}_{10.25}$ | ${10.0}_{6.71}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ |
| 19 | Num. Reas. | ${15.0}_{7.98}$ | ${10.0}_{6.71}$ | ${13.51}_{5.62}$ | ${7.5}_{4.16}$ | ${2.5}_{2.47}$ | ${5.0}_{3.45}$ | ${2.5}_{2.47}$ | ${12.5}_{5.23}$ |
| 19 | Hallucination | ${40.0}_{10.95}$ | ${45.0}_{11.12}$ | ${40.43}_{7.16}$ | ${46.81}_{7.28}$ | ${12.77}_{4.87}$ | ${21.28}_{5.97}$ | ${17.02}_{5.48}$ | ${17.39}_{5.59}$ |
| 19 | Composite | ${50.0}_{11.18}$ | ${45.0}_{11.12}$ | ${47.37}_{8.1}$ | ${45.71}_{8.42}$ | ${43.59}_{7.94}$ | ${12.5}_{5.23}$ | ${43.59}_{7.94}$ | ${47.5}_{7.9}$ |
| 19 | AVG | ${42.5}_{3.91}$ | ${32.5}_{3.7}$ | ${34.36}_{3.15}$ | ${30.41}_{3.12}$ | ${12.18}_{1.99}$ | ${18.45}_{2.36}$ | ${13.65}_{2.09}$ | ${15.19}_{2.18}$ |

Table 4: Performance of LLMs on free-answer questions. Human annotation is performed for the proprietary LLMs, and RoBERTa-based evaluation is performed for every LLM.

Obfuscation Lowers the Performance Across All Dimensions

Our analysis, comparing Table 3 (without obfuscations) with Table 10 (with obfuscations), reveals that introducing obfuscations leads to a decrease in performance for all models across all question categories, action-sequence lengths, fluent types, and domain types. This decrease highlights the models’ reliance on knowledge acquired from pre-training data rather than on comprehending the rules given in the prompt context.

Numerical Reasoning is Near Random for All Action-sequence Lengths

An analysis of Figure 2 reveals that numerical reasoning tasks consistently get around 50% accuracy across both transportation and non-transportation domains. This level of accuracy indicates a fundamental limitation in the models’ ability to numerically reason about actions. This inability to perform numerical reasoning is not confined to RAC but is also reflected in broader contexts (Ahn et al., 2024).

5.0.1 Increasing the Number of Few-Shot Examples Helps Domain Understanding but Not Reasoning

| Question Category | Gemini Pro (zero shot) | Gemini Pro (few shot 1) | Gemini Pro (few shot 5) | LLaMa2 13b (zero shot) | LLaMa2 13b (few shot 1) | LLaMa2 13b (few shot 5) | Gemma 7b (zero shot) | Gemma 7b (few shot 1) | Gemma 7b (few shot 5) |
|---|---|---|---|---|---|---|---|---|---|
| Object Trk. | ${69.51}_{2.0}$ | ${67.42}_{2.04}$ | ${67.47}_{2.2}$ | ${56.06}_{2.16}$ | ${55.49}_{2.16}$ | ${60.14}_{4.17}$ | ${56.06}_{2.16}$ | ${47.16}_{2.17}$ | ${52.23}_{2.56}$ |
| Fluent Trk. | ${58.56}_{2.17}$ | ${58.75}_{2.17}$ | ${61.04}_{2.31}$ | ${39.88}_{2.16}$ | ${45.33}_{2.2}$ | ${55.43}_{5.18}$ | ${39.88}_{2.16}$ | ${50.88}_{2.21}$ | ${53.64}_{2.69}$ |
| State Trk. | ${47.73}_{7.53}$ | ${59.09}_{7.41}$ | ${56.1}_{7.75}$ | ${47.73}_{7.53}$ | ${47.73}_{7.53}$ | - | ${47.73}_{7.53}$ | ${59.09}_{7.41}$ | ${57.89}_{11.33}$ |
| Action Exec. | ${51.25}_{5.59}$ | ${52.5}_{5.58}$ | ${51.52}_{6.15}$ | ${50.0}_{5.59}$ | ${50.0}_{5.59}$ | ${61.54}_{13.49}$ | ${50.0}_{5.59}$ | ${51.28}_{5.66}$ | ${54.0}_{7.05}$ |
| Effects | ${53.63}_{2.8}$ | ${54.89}_{2.79}$ | ${64.87}_{2.86}$ | ${51.74}_{2.81}$ | ${51.1}_{2.81}$ | ${51.72}_{5.36}$ | ${51.74}_{2.81}$ | ${46.3}_{2.83}$ | ${63.93}_{3.24}$ |
| Num. Reas. | ${53.75}_{5.57}$ | ${48.75}_{5.59}$ | ${50.0}_{5.98}$ | ${51.25}_{5.59}$ | ${50.0}_{5.59}$ | ${46.15}_{9.78}$ | ${51.25}_{5.59}$ | ${38.75}_{5.45}$ | ${48.33}_{6.45}$ |
| Hallucination | ${80.0}_{4.47}$ | ${78.75}_{4.57}$ | ${94.12}_{2.85}$ | ${73.75}_{4.92}$ | ${58.75}_{5.5}$ | ${70.59}_{11.05}$ | ${73.75}_{4.92}$ | ${56.96}_{5.57}$ | ${81.63}_{5.53}$ |
| AVG | ${60.82}_{1.18}$ | ${60.84}_{1.18}$ | ${64.3}_{1.27}$ | ${48.0}_{1.2}$ | ${50.91}_{1.21}$ | ${56.57}_{2.57}$ | ${49.8}_{1.2}$ | ${48.54}_{1.21}$ | ${56.2}_{1.48}$ |

Table 5: Performance of LLMs on binary questions (T/F) on non-obfuscated data and without ramifications, for an action-sequence length of 19. The performance is split up across prompting techniques. Note that due to the limited context length of Llama2-13b, its few-shot-5 state-tracking result is missing.

An analysis of Gemini-1.0-Pro, presented in Table 5, reveals specific performance trends across different evaluation categories. Notably, performance improvements surpassing the SEM are primarily observed within the effects and hallucination detection categories. The accuracy in these categories heavily relies on the model’s ability to understand the consequences of actions and accurately identify hallucinated fluents, objects, and actions. With additional examples, Gemini exhibits performance gains in these categories. Specifically, moving from few-shot-1 to few-shot-5 yields an increase of about 10 percentage points in the effects category (54.89 to 64.87) and about 15 percentage points in hallucination detection (78.75 to 94.12). This indicates that the model’s understanding of domain-specific dynamics and its ability to detect non-existent entities improve considerably with increased exposure to relevant examples. Conversely, introducing additional examples does not yield similar benefits in the remaining categories. This stagnation suggests a limitation in the model’s current architecture or training regimen, which fails to enhance its reasoning capabilities despite additional examples.
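The comparison against the SEM can be reproduced from the Gemini Pro columns of Table 5. The Python sketch below uses one possible criterion, namely that the gain from few-shot-1 to few-shot-5 must exceed the larger of the two SEMs; this criterion and the names `GEMINI_PRO` and `gain_exceeds_sem` are assumptions made for illustration, since the exact rule is not spelled out in the text.

```python
# (accuracy, SEM) pairs for Gemini Pro, copied from Table 5:
# category -> (few shot 1, few shot 5)
GEMINI_PRO = {
    "Object Trk.": ((67.42, 2.04), (67.47, 2.2)),
    "Fluent Trk.": ((58.75, 2.17), (61.04, 2.31)),
    "State Trk.": ((59.09, 7.41), (56.1, 7.75)),
    "Action Exec.": ((52.5, 5.58), (51.52, 6.15)),
    "Effects": ((54.89, 2.79), (64.87, 2.86)),
    "Num. Reas.": ((48.75, 5.59), (50.0, 5.98)),
    "Hallucination": ((78.75, 4.57), (94.12, 2.85)),
}


def gain_exceeds_sem(before, after):
    """Assumed criterion: the accuracy gain is larger than both SEMs."""
    (acc_before, sem_before), (acc_after, sem_after) = before, after
    return (acc_after - acc_before) > max(sem_before, sem_after)


improved = [
    category
    for category, (before, after) in GEMINI_PRO.items()
    if gain_exceeds_sem(before, after)
]
# improved == ["Effects", "Hallucination"]
```

Under this criterion, only the effects and hallucination detection categories show a gain that stands out from the reported uncertainty, in line with the observation above.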

5.1 GPT-4o and Gemini-1.0-Pro Case Study

Due to the poor performance of open-source models, we focus on the two best-performing proprietary LLMs, GPT-4o and Gemini-1.0-Pro.

5.1.1 Observations on Action-Sequence Lengths

Performance Degrades with Increasing Action-Sequence Length (with and without obfuscations).

The data presented in Tables 3, 9, 10, and 11 reveal a consistent trend across all LLMs: performance degrades as action-sequence length increases, with the variations falling within the SEM. The trend holds for both obfuscated and non-obfuscated data, with and without ramifications.

Action Executability is Affected Most Drastically by Increasing Action-Sequence Length.

As shown in Fig 2, for proprietary LLMs, the action executability category exhibits the most significant decrease in performance, followed by the state-tracking category. Conversely, categories such as object tracking and effects of actions show minimal degradation as action-sequence length increases.
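As a rough illustration of this comparison, the sketch below computes the accuracy change between action-sequence lengths 1 and 19 for three categories, using the GPT-4o zero-shot values reported in Table 8. The dictionary layout and the helper name are assumptions made for illustration, and the figure referenced above may be based on a different setting.

```python
# GPT-4o zero-shot accuracies from Table 8, keyed by action-sequence length.
gpt4o_zero_shot = {
    "Action Exec.": {1: 80.0, 10: 55.0, 19: 52.5},
    "Object Trk.": {1: 74.48, 10: 71.19, 19: 68.94},
    "Effects": {1: 63.44, 10: 62.19, 19: 56.78},
}

def accuracy_drop(by_length, start=1, end=19):
    """Drop in accuracy (percentage points) between the start and end lengths."""
    return round(by_length[start] - by_length[end], 2)

drops = {category: accuracy_drop(values) for category, values in gpt4o_zero_shot.items()}
most_affected = max(drops, key=drops.get)

print(drops)
print(most_affected)
```

For these three categories, the drop for action executability is several times larger than the drops for object tracking and effects of actions.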

5.1.2 Observations on Fluent Types

Fluent Type	GPT-4o Baseline	GPT-4o Baseline + R.	GPT-4o O. Baseline	GPT-4o O. Baseline + R.	Gemini-1.0-Pro Baseline	Gemini-1.0-Pro Baseline + R.	Gemini-1.0-Pro O. Baseline	Gemini-1.0-Pro O. Baseline + R.
Base Fl	${68.22}_{2.51}$	${68.9}_{2.5}$	${61.05}_{2.63}$	${58.43}_{2.66}$	${61.34}_{2.63}$	${63.95}_{2.59}$	${58.43}_{2.66}$	${61.05}_{2.63}$
Base Fl + Cnstr.	${72.51}_{2.32}$	${76.14}_{2.21}$	${64.08}_{2.48}$	${68.82}_{2.4}$	${64.88}_{2.47}$	${64.25}_{2.48}$	${56.84}_{2.56}$	${59.52}_{2.54}$
Derived Fl	${62.37}_{2.82}$	${63.64}_{2.79}$	${53.2}_{2.9}$	${56.27}_{2.89}$	${60.94}_{2.83}$	${62.46}_{2.83}$	${58.59}_{2.86}$	${62.96}_{2.8}$
Static Props	${72.89}_{2.4}$	${71.59}_{2.43}$	${67.83}_{2.52}$	${66.67}_{2.55}$	${57.39}_{2.66}$	${56.52}_{2.67}$	${55.94}_{2.67}$	${55.94}_{2.67}$
Positive Fluents	${80.72}_{1.6}$	${83.74}_{1.5}$	${70.94}_{1.84}$	${73.07}_{1.8}$	${72.91}_{1.8}$	${74.92}_{1.76}$	${65.19}_{1.93}$	${68.47}_{1.88}$
Negative Fluents	${68.17}_{2.06}$	${66.28}_{2.09}$	${59.84}_{2.16}$	${59.65}_{2.17}$	${54.0}_{2.2}$	${54.99}_{2.2}$	${54.78}_{2.2}$	${56.34}_{2.19}$

Table 6: Performance of GPT-4o and Gemini-1.0-Pro on few shot 1 data with action-sequence length 19. "O." stands for obfuscated data and "+ R." for data with ramifications.

Base Fluents with Constraints Outperform Other Fluents

Among all the subcategories defined in Section 3.2, base fluents with constraints are generally among the best-performing fluent types across categories and LLMs. This suggests that the models handle self-dependent properties better than properties that rely on other properties.

Performance Degradation with Increasing Action-Sequence Lengths

Figure 7 illustrates the performance trends across different fluent types as action-sequence length increases. While base fluents show the highest initial performance at length 1, this performance declines sharply as action-sequence length increases. In contrast, questions involving static properties consistently display the lowest performance across all action-sequence lengths. Meanwhile, questions with persistent fluents show stable performance, with no significant decrease as the action-sequence length increases.

Models Struggle with Negative Fluents

Table 6 shows that models perform better on questions involving positive fluents. A consistent performance gap of approximately 25 percentage points exists between positive and negative fluents across the base, static, and base with constraints subcategories. Performance on negative fluents is nearly random. For derived fluents, the gap between negative and positive fluents is negligible, with both achieving roughly 60% accuracy. These findings indicate that LLMs struggle with negated fluents. This aligns with the literature, which also reports weaker performance on tasks involving negation (Truong et al., 2023). Effective reasoning about negated fluents is crucial for accurately understanding state changes and determining the viability of actions. Failures in these areas, particularly in complex action-sequences, can lead to errors.
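The aggregate gap can be read directly off the Positive Fluents and Negative Fluents rows of Table 6. The sketch below does this for the baseline setting only; the variable and function names are illustrative assumptions. The per-subcategory gaps discussed above are not listed in Table 6 and are therefore not recomputed here.

```python
# Baseline (non-obfuscated, no ramifications) accuracies from Table 6.
table6_baseline = {
    "GPT-4o": {"positive": 80.72, "negative": 68.17},
    "Gemini-1.0-Pro": {"positive": 72.91, "negative": 54.0},
}

def polarity_gap(scores):
    """Positive-minus-negative accuracy gap in percentage points."""
    return round(scores["positive"] - scores["negative"], 2)

gaps = {model: polarity_gap(scores) for model, scores in table6_baseline.items()}

# Distance of negative-fluent accuracy from the 50% chance level of a binary task.
above_chance = {
    model: round(scores["negative"] - 50.0, 2)
    for model, scores in table6_baseline.items()
}

print(gaps)
print(above_chance)
```

In this aggregate view, both models score lower on negative fluents, and Gemini-1.0-Pro sits only a few points above the 50% chance level of a binary task.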

Models’ Inability to Recognize Null Effects of Actions on Static Properties

Figure 3 illustrates with statistical significance that model performance on static properties is hindered by their inability to comprehend the effects of actions (or lack thereof) on these properties. Specifically, models fail to recognize that certain state properties remain unchanged by any action, resulting in performance slightly below random chance levels. This misunderstanding causes considerable inaccuracies in tasks involving static properties, highlighting a fundamental limitation in current models’ ability to accurately interpret and predict states in dynamic contexts.

5.1.3 Observations on Free Answers vs. True/False

Comparing LLM performance in Table 3 (binary questions) with that in Table 13 (free-answer questions) shows that the Fluent Tracking and Hallucination Detection categories, which performed well on binary questions, perform poorly on free-answer questions. Conversely, the Composite Questions category shows improved performance on free-answer questions. The RoBERTa model also struggles to classify free-answer responses and does not evaluate them with human-level accuracy.

6 Conclusion

In this work, we introduce ActionReasoningBench, the first benchmark for evaluating LLMs across several aspects of RAC, including Object Tracking, Fluent Tracking, State Tracking, Action Executability, Effects of Actions, Numerical RAC, Hallucination Detection, and Composite Questions. We also introduce ramification constraints that account for the indirect effects of some actions. Using ActionReasoningBench, we find that only large models like GPT-4o and Gemini-1.0-Pro can perform some RAC tasks, achieving average performances of 42.5% and 32.5% on Free Answers, respectively. Furthermore, RAC performance degrades with increasing action-sequence length, most sharply for the Action Executability category. All LLMs also perform poorly on numerical reasoning at every action-sequence length. Additionally, LLMs struggle with fluents, especially in understanding negative fluents and the effects of actions on static properties. Although ramification constraints can enhance performance on base fluents, they do not affect derived fluents. Finally, obfuscations lead to a decrease in performance across all dataset dimensions for all models. We hope that ActionReasoningBench will serve as a valuable benchmark for the research community, facilitating the assessment of LLMs in various aspects of RAC.

References

(1) [2109.01653] CREAK: A Dataset for Commonsense Reasoning over Entity Knowledge. https://arxiv.longhoe.net/abs/2109.01653.
(2) Differentiable Open-Ended Commonsense Reasoning - ACL Anthology. https://aclanthology.org/2021.naacl-main.366/.
Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
Aghzal et al. (2023) Mohamed Aghzal, Erion Plaku, and Ziyu Yao. 2023. Can large language models be good path planners? a benchmark and investigation on spatial-temporal reasoning. arXiv preprint arXiv:2310.03249.
Ahn et al. (2024) Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. 2024. Large language models for mathematical reasoning: Progresses and challenges. arXiv preprint arXiv:2402.00157.
AI (2024) Meta AI. 2024. Introducing meta llama 3: The most capable openly available llm to date. https://ai.meta.com/blog/meta-llama-3/. Accessed: 2024-06-04.
Aksitov et al. (2023) Renat Aksitov, Sobhan Miryoosefi, Zonglin Li, Daliang Li, Sheila Babayan, Kavya Kopparapu, Zachary Fisher, Ruiqi Guo, Sushant Prakash, Pranesh Srinivasan, et al. 2023. Rest meets react: Self-improvement for multi-step reasoning llm agent. arXiv preprint arXiv:2312.10003.
Amini et al. (2019) Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms.
Banerjee et al. (2020) Pratyay Banerjee, Chitta Baral, Man Luo, Arindam Mitra, Kuntal Pal, Tran C. Son, and Neeraj Varshney. 2020. Can Transformers Reason About Effects of Actions?
Bisk et al. (2019) Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2019. PIQA: Reasoning about Physical Commonsense in Natural Language.
Chen et al. (2022) Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, and William Yang Wang. 2022. FinQA: A Dataset of Numerical Reasoning over Financial Data.
Clark et al. (2019) Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions.
Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge.
Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training Verifiers to Solve Math Word Problems.
Fei et al. (2023) Zhiwei Fei, Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Songyang Zhang, Kai Chen, Zongwen Shen, and Jidong Ge. 2023. LawBench: Benchmarking Legal Knowledge of Large Language Models.
Fikes and Nilsson (1971) Richard E Fikes and Nils J Nilsson. 1971. Strips: A new approach to the application of theorem proving to problem solving. Artificial intelligence, 2(3-4):189–208.
Geva et al. (2021) Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. 2021. Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies.
Guan et al. (2023) Lin Guan, Karthik Valmeekam, Sarath Sreedharan, and Subbarao Kambhampati. 2023. Leveraging pre-trained large language models to construct and utilize world models for model-based task planning. Advances in Neural Information Processing Systems, 36:79081–79094.
Guha et al. (2023) Neel Guha, Julian Nyarko, Daniel Ho, Christopher Ré, Adam Chilton, Aditya K, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel Rockmore, Diego Zambrano, Dmitry Talisman, Enam Hoque, Faiz Surani, Frank Fagan, Galit Sarfaty, Gregory Dickinson, Haggai Porat, Jason Hegland, Jessica Wu, Joe Nudell, Joel Niklaus, John Nay, Jonathan Choi, Kevin Tobia, Margaret Hagan, Megan Ma, Michael Livermore, Nikon Rasumov-Rahe, Nils Holzenberger, Noam Kolt, Peter Henderson, Sean Rehaag, Sharad Goel, Shang Gao, Spencer Williams, Sunny Gandhi, Tom Zur, Varun Iyer, and Zehua Li. 2023. LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models. Advances in Neural Information Processing Systems, 36:44123–44279.
Han et al. (2024) Simeng Han, Hailey Schoelkopf, Yilun Zhao, Zhenting Qi, Martin Riddell, Wenfei Zhou, James Coady, David Peng, Yujie Qiao, Luke Benson, Lucy Sun, Alex Wardle-Solano, Hannah Szabo, Ekaterina Zubova, Matthew Burtell, Jonathan Fan, Yixin Liu, Brian Wong, Malcolm Sailor, Ansong Ni, Linyong Nan, Jungo Kasai, Tao Yu, Rui Zhang, Alexander R. Fabbri, Wojciech Kryscinski, Semih Yavuz, Ye Liu, Xi Victoria Lin, Shafiq Joty, Yingbo Zhou, Caiming Xiong, Rex Ying, Arman Cohan, and Dragomir Radev. 2024. FOLIO: Natural Language Reasoning with First-Order Logic.
Haslum et al. (2019) Patrik Haslum, Nir Lipovetzky, Daniele Magazzeni, Christian Muise, Ronald Brachman, Francesca Rossi, and Peter Stone. 2019. An introduction to the planning domain definition language, volume 13. Springer.
He et al. (2023) Weinan He, Canming Huang, Zhanhao Xiao, and Yongmei Liu. 2023. Exploring the capacity of pretrained language models for reasoning about actions and change. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4629–4643.
Huang and Chang (2022) Jie Huang and Kevin Chen-Chuan Chang. 2022. Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403.
Huang et al. (2019) Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Cosmos QA: Machine Reading Comprehension with Contextual Commonsense Reasoning.
Imani et al. (2023) Shima Imani, Liang Du, and Harsh Shrivastava. 2023. Mathprompter: Mathematical reasoning using large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), pages 37–42.
International Conference on Automated Planning and Scheduling (1998) International Conference on Automated Planning and Scheduling. 1998. Icaps competitions. Accessed: 2024-04-25.
Kohli and Sun (2024) Harsh Kohli and Huan Sun. 2024. Cleared for takeoff? compositional & conditional reasoning may be the achilles heel to (flight-booking) language agents. arXiv preprint arXiv:2404.04237.
Koncel-Kedziorski et al. (2016) Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. 2016. MAWPS: A Math Word Problem Repository. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1152–1157, San Diego, California. Association for Computational Linguistics.
Lin et al. (2021) Bill Yuchen Lin, Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Xiang Ren, and William W. Cohen. 2021. Differentiable Open-Ended Commonsense Reasoning.
Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
Ling et al. (2017) Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. Program Induction by Rationale Generation : Learning to Solve and Explain Algebraic Word Problems.
Liu et al. (2023) Hanmeng Liu, Zhiyang Teng, Leyang Cui, Chaoli Zhang, Qiji Zhou, and Yue Zhang. 2023. LogiCoT: Logical Chain-of-Thought Instruction-Tuning.
Liu et al. (2020) Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. 2020. LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning.
Liu et al. (2022) Tianyu Liu, Yizhe Zhang, Chris Brockett, Yi Mao, Zhifang Sui, Weizhu Chen, and Bill Dolan. 2022. A Token-level Reference-free Hallucination Detection Benchmark for Free-form Text Generation.
Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
Lourie et al. (2021) Nicholas Lourie, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark.
Luo et al. (2024) Man Luo, Shrinidhi Kumbhar, Ming shen, Mihir Parmar, Neeraj Varshney, Pratyay Banerjee, Somak Aditya, and Chitta Baral. 2024. Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models.
McCarthy et al. (1963) John McCarthy et al. 1963. Situations, actions, and causal laws. Comtex Scientific.
McIlraith (2000) Sheila A McIlraith. 2000. Integrating actions and state constraints: A closed-form solution to the ramification problem (sometimes). Artificial Intelligence, 116(1-2):87–121.
Miao et al. (2021) Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2021.
Mihaylov et al. (2018) Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering.
Mu et al. (2023) Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Dan Hendrycks, and David Wagner. 2023. Can llms follow simple rules? arXiv preprint arXiv:2311.04235.
Parmar et al. (2024a) Mihir Parmar, Nisarg Patel, Neeraj Varshney, Mutsumi Nakamura, Man Luo, Santosh Mashetty, Arindam Mitra, and Chitta Baral. 2024a. Towards systematic evaluation of logical reasoning ability of large language models. arXiv preprint arXiv:2404.15522.
Parmar et al. (2024b) Mihir Parmar, Nisarg Patel, Neeraj Varshney, Mutsumi Nakamura, Man Luo, Santosh Mashetty, Arindam Mitra, and Chitta Baral. 2024b. Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models.
Patel et al. (2021) Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP Models really able to Solve Simple Math Word Problems?
Reiter (2001) Raymond Reiter. 2001. Knowledge in action: logical foundations for specifying and implementing dynamical systems.
Rintanen (2004) Jussi Rintanen. 2004. Complexity of planning with partial observability. In ICAPS, volume 4, pages 345–354.
Sakaguchi et al. (2021) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106.
Sampat et al. (2022a) Shailaja Keyur Sampat, Pratyay Banerjee, Yezhou Yang, and Chitta Baral. 2022a. Learning Action-Effect Dynamics for Hypothetical Vision-Language Reasoning Task.
Sampat et al. (2022b) Shailaja Keyur Sampat, Maitreya Patel, Subhasish Das, Yezhou Yang, and Chitta Baral. 2022b. Reasoning about Actions over Visual and Linguistic Modalities: A Survey.
Sap et al. (2019) Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A Smith, and Yejin Choi. 2019. Atomic: An atlas of machine commonsense for if-then reasoning. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 3027–3035.
Sharma (2019) Arpit Sharma. 2019. Using answer set programming for commonsense reasoning in the winograd schema challenge. Theory and Practice of Logic Programming, 19(5-6):1021–1037.
Spiliopoulou et al. (2022) Evangelia Spiliopoulou, Artidoro Pagnoni, Yonatan Bisk, and Eduard Hovy. 2022. Events realm: Event reasoning of entity states via language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1982–1997.
Stein and Koller (2023) Katharina Stein and Alexander Koller. 2023. Autoplanbench: Automatically generating benchmarks for llm planners from pddl. arXiv preprint arXiv:2311.09830.
Tafjord et al. (2021) Oyvind Tafjord, Bhavana Dalvi Mishra, and Peter Clark. 2021. ProofWriter: Generating Implications, Proofs, and Abductive Statements over Natural Language.
Talmor et al. (2019) Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158.
Tan et al. (2023) Qingyu Tan, Hwee Tou Ng, and Lidong Bing. 2023. Towards benchmarking and improving the temporal reasoning capability of large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14820–14835.
Team et al. (2023) Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
Team et al. (2024) Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. 2024. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295.
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Truong et al. (2023) Thinh Hung Truong, Timothy Baldwin, Karin Verspoor, and Trevor Cohn. 2023. Language models are not naysayers: an analysis of language models on negation benchmarks. In Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023), pages 101–114.
Valmeekam et al. (2024a) Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. 2024a. Planbench: An extensible benchmark for evaluating large language models on planning and reasoning about change. Advances in Neural Information Processing Systems, 36.
Valmeekam et al. (2024b) Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. 2024b. On the planning abilities of large language models-a critical investigation. Advances in Neural Information Processing Systems, 36.
Wang et al. (2024) Siyuan Wang, Zhongyu Wei, Yejin Choi, and Xiang Ren. 2024. Can llms reason with rules? logic scaffolding for stress-testing and improving llms. arXiv preprint arXiv:2402.11442.
Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837.
Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering.
Yao et al. (2022) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.
Zhang et al. (2018) Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. Record: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint arXiv:1810.12885.
Zhao et al. (2024) Haiteng Zhao, Chang Ma, Guoyin Wang, Jing Su, Lingpeng Kong, Jingjing Xu, Zhi-Hong Deng, and Hongxia Yang. 2024. Empowering large language model agents through action learning. arXiv preprint arXiv:2402.15809.
Zhou et al. (2024) Qinhao Zhou, Zihan Zhang, Xiang Xiang, Ke Wang, Yuchuan Wu, and Yongbin Li. 2024. Enhancing the general agent capabilities of low-parameter llms through tuning and multi-branch reasoning. arXiv preprint arXiv:2403.19962.

Appendix

Limitations

While reasoning about actions is not tied to the English language, ActionReasoningBench is currently restricted to questions formulated in English. Although we evaluated a range of models, including several open-source LLMs and two state-of-the-art proprietary LLMs (GPT-4o and Gemini-1.0-Pro), resource constraints prevented us from extending our assessment to additional models with different architectures or training methodologies.

Appendix A Additional Literature

A.1 LLMs for Reasoning

Reasoning is a fundamental cognitive process that allows individuals to analyze information, deduce conclusions, solve problems, and make decisions. It is crucial across various domains, from everyday decision-making to scientific research and policy formulation. Effective reasoning leads to better understanding, innovative solutions, and the ability to navigate complex situations, making it an essential skill for humans. As we continue to explore and enhance our own cognitive abilities, there has been a growing interest in emulating similar reasoning processes within artificial intelligence systems. This pursuit has led to significant advancements in the development of LLMs. With the advent of LLMs and their prowess in various natural language tasks, it is speculated that these models exhibit reasoning capabilities when they are sufficiently large (Huang and Chang, 2022; Wei et al., 2022). However, to accurately assess the reasoning capabilities of these LLMs, it is crucial to establish benchmarks and frameworks that can evaluate their ability to reason across diverse contexts.

A.2 Legal Reasoning

Legal reasoning is a complex process essential to the legal profession, involving the application of legal rules to specific facts to resolve disputes and make decisions. It requires a deep understanding of laws, statutes, and case precedents, as well as the ability to interpret these in various factual contexts. Notable benchmarks for legal reasoning include works by Guha et al. (2023) and Fei et al. (2023).

A.3 Logical Reasoning

Logical reasoning is an essential cognitive skill that involves the analysis and evaluation of arguments based on the principles of logic. It enables the identification of strong and weak arguments, detection of fallacies, and construction of coherent reasoning. Fundamental across disciplines like mathematics, science, philosophy, and law, logical reasoning allows for drawing conclusions from premises, problem-solving, and informed decision-making. Representative benchmarks for logical reasoning include works by Luo et al. (2024), Han et al. (2024), Liu et al. (2023), Liu et al. (2020), Parmar et al. (2024b), and Tafjord et al. (2021).

A.4 Arithmetic Reasoning

Arithmetic reasoning is the process of using mathematical principles to analyze and solve numerical problems. This skill extends beyond simple calculation to involve understanding numerical relationships, patterns, and sequences. Essential in fields such as finance, engineering, and everyday activities like budgeting, arithmetic reasoning allows individuals to interpret data, forecast outcomes, and make informed decisions. Representative benchmarks include works by Amini et al. (2019), Chen et al. (2022), Cobbe et al. (2021), Koncel-Kedziorski et al. (2016), Ling et al. (2017), Liu et al. (2022), Miao et al. (2021), and Patel et al. (2021).

A.5 Commonsense Reasoning

Commonsense reasoning involves using practical knowledge gained from daily life to make intuitive judgments and decisions. It encompasses understanding societal norms and basic physical principles, allowing individuals to efficiently navigate social and practical tasks. This type of reasoning is essential for adapting to new environments, predicting outcomes, and managing everyday activities, forming a fundamental part of human cognition. Representative benchmarks include CREAK (1), Differentiable Open-Ended Commonsense Reasoning (2), and works by Bisk et al. (2019), Clark et al. (2019), Clark et al. (2018), Geva et al. (2021), Huang et al. (2019), Lin et al. (2021), Lourie et al. (2021), Mihaylov et al. (2018), Talmor et al. (2019), and Yang et al. (2018).

A.6 Reasoning about Actions and Change

Reasoning about actions and their effects is essential for both humans and AI systems. For humans, this capability enables us to plan, make decisions, and navigate complex relationships within our environment. Similarly, in the field of AI, reasoning about actions is crucial for enabling intelligent systems to effectively achieve goals, manage uncertainty, and adapt to the dynamic, ever-changing world. This ability is foundational for both humans and AI to interact safely and efficiently with their surroundings. Relevant works by Valmeekam et al. (2024a), Valmeekam et al. (2024b), and Guan et al. (2023) explore reasoning about actions but focus more on the planning capabilities of Large Language Models. Banerjee et al. (2020) investigate the ability of Large Language Models to reason about actions over only four domains, and the synthetic data generated in their work addresses only three question types. He et al. (2023) test the ability of LLMs on RAC only on a variant of Blocksworld with four question types, of which two focus on reasoning about actions and their effects while the other two focus on planning.

Through our work, we provide an in-depth analysis of LLMs’ capability to reason about actions and their effects with and without ramification constraints, which the related works discussed above do not cover.

Appendix B Data Verification

Three annotators were asked to rate, on a scale from 1 to 5, how natural the sentences in the dataset read. The annotators were given the following instruction along with the dataset:

Rate the Prompts from 1 to 5, based on how natural they appear in English.

Table 7 shows the scores given by three annotators across all the domains present in ActionReasoningBench.

Domain	Annotator 1 Score	Annotator 2 Score	Annotator 3 Score	Average
Blocksworld	3.8	5.0	3.8	4.2
Depots	4.6	3.6	4.0	4.1
Driverlog	4.8	3.6	4.4	4.3
Goldminer	5.0	4.0	4.6	4.5
Grippers	4.6	4.0	4.6	4.4
Logistics	5.0	3.4	4.4	4.3
Miconic	5.0	4.4	4.0	4.5
Mystery	4.0	4.0	3.6	3.9
NPuzzle	3.0	3.6	4.6	3.7
Satellite	4.4	4.6	4.4	4.5
Spanner	5.0	5.0	3.8	4.6
Visitall	3.6	3.8	4.0	3.8
Zenotravel	4.6	4.0	3.8	4.1
Average	4.4	4.1	4.2	4.2

Table 7: Naturalness scores (out of 5) across all domains.
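The Average column in Table 7 is consistent with the mean of the three annotator scores rounded to one decimal place. The sketch below checks this for three domains copied from the table; the rounding convention is an assumption, since the paper does not state how the averages were computed.

```python
from statistics import mean

# Per-annotator naturalness scores for three domains, copied from Table 7.
annotator_scores = {
    "Blocksworld": [3.8, 5.0, 3.8],
    "NPuzzle": [3.0, 3.6, 4.6],
    "Spanner": [5.0, 5.0, 3.8],
}

# Assumed convention: arithmetic mean rounded to one decimal place.
domain_average = {
    domain: round(mean(scores), 1) for domain, scores in annotator_scores.items()
}

print(domain_average)
```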

Appendix C Additional Results

In this section, we provide additional results and discussion for our benchmark.

C.1 Performance on Binary (True/False) Questions

Act. Seq.	Question Category	GPT-4o (zero shot)	Gemini Pro (zero shot)	LLaMa2 13b (zero shot)	LLaMa3 8b (zero shot)	LLaMa2 7b (zero shot)	Gemma 7b (zero shot)
1	Object Trk.	${74.48}_{1.9}$	${70.86}_{1.98}$	${60.19}_{2.14}$	${41.52}_{2.15}$	${45.71}_{2.17}$	${60.19}_{2.14}$
1	Fluent Trk.	${69.85}_{1.97}$	${62.5}_{2.08}$	${50.55}_{2.14}$	${33.82}_{2.03}$	${43.93}_{2.13}$	${50.55}_{2.14}$
1	State Trk.	${89.09}_{4.2}$	${52.73}_{6.73}$	${50.91}_{6.74}$	${43.64}_{6.69}$	${50.91}_{6.74}$	${50.91}_{6.74}$
1	Action Exec.	${80.0}_{4.47}$	${63.29}_{5.42}$	${50.0}_{5.59}$	${46.25}_{5.57}$	${48.75}_{5.59}$	${50.0}_{5.59}$
1	Effects	${63.44}_{2.69}$	${54.37}_{2.78}$	${51.88}_{2.79}$	${41.88}_{2.76}$	${54.37}_{2.78}$	${51.88}_{2.79}$
1	Num. Reas.	${55.0}_{5.56}$	${53.75}_{5.57}$	${53.75}_{5.57}$	${48.75}_{5.59}$	${50.0}_{5.59}$	${53.75}_{5.57}$
1	Hallucination	${88.75}_{3.53}$	${83.75}_{4.12}$	${67.5}_{5.24}$	${57.5}_{5.53}$	${50.0}_{5.59}$	${67.5}_{5.24}$
1	Composite	${71.25}_{5.06}$	${57.5}_{5.53}$	${3.75}_{2.12}$	${38.75}_{5.45}$	${0.0}_{0.0}$	${46.25}_{5.57}$
1	AVG	${71.37}_{1.08}$	${63.58}_{1.15}$	${52.44}_{1.19}$	${40.42}_{1.17}$	${45.35}_{1.19}$	${54.37}_{1.19}$
10	Object Trk.	${71.19}_{1.94}$	${70.09}_{1.96}$	${57.06}_{2.12}$	${42.02}_{2.11}$	${53.03}_{2.14}$	${57.06}_{2.12}$
10	Fluent Trk.	${68.63}_{1.99}$	${60.15}_{2.1}$	${45.94}_{2.14}$	${37.27}_{2.08}$	${42.07}_{2.12}$	${45.94}_{2.14}$
10	State Trk.	${90.74}_{3.94}$	${53.7}_{6.79}$	${48.15}_{6.8}$	${37.04}_{6.57}$	${50.0}_{6.8}$	${48.15}_{6.8}$
10	Action Exec.	${55.0}_{5.56}$	${55.0}_{5.56}$	${50.0}_{5.59}$	${53.75}_{5.57}$	${50.0}_{5.59}$	${50.0}_{5.59}$
10	Effects	${62.19}_{2.71}$	${55.0}_{2.78}$	${50.62}_{2.79}$	${47.81}_{2.79}$	${56.56}_{2.77}$	${50.62}_{2.79}$
10	Num. Reas.	${52.5}_{5.58}$	${47.5}_{5.58}$	${48.75}_{5.59}$	${50.0}_{5.59}$	${50.0}_{5.59}$	${48.75}_{5.59}$
10	Hallucination	${76.25}_{4.76}$	${76.25}_{4.76}$	${67.5}_{5.24}$	${43.75}_{5.55}$	${50.0}_{5.59}$	${67.5}_{5.24}$
10	Composite	${67.5}_{5.24}$	${52.5}_{5.58}$	${1.25}_{1.24}$	${35.0}_{5.33}$	${0.0}_{0.0}$	${43.75}_{5.55}$
10	AVG	${67.88}_{1.11}$	${61.65}_{1.15}$	${49.52}_{1.18}$	${42.11}_{1.17}$	${47.45}_{1.18}$	${51.43}_{1.18}$
19	Object Trk.	${68.94}_{2.01}$	${69.51}_{2.0}$	${56.06}_{2.16}$	${40.72}_{2.14}$	${47.54}_{2.17}$	${56.06}_{2.16}$
19	Fluent Trk.	${70.04}_{2.02}$	${58.56}_{2.17}$	${39.88}_{2.16}$	${33.46}_{2.08}$	${36.38}_{2.12}$	${39.88}_{2.16}$
19	State Trk.	${93.18}_{3.8}$	${47.73}_{7.53}$	${47.73}_{7.53}$	${40.91}_{7.41}$	${47.73}_{7.53}$	${47.73}_{7.53}$
19	Action Exec.	${52.5}_{5.58}$	${51.25}_{5.59}$	${50.0}_{5.59}$	${53.75}_{5.57}$	${50.0}_{5.59}$	${50.0}_{5.59}$
19	Effects	${56.78}_{2.78}$	${53.63}_{2.8}$	${51.74}_{2.81}$	${47.32}_{2.8}$	${50.16}_{2.81}$	${51.74}_{2.81}$
19	Num. Reas.	${52.5}_{5.58}$	${53.75}_{5.57}$	${51.25}_{5.59}$	${47.5}_{5.58}$	${50.0}_{5.59}$	${51.25}_{5.59}$
19	Hallucination	${87.5}_{3.7}$	${80.0}_{4.47}$	${73.75}_{4.92}$	${55.0}_{5.56}$	${50.0}_{5.59}$	${73.75}_{4.92}$
19	Composite	${68.75}_{5.18}$	${51.25}_{5.59}$	${1.25}_{1.24}$	${33.75}_{5.29}$	${2.5}_{1.75}$	${40.0}_{5.48}$
19	AVG	${66.98}_{1.13}$	${60.82}_{1.18}$	${48.0}_{1.2}$	${41.03}_{1.19}$	${42.95}_{1.19}$	${49.8}_{1.2}$

Table 8: Zero Shot performance without obfuscations and ramifications across all question categories and action-sequence lengths

| Act. Seq. | Question Categories | GPT-4o (few shot 1) | Gemini Pro (few shot 1) | LLaMa2 13b (few shot 1) | LLaMa3 8b (few shot 1) | LLaMa2 7b (few shot 1) | Gemma 7b (few shot 1) |
|---|---|---|---|---|---|---|---|
| 1 | Object Trk. | ${80.57}_{1.73}$ | ${71.15}_{1.99}$ | ${57.9}_{2.15}$ | ${45.9}_{2.17}$ | ${46.67}_{2.18}$ | ${48.95}_{2.18}$ |
| 1 | Fluent Trk. | ${85.48}_{1.51}$ | ${66.98}_{2.03}$ | ${45.51}_{2.15}$ | ${41.0}_{2.12}$ | ${42.51}_{2.14}$ | ${53.14}_{2.14}$ |
| 1 | State Trk. | ${70.91}_{6.12}$ | ${67.27}_{6.33}$ | ${52.73}_{6.73}$ | ${47.27}_{6.73}$ | ${54.55}_{6.71}$ | ${58.18}_{6.65}$ |
| 1 | Action Exec. | ${80.0}_{4.47}$ | ${73.75}_{4.92}$ | ${51.25}_{5.59}$ | ${51.25}_{5.59}$ | ${50.0}_{5.66}$ | ${52.5}_{5.58}$ |
| 1 | Effects | ${60.0}_{2.74}$ | ${59.69}_{2.74}$ | ${47.19}_{2.79}$ | ${38.71}_{2.77}$ | ${50.32}_{2.85}$ | ${49.69}_{2.8}$ |
| 1 | Num. Reas. | ${58.75}_{5.5}$ | ${51.25}_{5.59}$ | ${55.0}_{5.56}$ | ${46.25}_{5.57}$ | ${58.97}_{5.57}$ | ${45.0}_{5.56}$ |
| 1 | Hallucination | ${95.0}_{2.44}$ | ${85.0}_{3.99}$ | ${53.75}_{5.57}$ | ${63.64}_{5.48}$ | ${62.67}_{5.59}$ | ${66.25}_{5.29}$ |
| 1 | Composite | - | - | - | - | - | - |
| 1 | AVG | ${77.55}_{1.02}$ | ${67.32}_{1.15}$ | ${51.08}_{1.22}$ | ${44.12}_{1.22}$ | ${47.73}_{1.23}$ | ${51.55}_{1.22}$ |
| 10 | Object Trk. | ${78.72}_{1.75}$ | ${67.53}_{2.01}$ | ${55.45}_{2.14}$ | ${51.56}_{2.14}$ | ${51.39}_{2.15}$ | ${49.36}_{2.14}$ |
| 10 | Fluent Trk. | ${79.52}_{1.73}$ | ${61.44}_{2.09}$ | ${43.94}_{2.16}$ | ${42.51}_{2.14}$ | ${41.57}_{2.16}$ | ${52.69}_{2.15}$ |
| 10 | State Trk. | ${64.81}_{6.5}$ | ${62.96}_{6.57}$ | ${46.15}_{6.91}$ | ${50.0}_{6.8}$ | ${40.38}_{6.8}$ | ${72.22}_{6.1}$ |
| 10 | Action Exec. | ${57.5}_{5.53}$ | ${51.25}_{5.59}$ | ${48.75}_{5.59}$ | ${50.63}_{5.62}$ | ${49.37}_{5.62}$ | ${55.0}_{5.56}$ |
| 10 | Effects | ${57.5}_{2.76}$ | ${57.81}_{2.76}$ | ${45.94}_{2.79}$ | ${40.84}_{2.79}$ | ${54.05}_{2.84}$ | ${44.38}_{2.78}$ |
| 10 | Num. Reas. | ${52.5}_{5.58}$ | ${48.75}_{5.59}$ | ${53.75}_{5.57}$ | ${48.1}_{5.62}$ | ${51.28}_{5.66}$ | ${55.0}_{5.56}$ |
| 10 | Hallucination | ${86.25}_{3.85}$ | ${77.5}_{4.67}$ | ${43.75}_{5.55}$ | ${57.69}_{5.59}$ | ${59.72}_{5.78}$ | ${55.7}_{5.59}$ |
| 10 | Composite | - | - | - | - | - | - |
| 10 | AVG | ${72.66}_{1.08}$ | ${62.43}_{1.18}$ | ${48.78}_{1.22}$ | ${46.73}_{1.22}$ | ${48.7}_{1.23}$ | ${51.03}_{1.21}$ |
| 19 | Object Trk. | ${75.0}_{1.88}$ | ${68.38}_{2.03}$ | ${56.74}_{2.16}$ | ${51.89}_{2.17}$ | ${50.09}_{2.18}$ | ${47.54}_{2.17}$ |
| 19 | Fluent Trk. | ${74.71}_{1.92}$ | ${59.57}_{2.17}$ | ${42.24}_{2.23}$ | ${42.51}_{2.21}$ | ${38.97}_{2.26}$ | ${50.0}_{2.21}$ |
| 19 | State Trk. | ${65.91}_{7.15}$ | ${59.09}_{7.41}$ | ${45.45}_{7.51}$ | ${38.64}_{7.34}$ | ${43.18}_{7.47}$ | ${68.18}_{7.02}$ |
| 19 | Action Exec. | ${55.0}_{5.56}$ | ${53.75}_{5.57}$ | ${52.5}_{5.58}$ | ${48.72}_{5.66}$ | ${48.1}_{5.62}$ | ${56.25}_{5.55}$ |
| 19 | Effects | ${55.84}_{2.79}$ | ${54.57}_{2.8}$ | ${48.9}_{2.81}$ | ${42.39}_{2.81}$ | ${52.27}_{2.85}$ | ${44.48}_{2.79}$ |
| 19 | Num. Reas. | ${51.25}_{5.59}$ | ${52.5}_{5.58}$ | ${53.75}_{5.57}$ | ${41.89}_{5.74}$ | ${51.28}_{5.66}$ | ${46.25}_{5.57}$ |
| 19 | Hallucination | ${87.5}_{3.7}$ | ${88.75}_{3.53}$ | ${55.0}_{5.56}$ | ${70.51}_{5.16}$ | ${50.65}_{5.7}$ | ${57.69}_{5.59}$ |
| 19 | Composite | - | - | - | - | - | - |
| 19 | AVG | ${69.45}_{1.14}$ | ${62.21}_{1.2}$ | ${50.06}_{1.24}$ | ${47.08}_{1.24}$ | ${47.03}_{1.26}$ | ${49.12}_{1.23}$ |

Table 9: Few-shot-1 performance with ramifications and without obfuscations across all question categories and action-sequence lengths

| Act. Seq. | Question Categories | GPT-4o (few shot 1) | Gemini Pro (few shot 1) | LLaMa2 13b (few shot 1) | LLaMa3 8b (few shot 1) | LLaMa2 7b (few shot 1) | Gemma 7b (few shot 1) |
|---|---|---|---|---|---|---|---|
| 1 | Object Trk. | ${72.76}_{1.94}$ | ${67.24}_{2.05}$ | ${59.2}_{2.31}$ | ${48.19}_{2.18}$ | ${48.12}_{2.35}$ | ${42.53}_{2.16}$ |
| 1 | Fluent Trk. | ${75.74}_{1.84}$ | ${61.03}_{2.09}$ | ${50.98}_{2.47}$ | ${41.93}_{2.13}$ | ${39.71}_{2.42}$ | ${54.63}_{2.14}$ |
| 1 | State Trk. | ${70.91}_{6.12}$ | ${50.91}_{6.74}$ | ${50.0}_{25.0}$ | ${51.02}_{7.14}$ | ${50.0}_{25.0}$ | ${49.09}_{6.74}$ |
| 1 | Action Exec. | ${76.25}_{4.76}$ | ${65.0}_{5.33}$ | ${51.39}_{5.89}$ | ${66.18}_{5.74}$ | ${56.94}_{5.84}$ | ${53.75}_{5.57}$ |
| 1 | Effects | ${60.62}_{2.73}$ | ${54.69}_{2.78}$ | ${51.4}_{2.96}$ | ${45.08}_{3.06}$ | ${53.5}_{2.95}$ | ${44.51}_{2.78}$ |
| 1 | Num. Reas. | ${57.5}_{5.53}$ | ${51.25}_{5.59}$ | ${52.0}_{5.77}$ | ${52.24}_{6.1}$ | ${49.33}_{5.77}$ | ${48.72}_{5.66}$ |
| 1 | Hallucination | ${85.0}_{3.99}$ | ${76.25}_{4.76}$ | ${48.0}_{5.77}$ | ${60.0}_{6.32}$ | ${54.67}_{5.75}$ | ${58.75}_{5.5}$ |
| 1 | Composite | - | - | - | - | - | - |
| 1 | AVG | ${71.38}_{1.1}$ | ${61.88}_{1.18}$ | ${53.68}_{1.35}$ | ${47.01}_{1.26}$ | ${47.63}_{1.35}$ | ${48.63}_{1.22}$ |
| 10 | Object Trk. | ${71.01}_{1.94}$ | ${66.79}_{2.02}$ | ${54.91}_{2.41}$ | ${50.64}_{2.14}$ | ${41.82}_{2.38}$ | ${47.51}_{2.14}$ |
| 10 | Fluent Trk. | ${70.85}_{1.95}$ | ${57.01}_{2.13}$ | ${47.4}_{2.61}$ | ${41.42}_{2.19}$ | ${40.82}_{2.57}$ | ${50.56}_{2.16}$ |
| 10 | State Trk. | ${55.56}_{6.76}$ | ${55.56}_{6.76}$ | - | ${47.92}_{7.21}$ | - | ${57.41}_{6.73}$ |
| 10 | Action Exec. | ${57.5}_{5.53}$ | ${52.5}_{5.58}$ | ${46.97}_{6.14}$ | ${48.57}_{5.97}$ | ${57.58}_{6.08}$ | ${60.0}_{5.48}$ |
| 10 | Effects | ${55.31}_{2.78}$ | ${53.12}_{2.79}$ | ${51.35}_{3.11}$ | ${46.38}_{3.0}$ | ${51.74}_{3.1}$ | ${46.54}_{2.8}$ |
| 10 | Num. Reas. | ${55.0}_{5.56}$ | ${48.75}_{5.59}$ | ${50.0}_{6.45}$ | ${42.65}_{6.0}$ | ${46.67}_{6.44}$ | ${50.0}_{5.59}$ |
| 10 | Hallucination | ${81.25}_{4.36}$ | ${73.75}_{4.92}$ | ${47.76}_{6.1}$ | ${66.2}_{5.61}$ | ${58.21}_{6.03}$ | ${55.7}_{5.59}$ |
| 10 | Composite | - | - | - | - | - | - |
| 10 | AVG | ${66.61}_{1.14}$ | ${59.55}_{1.19}$ | ${50.92}_{1.42}$ | ${47.13}_{1.25}$ | ${45.54}_{1.41}$ | ${49.7}_{1.22}$ |
| 19 | Object Trk. | ${66.1}_{2.06}$ | ${65.91}_{2.06}$ | ${56.6}_{2.68}$ | ${49.24}_{2.18}$ | ${40.47}_{2.66}$ | ${45.63}_{2.17}$ |
| 19 | Fluent Trk. | ${63.23}_{2.13}$ | ${53.11}_{2.2}$ | ${47.56}_{3.33}$ | ${39.39}_{2.27}$ | ${36.0}_{3.2}$ | ${50.69}_{2.22}$ |
| 19 | State Trk. | ${56.82}_{7.47}$ | ${47.73}_{7.53}$ | - | ${67.65}_{8.02}$ | - | ${52.27}_{7.53}$ |
| 19 | Action Exec. | ${56.25}_{5.55}$ | ${52.5}_{5.58}$ | ${62.5}_{6.99}$ | ${43.28}_{6.05}$ | ${43.75}_{7.16}$ | ${50.0}_{5.66}$ |
| 19 | Effects | ${52.68}_{2.8}$ | ${50.16}_{2.81}$ | ${55.21}_{3.59}$ | ${43.85}_{3.08}$ | ${53.65}_{3.6}$ | ${50.79}_{2.82}$ |
| 19 | Num. Reas. | ${47.5}_{5.58}$ | ${51.25}_{5.59}$ | ${45.76}_{6.49}$ | ${45.31}_{6.22}$ | ${54.24}_{6.49}$ | ${46.25}_{5.57}$ |
| 19 | Hallucination | ${81.25}_{4.36}$ | ${71.25}_{5.06}$ | ${52.17}_{7.37}$ | ${53.73}_{6.09}$ | ${50.0}_{7.37}$ | ${58.75}_{5.5}$ |
| 19 | Composite | - | - | - | - | - | - |
| 19 | AVG | ${61.72}_{1.2}$ | ${57.27}_{1.22}$ | ${53.46}_{1.65}$ | ${45.41}_{1.29}$ | ${43.69}_{1.64}$ | ${49.26}_{1.24}$ |

Table 10: Few-shot-1 performance with obfuscation and without ramification across all question categories and action-sequence lengths

| Plan Len. | Question Categories | GPT-4o (few shot 1) | Gemini Pro (few shot 1) | LLaMa2 13b (few shot 1) | LLaMa3 8b (few shot 1) | LLaMa2 7b (few shot 1) | Gemma 7b (few shot 1) |
|---|---|---|---|---|---|---|---|
| 1 | Object Trk. | ${74.86}_{1.89}$ | ${68.0}_{2.04}$ | ${59.38}_{2.32}$ | ${49.33}_{2.18}$ | ${44.64}_{2.35}$ | ${45.14}_{2.17}$ |
| 1 | Fluent Trk. | ${77.39}_{1.79}$ | ${61.4}_{2.09}$ | ${47.68}_{2.54}$ | ${42.3}_{2.18}$ | ${40.98}_{2.5}$ | ${55.58}_{2.14}$ |
| 1 | State Trk. | ${74.07}_{5.96}$ | ${50.91}_{6.74}$ | ${25.0}_{21.65}$ | ${54.35}_{7.34}$ | ${50.0}_{25.0}$ | ${54.55}_{6.71}$ |
| 1 | Action Exec. | ${77.5}_{4.67}$ | ${67.5}_{5.24}$ | ${50.0}_{5.98}$ | ${55.93}_{6.46}$ | ${62.86}_{5.78}$ | ${53.75}_{5.57}$ |
| 1 | Effects | ${60.63}_{2.75}$ | ${55.0}_{2.78}$ | ${50.7}_{2.96}$ | ${47.37}_{3.31}$ | ${51.75}_{2.95}$ | ${40.94}_{2.75}$ |
| 1 | Num. Reas. | ${60.53}_{5.61}$ | ${48.75}_{5.59}$ | ${52.0}_{5.77}$ | ${47.06}_{6.99}$ | ${56.0}_{5.73}$ | ${48.75}_{5.59}$ |
| 1 | Hallucination | ${76.62}_{4.82}$ | ${78.75}_{4.57}$ | ${50.67}_{5.77}$ | ${61.54}_{6.75}$ | ${53.33}_{5.76}$ | ${60.0}_{5.48}$ |
| 1 | Composite | - | - | - | - | - | - |
| 1 | AVG | ${72.53}_{1.09}$ | ${62.41}_{1.18}$ | ${52.67}_{1.36}$ | ${47.35}_{1.3}$ | ${47.18}_{1.36}$ | ${49.28}_{1.22}$ |
| 10 | Object Trk. | ${70.64}_{1.95}$ | ${66.42}_{2.02}$ | ${55.45}_{2.39}$ | ${45.87}_{2.13}$ | ${42.69}_{2.38}$ | ${48.81}_{2.14}$ |
| 10 | Fluent Trk. | ${74.17}_{1.88}$ | ${57.01}_{2.13}$ | ${45.71}_{2.66}$ | ${44.54}_{2.26}$ | ${40.0}_{2.62}$ | ${50.37}_{2.16}$ |
| 10 | State Trk. | ${55.77}_{6.89}$ | ${53.7}_{6.79}$ | - | ${28.85}_{6.28}$ | - | ${53.7}_{6.79}$ |
| 10 | Action Exec. | ${59.74}_{5.59}$ | ${50.0}_{5.59}$ | ${43.94}_{6.11}$ | ${47.46}_{6.5}$ | ${53.03}_{6.14}$ | ${48.75}_{5.59}$ |
| 10 | Effects | ${55.52}_{2.83}$ | ${54.69}_{2.78}$ | ${49.81}_{3.11}$ | ${46.26}_{3.41}$ | ${54.05}_{3.1}$ | ${49.38}_{2.79}$ |
| 10 | Num. Reas. | ${56.41}_{5.61}$ | ${48.75}_{5.59}$ | ${55.0}_{6.42}$ | ${42.0}_{6.98}$ | ${41.67}_{6.36}$ | ${57.5}_{5.53}$ |
| 10 | Hallucination | ${79.22}_{4.62}$ | ${76.25}_{4.76}$ | ${53.73}_{6.09}$ | ${69.81}_{6.31}$ | ${53.73}_{6.09}$ | ${57.5}_{5.53}$ |
| 10 | Composite | - | - | - | - | - | - |
| 10 | AVG | ${67.78}_{1.14}$ | ${59.67}_{1.19}$ | ${50.77}_{1.42}$ | ${45.68}_{1.3}$ | ${45.42}_{1.42}$ | ${50.38}_{1.21}$ |
| 19 | Object Trk. | ${66.29}_{2.06}$ | ${68.94}_{2.01}$ | ${54.84}_{2.69}$ | ${50.57}_{2.18}$ | ${43.99}_{2.69}$ | ${45.08}_{2.17}$ |
| 19 | Fluent Trk. | ${66.15}_{2.09}$ | ${56.03}_{2.19}$ | ${47.03}_{3.25}$ | ${43.84}_{2.42}$ | ${33.05}_{3.06}$ | ${50.0}_{2.22}$ |
| 19 | State Trk. | ${60.47}_{7.46}$ | ${47.73}_{7.53}$ | - | ${34.62}_{9.33}$ | - | ${59.09}_{7.41}$ |
| 19 | Action Exec. | ${56.76}_{5.76}$ | ${51.25}_{5.59}$ | ${49.06}_{6.87}$ | ${43.14}_{6.94}$ | ${49.06}_{6.87}$ | ${51.25}_{5.59}$ |
| 19 | Effects | ${51.77}_{2.83}$ | ${50.47}_{2.81}$ | ${49.49}_{3.55}$ | ${49.77}_{3.41}$ | ${50.0}_{3.55}$ | ${48.58}_{2.81}$ |
| 19 | Num. Reas. | ${50.65}_{5.7}$ | ${50.0}_{5.59}$ | ${52.54}_{6.5}$ | ${59.65}_{6.5}$ | ${49.15}_{6.51}$ | ${41.25}_{5.5}$ |
| 19 | Hallucination | ${76.25}_{4.76}$ | ${67.5}_{5.24}$ | ${65.31}_{6.8}$ | ${61.22}_{6.96}$ | ${57.14}_{7.07}$ | ${57.5}_{5.53}$ |
| 19 | Composite | - | - | - | - | - | - |
| 19 | AVG | ${62.63}_{1.2}$ | ${58.92}_{1.21}$ | ${51.82}_{1.63}$ | ${48.52}_{1.36}$ | ${43.8}_{1.62}$ | ${48.38}_{1.24}$ |

Table 11: Few-shot-1 performance with obfuscation and ramification across all question categories and action-sequence lengths

C.2 Performance on Free-Answer questions evaluated using Fine-tuned RoBERTa Classifier

| Plan Len. | Question Categories | GPT-4o (zero shot) | Gemini Pro (zero shot) | LLaMa2 13b (zero shot) | LLaMa3 8b (zero shot) | LLaMa2 7b (zero shot) | Gemma 7b (zero shot) |
|---|---|---|---|---|---|---|---|
| 1 | Object Trk. | ${50.0}_{7.91}$ | ${32.5}_{7.41}$ | ${0.0}_{0.0}$ | ${27.5}_{7.06}$ | ${17.5}_{6.01}$ | ${0.0}_{0.0}$ |
| 1 | Fluent Trk. | ${31.25}_{8.19}$ | ${0.0}_{0.0}$ | ${3.12}_{3.08}$ | ${3.12}_{3.08}$ | ${0.0}_{0.0}$ | ${3.12}_{3.08}$ |
| 1 | State Trk. | ${16.13}_{6.61}$ | ${21.62}_{6.77}$ | ${0.0}_{0.0}$ | ${2.56}_{2.53}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ |
| 1 | Action Exec. | ${57.5}_{7.82}$ | ${38.46}_{7.79}$ | ${0.0}_{0.0}$ | ${22.5}_{6.6}$ | ${17.5}_{6.01}$ | ${0.0}_{0.0}$ |
| 1 | Effects | ${6.06}_{4.15}$ | ${13.16}_{5.48}$ | ${0.0}_{0.0}$ | ${5.0}_{3.45}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ |
| 1 | Num. Reas. | ${25.0}_{6.85}$ | ${17.5}_{6.01}$ | ${0.0}_{0.0}$ | ${7.5}_{4.16}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ |
| 1 | Hallucination | ${57.78}_{7.36}$ | ${26.67}_{6.59}$ | ${4.44}_{3.07}$ | ${8.89}_{4.24}$ | ${28.89}_{6.76}$ | ${4.44}_{3.07}$ |
| 1 | Composite | ${55.0}_{7.87}$ | ${17.95}_{6.15}$ | ${0.0}_{0.0}$ | ${37.5}_{7.65}$ | ${7.5}_{4.16}$ | ${22.5}_{6.6}$ |
| 1 | AVG | ${39.2}_{2.81}$ | ${21.61}_{2.34}$ | ${0.95}_{0.55}$ | ${11.23}_{1.9}$ | ${9.49}_{1.65}$ | ${3.8}_{1.08}$ |
| 10 | Object Trk. | ${55.0}_{7.87}$ | ${47.5}_{7.9}$ | ${2.5}_{2.47}$ | ${40.0}_{7.75}$ | ${17.5}_{6.01}$ | ${2.5}_{2.47}$ |
| 10 | Fluent Trk. | ${17.65}_{5.34}$ | ${9.8}_{4.16}$ | ${1.96}_{1.94}$ | ${9.8}_{4.16}$ | ${3.92}_{2.72}$ | ${1.96}_{1.94}$ |
| 10 | State Trk. | ${7.14}_{4.87}$ | ${19.35}_{7.1}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${2.94}_{2.9}$ | ${0.0}_{0.0}$ |
| 10 | Action Exec. | ${47.5}_{7.9}$ | ${42.42}_{8.6}$ | ${0.0}_{0.0}$ | ${37.5}_{7.65}$ | ${17.5}_{6.01}$ | ${0.0}_{0.0}$ |
| 10 | Effects | ${3.12}_{3.08}$ | ${7.14}_{4.87}$ | ${0.0}_{0.0}$ | ${2.63}_{2.6}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ |
| 10 | Num. Reas. | ${22.5}_{6.6}$ | ${17.5}_{6.01}$ | ${5.0}_{3.45}$ | ${10.0}_{4.74}$ | ${2.5}_{2.47}$ | ${5.0}_{3.45}$ |
| 10 | Hallucination | ${63.16}_{7.83}$ | ${23.68}_{6.9}$ | ${0.0}_{0.0}$ | ${13.16}_{5.48}$ | ${28.95}_{7.36}$ | ${0.0}_{0.0}$ |
| 10 | Composite | ${65.0}_{7.54}$ | ${22.22}_{6.93}$ | ${0.0}_{0.0}$ | ${15.0}_{5.65}$ | ${12.5}_{5.23}$ | ${12.5}_{5.23}$ |
| 10 | AVG | ${36.25}_{2.73}$ | ${23.57}_{2.46}$ | ${1.25}_{0.62}$ | ${16.37}_{2.21}$ | ${10.59}_{1.72}$ | ${2.8}_{0.92}$ |
| 19 | Object Trk. | ${55.0}_{7.87}$ | ${40.0}_{7.75}$ | ${0.0}_{0.0}$ | ${45.0}_{7.87}$ | ${17.5}_{6.01}$ | ${0.0}_{0.0}$ |
| 19 | Fluent Trk. | ${8.51}_{4.07}$ | ${2.13}_{2.1}$ | ${0.0}_{0.0}$ | ${2.13}_{2.1}$ | ${12.77}_{4.87}$ | ${0.0}_{0.0}$ |
| 19 | State Trk. | ${19.05}_{8.57}$ | ${27.78}_{10.56}$ | ${0.0}_{0.0}$ | ${7.14}_{4.87}$ | ${3.57}_{3.51}$ | ${0.0}_{0.0}$ |
| 19 | Action Exec. | ${51.35}_{8.22}$ | ${48.48}_{8.7}$ | ${0.0}_{0.0}$ | ${38.46}_{7.79}$ | ${15.38}_{5.78}$ | ${0.0}_{0.0}$ |
| 19 | Effects | ${8.7}_{5.88}$ | ${8.7}_{5.88}$ | ${0.0}_{0.0}$ | ${6.67}_{4.55}$ | ${3.33}_{3.28}$ | ${0.0}_{0.0}$ |
| 19 | Num. Reas. | ${12.5}_{5.23}$ | ${7.5}_{4.16}$ | ${2.5}_{2.47}$ | ${7.5}_{4.16}$ | ${2.5}_{2.47}$ | ${2.5}_{2.47}$ |
| 19 | Hallucination | ${48.94}_{7.29}$ | ${27.66}_{6.52}$ | ${2.13}_{2.1}$ | ${17.02}_{5.48}$ | ${19.15}_{5.74}$ | ${2.13}_{2.1}$ |
| 19 | Composite | ${60.0}_{7.75}$ | ${19.44}_{6.6}$ | ${2.5}_{2.47}$ | ${20.0}_{6.32}$ | ${17.5}_{6.01}$ | ${15.0}_{5.65}$ |
| 19 | AVG | ${34.92}_{2.78}$ | ${22.18}_{2.47}$ | ${0.96}_{0.55}$ | ${18.08}_{2.34}$ | ${12.22}_{1.86}$ | ${2.57}_{0.9}$ |

Table 12: Zero-shot performance without obfuscations and ramification across all question categories and action-sequence lengths

| Plan Len. | Question Categories | GPT-4o (few shot 1) | Gemini Pro (few shot 1) | LLaMa2 13b (few shot 1) | LLaMa3 8b (few shot 1) | LLaMa2 7b (few shot 1) | Gemma 7b (few shot 1) |
|---|---|---|---|---|---|---|---|
| 1 | Object Trk. | ${80.0}_{6.32}$ | ${48.72}_{8.0}$ | ${30.0}_{7.25}$ | ${55.0}_{7.87}$ | ${30.0}_{7.25}$ | ${32.5}_{7.41}$ |
| 1 | Fluent Trk. | ${12.9}_{6.02}$ | ${6.25}_{4.28}$ | ${0.0}_{0.0}$ | ${9.38}_{5.15}$ | ${9.38}_{5.15}$ | ${3.12}_{3.08}$ |
| 1 | State Trk. | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ |
| 1 | Action Exec. | ${65.0}_{7.54}$ | ${45.0}_{7.87}$ | ${22.5}_{6.6}$ | ${35.0}_{7.54}$ | ${25.0}_{6.85}$ | ${20.0}_{6.32}$ |
| 1 | Effects | ${5.0}_{4.87}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ |
| 1 | Num. Reas. | ${17.5}_{6.01}$ | ${10.0}_{4.74}$ | ${10.0}_{4.74}$ | ${2.5}_{2.47}$ | ${5.0}_{3.45}$ | ${7.5}_{4.16}$ |
| 1 | Hallucination | ${61.36}_{7.34}$ | ${48.89}_{7.45}$ | ${37.78}_{7.23}$ | ${31.11}_{6.9}$ | ${31.11}_{6.9}$ | ${31.82}_{7.02}$ |
| 1 | Composite | ${47.37}_{8.1}$ | ${47.06}_{8.56}$ | ${28.21}_{7.21}$ | ${10.0}_{4.74}$ | ${30.77}_{7.39}$ | ${47.5}_{7.9}$ |
| 1 | AVG | ${41.28}_{3.21}$ | ${28.38}_{2.98}$ | ${15.22}_{2.16}$ | ${19.57}_{2.39}$ | ${14.86}_{2.14}$ | ${14.18}_{2.1}$ |
| 10 | Object Trk. | ${74.36}_{6.99}$ | ${42.5}_{7.82}$ | ${35.0}_{7.54}$ | ${50.0}_{7.91}$ | ${30.0}_{7.25}$ | ${22.5}_{6.6}$ |
| 10 | Fluent Trk. | ${26.0}_{6.2}$ | ${24.49}_{6.14}$ | ${9.8}_{4.16}$ | ${17.65}_{5.34}$ | ${13.73}_{4.82}$ | ${10.2}_{4.32}$ |
| 10 | State Trk. | ${12.5}_{8.27}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ |
| 10 | Action Exec. | ${45.0}_{7.87}$ | ${51.43}_{8.45}$ | ${22.5}_{6.6}$ | ${45.0}_{7.87}$ | ${25.0}_{6.85}$ | ${17.5}_{6.01}$ |
| 10 | Effects | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${2.63}_{2.6}$ | ${0.0}_{0.0}$ |
| 10 | Num. Reas. | ${15.0}_{5.65}$ | ${10.0}_{4.74}$ | ${0.0}_{0.0}$ | ${5.0}_{3.45}$ | ${0.0}_{0.0}$ | ${12.5}_{5.23}$ |
| 10 | Hallucination | ${45.95}_{8.19}$ | ${50.0}_{8.11}$ | ${21.05}_{6.61}$ | ${21.05}_{6.61}$ | ${42.11}_{8.01}$ | ${43.24}_{8.14}$ |
| 10 | Composite | ${46.15}_{7.98}$ | ${36.36}_{8.37}$ | ${28.21}_{7.21}$ | ${17.95}_{6.15}$ | ${43.59}_{7.94}$ | ${53.85}_{7.98}$ |
| 10 | AVG | ${35.71}_{3.11}$ | ${30.43}_{3.03}$ | ${12.81}_{1.99}$ | ${20.28}_{2.4}$ | ${16.37}_{2.21}$ | ${15.11}_{2.15}$ |
| 19 | Object Trk. | ${79.49}_{6.47}$ | ${43.59}_{7.94}$ | ${52.5}_{7.9}$ | ${57.5}_{7.82}$ | ${40.0}_{7.75}$ | ${37.5}_{7.65}$ |
| 19 | Fluent Trk. | ${9.09}_{4.33}$ | ${11.11}_{4.68}$ | ${4.26}_{2.94}$ | ${4.26}_{2.94}$ | ${10.64}_{4.5}$ | ${12.77}_{4.87}$ |
| 19 | State Trk. | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ |
| 19 | Action Exec. | ${54.29}_{8.42}$ | ${59.38}_{8.68}$ | ${7.69}_{4.27}$ | ${33.33}_{7.55}$ | ${17.95}_{6.15}$ | ${17.95}_{6.15}$ |
| 19 | Effects | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ |
| 19 | Num. Reas. | ${13.51}_{5.62}$ | ${7.5}_{4.16}$ | ${2.5}_{2.47}$ | ${5.0}_{3.45}$ | ${2.5}_{2.47}$ | ${12.5}_{5.23}$ |
| 19 | Hallucination | ${40.43}_{7.16}$ | ${46.81}_{7.28}$ | ${12.77}_{4.87}$ | ${21.28}_{5.97}$ | ${17.02}_{5.48}$ | ${17.39}_{5.59}$ |
| 19 | Composite | ${47.37}_{8.1}$ | ${45.71}_{8.42}$ | ${43.59}_{7.94}$ | ${12.5}_{5.23}$ | ${43.59}_{7.94}$ | ${47.5}_{7.9}$ |
| 19 | AVG | ${34.36}_{3.15}$ | ${30.41}_{3.12}$ | ${12.18}_{1.99}$ | ${18.45}_{2.36}$ | ${13.65}_{2.09}$ | ${15.19}_{2.18}$ |

Table 13: Few-shot-1 performance without obfuscations and ramifications across all question categories and action-sequence lengths

| Plan Len. | Question Categories | GPT-4o (few shot 1) | Gemini Pro (few shot 1) | LLaMa2 13b (few shot 1) | LLaMa3 8b (few shot 1) | LLaMa2 7b (few shot 1) | Gemma 7b (few shot 1) |
|---|---|---|---|---|---|---|---|
| 1 | Object Trk. | ${75.0}_{6.85}$ | ${47.37}_{8.1}$ | ${25.0}_{6.85}$ | ${47.5}_{7.9}$ | ${27.5}_{7.06}$ | ${27.5}_{7.06}$ |
| 1 | Fluent Trk. | ${21.88}_{7.31}$ | ${9.68}_{5.31}$ | ${3.12}_{3.08}$ | ${6.25}_{4.28}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ |
| 1 | State Trk. | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ |
| 1 | Action Exec. | ${48.72}_{8.0}$ | ${34.48}_{8.83}$ | ${18.75}_{6.9}$ | ${32.26}_{8.4}$ | ${29.03}_{8.15}$ | ${15.62}_{6.42}$ |
| 1 | Effects | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ |
| 1 | Num. Reas. | ${27.5}_{7.06}$ | ${10.26}_{4.86}$ | ${7.5}_{4.16}$ | ${10.53}_{4.98}$ | ${5.41}_{3.72}$ | ${17.5}_{6.01}$ |
| 1 | Hallucination | ${60.0}_{7.3}$ | ${48.89}_{7.45}$ | ${17.78}_{5.7}$ | ${27.27}_{6.71}$ | ${15.38}_{5.78}$ | ${28.89}_{6.76}$ |
| 1 | Composite | - | - | - | - | - | - |
| 1 | AVG | ${40.0}_{3.2}$ | ${27.01}_{3.06}$ | ${10.45}_{1.87}$ | ${17.8}_{2.35}$ | ${11.16}_{1.99}$ | ${13.43}_{2.08}$ |
| 10 | Object Trk. | ${67.5}_{7.41}$ | ${50.0}_{8.11}$ | ${37.5}_{7.65}$ | ${50.0}_{7.91}$ | ${45.0}_{7.87}$ | ${22.5}_{6.6}$ |
| 10 | Fluent Trk. | ${26.0}_{6.2}$ | ${12.0}_{4.6}$ | ${7.84}_{3.76}$ | ${13.73}_{4.82}$ | ${14.58}_{5.09}$ | ${13.73}_{4.82}$ |
| 10 | State Trk. | ${11.76}_{7.81}$ | ${0.0}_{0.0}$ | ${3.03}_{2.98}$ | ${2.94}_{2.9}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ |
| 10 | Action Exec. | ${39.47}_{7.93}$ | ${48.39}_{8.98}$ | ${9.09}_{5.0}$ | ${42.42}_{8.6}$ | ${12.5}_{5.85}$ | ${27.27}_{7.75}$ |
| 10 | Effects | ${8.7}_{5.88}$ | ${0.0}_{0.0}$ | ${5.26}_{3.62}$ | ${0.0}_{0.0}$ | ${2.7}_{2.67}$ | ${0.0}_{0.0}$ |
| 10 | Num. Reas. | ${20.0}_{6.32}$ | ${12.5}_{5.23}$ | ${2.5}_{2.47}$ | ${7.69}_{4.27}$ | ${0.0}_{0.0}$ | ${15.0}_{5.65}$ |
| 10 | Hallucination | ${44.74}_{8.07}$ | ${55.26}_{8.07}$ | ${16.22}_{6.06}$ | ${31.58}_{7.54}$ | ${35.29}_{8.2}$ | ${35.14}_{7.85}$ |
| 10 | Composite | - | - | - | - | - | - |
| 10 | AVG | ${34.15}_{3.02}$ | ${28.45}_{2.96}$ | ${11.76}_{1.95}$ | ${20.88}_{2.46}$ | ${16.15}_{2.28}$ | ${16.12}_{2.23}$ |
| 19 | Object Trk. | ${82.5}_{6.01}$ | ${48.72}_{8.0}$ | ${42.5}_{7.82}$ | ${65.0}_{7.54}$ | ${40.0}_{7.75}$ | ${42.5}_{7.82}$ |
| 19 | Fluent Trk. | ${24.44}_{6.41}$ | ${17.78}_{5.7}$ | ${12.77}_{4.87}$ | ${6.38}_{3.57}$ | ${19.57}_{5.85}$ | ${21.28}_{5.97}$ |
| 19 | State Trk. | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${3.7}_{3.63}$ | ${0.0}_{0.0}$ |
| 19 | Action Exec. | ${44.74}_{8.07}$ | ${57.58}_{8.6}$ | ${25.0}_{7.22}$ | ${38.89}_{8.12}$ | ${34.29}_{8.02}$ | ${19.44}_{6.6}$ |
| 19 | Effects | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ |
| 19 | Num. Reas. | ${22.5}_{6.6}$ | ${12.5}_{5.23}$ | ${5.13}_{3.53}$ | ${12.82}_{5.35}$ | ${2.7}_{2.67}$ | ${15.0}_{5.65}$ |
| 19 | Hallucination | ${44.68}_{7.25}$ | ${48.94}_{7.29}$ | ${19.15}_{5.74}$ | ${20.0}_{5.96}$ | ${22.22}_{6.2}$ | ${21.28}_{5.97}$ |
| 19 | Composite | - | - | - | - | - | - |
| 19 | AVG | ${38.24}_{3.15}$ | ${32.46}_{3.1}$ | ${16.17}_{2.26}$ | ${21.51}_{2.52}$ | ${18.85}_{2.43}$ | ${18.66}_{2.38}$ |

Table 14: Few-shot-1 performance with ramification and without obfuscations across all question categories and action-sequence lengths

| Plan Len. | Question Categories | GPT-4o (few shot 1) | Gemini Pro (few shot 1) | LLaMa2 13b (few shot 1) | LLaMa3 8b (few shot 1) | LLaMa2 7b (few shot 1) | Gemma 7b (few shot 1) |
|---|---|---|---|---|---|---|---|
| 1 | Object Trk. | ${47.5}_{7.9}$ | ${35.0}_{7.54}$ | ${17.65}_{6.54}$ | ${32.5}_{7.41}$ | ${26.47}_{7.57}$ | ${28.21}_{7.21}$ |
| 1 | Fluent Trk. | ${24.14}_{7.95}$ | ${16.67}_{6.8}$ | ${3.33}_{3.28}$ | ${3.57}_{3.51}$ | ${6.67}_{4.55}$ | ${3.23}_{3.17}$ |
| 1 | State Trk. | ${100.0}_{0.0}$ | ${100.0}_{0.0}$ | ${6.67}_{6.44}$ | ${20.0}_{12.65}$ | ${6.67}_{6.44}$ | ${10.53}_{7.04}$ |
| 1 | Action Exec. | ${47.22}_{8.32}$ | ${52.94}_{8.56}$ | ${16.13}_{6.61}$ | ${25.81}_{7.86}$ | ${21.21}_{7.12}$ | ${28.95}_{7.36}$ |
| 1 | Effects | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${0.0}_{0.0}$ | ${7.69}_{7.39}$ | ${4.76}_{4.65}$ |
| 1 | Num. Reas. | ${12.5}_{5.23}$ | ${15.0}_{5.65}$ | ${13.89}_{5.76}$ | ${13.16}_{5.48}$ | ${2.78}_{2.74}$ | ${12.5}_{5.23}$ |
| 1 | Hallucination | ${52.27}_{7.53}$ | ${47.73}_{7.53}$ | ${23.08}_{6.75}$ | ${45.71}_{8.42}$ | ${30.77}_{7.39}$ | ${33.33}_{7.03}$ |
| 1 | Composite | - | - | - | - | - | - |
| 1 | AVG | ${37.24}_{3.45}$ | ${34.38}_{3.43}$ | ${13.71}_{2.45}$ | ${23.32}_{3.04}$ | ${16.5}_{2.62}$ | ${19.74}_{2.61}$ |
| 10 | Object Trk. | ${60.0}_{7.75}$ | ${42.5}_{7.82}$ | ${30.3}_{8.0}$ | ${47.5}_{7.9}$ | ${24.24}_{7.46}$ | ${15.0}_{5.65}$ |
| 10 | Fluent Trk. | ${26.53}_{6.31}$ | ${16.67}_{5.38}$ | ${12.5}_{5.23}$ | ${21.74}_{6.08}$ | ${12.5}_{5.23}$ | ${18.37}_{5.53}$ |
| 10 | State Trk. | ${100.0}_{0.0}$ | ${50.0}_{35.36}$ | ${9.09}_{8.67}$ | ${8.33}_{7.98}$ | ${7.69}_{7.39}$ | ${16.67}_{6.8}$ |
| 10 | Action Exec. | ${46.67}_{9.11}$ | ${53.57}_{9.42}$ | ${22.73}_{8.93}$ | ${46.15}_{9.78}$ | ${39.13}_{10.18}$ | ${14.71}_{6.07}$ |
| 10 | Effects | - | ${0.0}_{0.0}$ | ${7.69}_{7.39}$ | ${8.33}_{7.98}$ | ${0.0}_{0.0}$ | ${6.67}_{4.55}$ |
| 10 | Num. Reas. | ${23.08}_{6.75}$ | ${17.5}_{6.01}$ | ${0.0}_{0.0}$ | ${10.0}_{5.48}$ | ${2.94}_{2.9}$ | ${12.82}_{5.35}$ |
| 10 | Hallucination | ${38.24}_{8.33}$ | ${52.63}_{8.1}$ | ${24.0}_{8.54}$ | ${40.0}_{8.94}$ | ${32.0}_{9.33}$ | ${42.11}_{8.01}$ |
| 10 | Composite | - | - | - | - | - | - |
| 10 | AVG | ${38.34}_{3.5}$ | ${34.52}_{3.39}$ | ${15.73}_{2.73}$ | ${29.59}_{3.26}$ | ${17.58}_{2.82}$ | ${18.46}_{2.41}$ |
| 19 | Object Trk. | ${55.0}_{7.87}$ | ${33.33}_{7.55}$ | ${25.93}_{8.43}$ | ${55.0}_{7.87}$ | ${14.81}_{6.84}$ | ${20.51}_{6.47}$ |
| 19 | Fluent Trk. | ${8.51}_{4.07}$ | ${7.14}_{3.97}$ | ${7.69}_{5.23}$ | ${4.88}_{3.36}$ | ${11.54}_{6.27}$ | ${13.04}_{4.97}$ |
| 19 | State Trk. | - | ${0.0}_{0.0}$ | ${20.0}_{17.89}$ | ${12.5}_{11.69}$ | ${0.0}_{0.0}$ | ${18.18}_{8.22}$ |
| 19 | Action Exec. | ${17.14}_{6.37}$ | ${50.0}_{8.84}$ | ${16.0}_{7.33}$ | ${22.58}_{7.51}$ | ${16.0}_{7.33}$ | ${13.89}_{5.76}$ |
| 19 | Effects | ${0.0}_{0.0}$ | ${100.0}_{0.0}$ | ${0.0}_{0.0}$ | ${9.09}_{8.67}$ | ${0.0}_{0.0}$ | ${10.53}_{7.04}$ |
| 19 | Num. Reas. | ${18.92}_{6.44}$ | ${15.0}_{5.65}$ | ${0.0}_{0.0}$ | ${3.03}_{2.98}$ | ${3.57}_{3.51}$ | ${15.0}_{5.65}$ |
| 19 | Hallucination | ${35.56}_{7.14}$ | ${39.13}_{7.2}$ | ${14.29}_{7.64}$ | ${10.53}_{4.98}$ | ${23.81}_{9.29}$ | ${21.28}_{5.97}$ |
| 19 | Composite | - | - | - | - | - | - |
| 19 | AVG | ${26.83}_{3.09}$ | ${28.71}_{3.18}$ | ${12.5}_{2.84}$ | ${18.81}_{2.75}$ | ${12.5}_{2.84}$ | ${16.47}_{2.35}$ |

Table 15: Few-shot-1 performance with obfuscation and without ramification across all question categories and action-sequence lengths

| Plan Len. | Question Categories | GPT-4o (few shot 1) | Gemini Pro (few shot 1) | LLaMa2 13b (few shot 1) | LLaMa3 8b (few shot 1) | LLaMa2 7b (few shot 1) | Gemma 7b (few shot 1) |
|---|---|---|---|---|---|---|---|
| 1 | Object Trk. | ${55.0}_{7.87}$ | ${27.5}_{7.06}$ | ${17.65}_{6.54}$ | ${30.0}_{7.25}$ | ${14.71}_{6.07}$ | ${17.5}_{6.01}$ |
| 1 | Fluent Trk. | ${16.67}_{6.8}$ | ${10.34}_{5.66}$ | ${0.0}_{0.0}$ | ${3.33}_{3.28}$ | ${0.0}_{0.0}$ | ${3.23}_{3.17}$ |
| 1 | State Trk. | ${100.0}_{0.0}$ | ${100.0}_{0.0}$ | ${14.29}_{9.35}$ | ${10.0}_{9.49}$ | ${25.0}_{10.83}$ | ${3.85}_{3.77}$ |
| 1 | Action Exec. | ${54.29}_{8.42}$ | ${45.16}_{8.94}$ | ${29.63}_{8.79}$ | ${31.82}_{9.93}$ | ${18.52}_{7.48}$ | ${32.26}_{8.4}$ |
| 1 | Effects | ${40.0}_{21.91}$ | ${75.0}_{21.65}$ | ${14.29}_{9.35}$ | ${0.0}_{0.0}$ | ${6.67}_{6.44}$ | ${8.33}_{5.64}$ |
| 1 | Num. Reas. | ${18.42}_{6.29}$ | ${15.0}_{5.65}$ | ${14.71}_{6.07}$ | ${10.0}_{5.48}$ | ${5.88}_{4.04}$ | ${20.0}_{6.32}$ |
| 1 | Hallucination | ${46.67}_{7.44}$ | ${44.44}_{7.41}$ | ${13.89}_{5.76}$ | ${33.33}_{8.61}$ | ${22.22}_{6.93}$ | ${28.89}_{6.76}$ |
| 1 | Composite | - | - | - | - | - | - |
| 1 | AVG | ${39.69}_{3.51}$ | ${30.53}_{3.34}$ | ${14.81}_{2.58}$ | ${19.65}_{3.02}$ | ${13.02}_{2.43}$ | ${17.72}_{2.48}$ |
| 10 | Object Trk. | ${62.5}_{7.65}$ | ${42.5}_{7.82}$ | ${30.3}_{8.0}$ | ${47.5}_{7.9}$ | ${33.33}_{8.21}$ | ${12.5}_{5.23}$ |
| 10 | Fluent Trk. | ${24.0}_{6.04}$ | ${8.33}_{3.99}$ | ${17.07}_{5.88}$ | ${16.67}_{5.38}$ | ${12.2}_{5.11}$ | ${16.0}_{5.18}$ |
| 10 | State Trk. | ${100.0}_{0.0}$ | ${50.0}_{35.36}$ | ${20.0}_{12.65}$ | ${16.67}_{10.76}$ | ${16.67}_{10.76}$ | ${17.24}_{7.01}$ |
| 10 | Action Exec. | ${31.03}_{8.59}$ | ${55.56}_{9.56}$ | ${21.74}_{8.6}$ | ${23.81}_{9.29}$ | ${12.5}_{6.75}$ | ${16.67}_{6.8}$ |
| 10 | Effects | ${57.14}_{18.7}$ | ${66.67}_{19.25}$ | ${0.0}_{0.0}$ | ${25.0}_{12.5}$ | ${0.0}_{0.0}$ | ${10.71}_{5.85}$ |
| 10 | Num. Reas. | ${10.53}_{4.98}$ | ${12.5}_{5.23}$ | ${2.94}_{2.9}$ | ${3.57}_{3.51}$ | ${0.0}_{0.0}$ | ${12.5}_{5.23}$ |
| 10 | Hallucination | ${45.45}_{8.67}$ | ${51.35}_{8.22}$ | ${22.73}_{8.93}$ | ${28.0}_{8.98}$ | ${27.27}_{9.5}$ | ${35.14}_{7.85}$ |
| 10 | Composite | - | - | - | - | - | - |
| 10 | AVG | ${35.68}_{3.4}$ | ${32.5}_{3.31}$ | ${16.95}_{2.82}$ | ${24.19}_{3.14}$ | ${15.0}_{2.66}$ | ${17.32}_{2.37}$ |
| 19 | Object Trk. | ${52.5}_{7.9}$ | ${37.5}_{7.65}$ | ${25.0}_{8.18}$ | ${42.5}_{7.82}$ | ${17.86}_{7.24}$ | ${17.5}_{6.01}$ |
| 19 | Fluent Trk. | ${27.27}_{6.71}$ | ${20.0}_{6.32}$ | ${7.14}_{4.87}$ | ${15.62}_{6.42}$ | ${7.14}_{4.87}$ | ${21.28}_{5.97}$ |
| 19 | State Trk. | ${50.0}_{35.36}$ | ${50.0}_{35.36}$ | ${20.0}_{17.89}$ | ${33.33}_{15.71}$ | ${16.67}_{15.21}$ | ${18.18}_{8.22}$ |
| 19 | Action Exec. | ${40.0}_{8.94}$ | ${56.67}_{9.05}$ | ${30.43}_{9.59}$ | ${33.33}_{10.29}$ | ${30.43}_{9.59}$ | ${21.21}_{7.12}$ |
| 19 | Effects | ${0.0}_{0.0}$ | ${66.67}_{27.22}$ | ${0.0}_{0.0}$ | ${11.11}_{10.48}$ | ${0.0}_{0.0}$ | ${15.79}_{8.37}$ |
| 19 | Num. Reas. | ${20.0}_{6.76}$ | ${15.0}_{5.65}$ | ${7.41}_{5.04}$ | ${3.33}_{3.28}$ | ${0.0}_{0.0}$ | ${20.0}_{6.32}$ |
| 19 | Hallucination | ${31.82}_{7.02}$ | ${41.3}_{7.26}$ | ${22.22}_{9.8}$ | ${17.65}_{6.54}$ | ${27.78}_{10.56}$ | ${27.66}_{6.52}$ |
| 19 | Composite | - | - | - | - | - | - |
| 19 | AVG | ${33.84}_{3.36}$ | ${33.83}_{3.34}$ | ${17.16}_{3.26}$ | ${22.86}_{3.17}$ | ${14.81}_{3.06}$ | ${20.97}_{2.58}$ |

Table 16: Few-shot-1 performance with obfuscations and ramifications across all question categories and action-sequence lengths

C.3 Additional Observations

C.3.1 Observations on Domains Variation

State Tracking and Action Executability are better on the Transportation Domains

Figure 4 reveals statistically significant trends for state tracking and action executability questions: both consistently perform better in transportation domains, with the differences falling outside the standard error of the mean (SEM).

While other tendencies are also observable, such as improved performance in object tracking, fluent tracking, and numerical reasoning within transportation domains, these trends remain within the SEM. This suggests that the differences are not statistically significant although there is a slight preference for transportation settings in these categories. Thus, while transportation domains specifically enhance model capabilities in state tracking and action executability, the influence on other categories of questions is less pronounced and should be interpreted cautiously.
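The comparisons above depend on whether two accuracies differ by more than their standard error of the mean. As a minimal sketch, assuming each accuracy is a proportion of correct answers over n independent questions, the SEM can be estimated with the binomial formula sqrt(p(1-p)/n). The function names, the question counts, and the criterion "gap larger than the sum of both SEMs" are illustrative assumptions, not the procedure stated in the paper.

```python
import math

def accuracy_sem(num_correct: int, num_total: int) -> tuple[float, float]:
    """Return (accuracy, SEM) in percent, assuming a binomial proportion."""
    p = num_correct / num_total
    sem = math.sqrt(p * (1.0 - p) / num_total)
    return 100.0 * p, 100.0 * sem

def differs_beyond_sem(acc_a: float, sem_a: float,
                       acc_b: float, sem_b: float) -> bool:
    """Assumed criterion: the gap exceeds the two SEMs combined."""
    return abs(acc_a - acc_b) > sem_a + sem_b

# Hypothetical counts: 64 of 80 correct versus 44 of 80 correct.
acc_hi, sem_hi = accuracy_sem(64, 80)
acc_lo, sem_lo = accuracy_sem(44, 80)
print(round(acc_hi, 2), round(sem_hi, 2))
print(round(acc_lo, 2), round(sem_lo, 2))
print(differs_beyond_sem(acc_hi, sem_hi, acc_lo, sem_lo))
```

Under this assumption, a gap of a few points between two categories with SEMs above 5 would not count as a difference, which is the cautious reading applied to the within-SEM trends above.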

Hallucination Detection is better on the Non-Transportation Domains

Figure 4 highlights a statistically significant trend wherein hallucination detection tasks exhibit better performance in non-transportation domains. This notable improvement suggests that models may be better tuned or more responsive to the complexities and nuances in scenarios outside of transportation contexts when it comes to recognizing and correcting hallucinations in the data.

Conversely, while there is a slight indication that the effects of actions are also better understood in non-transportation domains, this performance difference remains within the SEM. Therefore, although there is a tendency for improved action effect recognition in non-transportation settings, this finding is not statistically robust and cannot be conclusively affirmed. This underscores the variability in model performance across different domains and highlights the need for cautious interpretation and further validation.

Little Statistical Difference for Ramifications and Domain Types

Analysis of Figure 8 shows that the performance on questions involving ramifications versus those without ramifications falls statistically within the standard error of the mean (SEM). This indicates that the presence or absence of ramifications does not significantly impact overall model performance. Although there is a slight tendency for base and derived fluents to perform better in scenarios involving ramifications, this difference also remains within the SEM boundaries.

In our examination of behavior by domains as depicted in Figure 5, a similar pattern emerges. Base and derived fluents tend towards better performance in transportation-related scenarios, whereas static fluents tend to perform better in non-transportation contexts. However, all these observations remain statistically within the standard error of the mean (SEM), indicating that these trends do not represent significant deviations. For persistent fluents, there is a noticeable, albeit slight, inclination towards better performance in transportation scenarios. Nonetheless, given the proximity of these results to the SEM threshold, these findings should be approached cautiously.

C.3.2 Fluents

C.4 Ramifications

Appendix D Fine-tuning Results and Plots

In this section, we describe the fine-tuning performed on the training split of ActionReasoningBench described in Section 3.6. We fine-tune Llama3-8b for 2 epochs with the AdamW optimizer and a batch size of 4. Figures 9 and 10 show the loss over time for binary and free-answer fine-tuning, respectively.
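The stated setup can be captured in a small configuration object. This is only a sketch: the class name, field names, and the step-count helper are hypothetical, the example dataset size of 1000 is made up for illustration, and settings the text does not mention (such as the learning rate) are deliberately left out rather than guessed.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class FineTuneConfig:
    # Values taken from the description above.
    base_model: str = "Llama3-8b"
    epochs: int = 2
    batch_size: int = 4
    optimizer: str = "AdamW"

    def total_steps(self, num_examples: int) -> int:
        """Optimizer steps, assuming no gradient accumulation and that
        the last partial batch of each epoch is kept."""
        return math.ceil(num_examples / self.batch_size) * self.epochs

config = FineTuneConfig()
# Hypothetical training-set size, used only to illustrate the arithmetic.
print(config.total_steps(1000))
```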

Appendix E Free Answers Evaluation Details

E.1 Metrics Tests

Despite the widespread use of ROUGE (Lin, 2004) in natural language generation tasks, it has several inherent limitations that undermine its reliability as a metric for our task. The examples below illustrate these limitations.

Test Case 1: Fails on simple paraphrasing
block b1 is on b2
block b1 is on top of b2
ROUGE-1: {'r': 0.7142857142857143, 'p': 1.0, 'f': 0.8333333284722222}
ROUGE-2: {'r': 0.5, 'p': 0.75, 'f': 0.5999999952}
ROUGE-L: {'r': 0.7142857142857143, 'p': 1.0, 'f': 0.8333333284722222}

block b1 is on block b2
b1 is on b2
ROUGE-1: {'r': 1.0, 'p': 0.8, 'f': 0.8888888839506174}
ROUGE-2: {'r': 0.6666666666666666, 'p': 0.4, 'f': 0.49999999531250006}
ROUGE-L: {'r': 1.0, 'p': 0.8, 'f': 0.8888888839506174}

Test Case 2: Fails when object names interchanged
block b1 is on block b2
block b2 is on block b1
ROUGE-1: {'r': 1.0, 'p': 1.0, 'f': 0.999999995}
ROUGE-2: {'r': 0.8, 'p': 0.8, 'f': 0.7999999950000002}
ROUGE-L: {'r': 0.6, 'p': 0.6, 'f': 0.5999999950000001}

Test Case 3: Fails when asked about multiple object type names
Question: What is the object type of block b1, block b2, block b3 and block b4
label: block
response: block, block, block and block
ROUGE-1: {'r': 0.3333333333333333, 'p': 1.0, 'f': 0.4999999962500001}
ROUGE-2: {'r': 0.0, 'p': 0.0, 'f': 0.0}
ROUGE-L: {'r': 0.3333333333333333, 'p': 1.0, 'f': 0.4999999962500001}
Even though the response does not match the label exactly, the response is correct.

Test Case 4: Fails when object type changed with same object name
block b1 is under block b2
ball b1 is under ball b2
ROUGE-1: {'r': 0.8, 'p': 0.8, 'f': 0.7999999950000002}
ROUGE-2: {'r': 0.4, 'p': 0.4, 'f': 0.3999999950000001}
ROUGE-L: {'r': 0.8, 'p': 0.8, 'f': 0.7999999950000002}

Test Case 5: Fails on simple negation
block b1 is not on block b2
block b1 is on top of block b2
ROUGE-1: {'r': 0.7142857142857143, 'p': 0.8333333333333334, 'f': 0.7692307642603551}
ROUGE-2: {'r': 0.42857142857142855, 'p': 0.5, 'f': 0.4615384565680473}
ROUGE-L: {'r': 0.7142857142857143, 'p': 0.8333333333333334, 'f': 0.7692307642603551}

Test Case 6: Fails on simple negation with paraphrasing
block b1 is on block b2
block b2 is not under block b1, but on top
ROUGE-1: {'r': 0.4444444444444444, 'p': 0.8, 'f': 0.5714285668367348}
ROUGE-2: {'r': 0.1111111111111111, 'p': 0.2, 'f': 0.14285713826530627}
ROUGE-L: {'r': 0.3333333333333333, 'p': 0.6, 'f': 0.4285714239795918}
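To make the failure mode concrete, the sketch below recomputes n-gram overlap for two of the test cases. It is a simplified stand-in rather than the scoring library used for the numbers above: it assumes whitespace tokenization and counts each distinct n-gram once, treating the first sentence of each pair as the hypothesis and the second as the reference. The function names are hypothetical. Under these assumptions it yields the same precision and recall as the ROUGE-1 and ROUGE-2 rows of Test Cases 2 and 5.

```python
def ngram_set(text: str, n: int) -> set:
    """Distinct n-grams of a whitespace-tokenized sentence."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def rouge_n(hypothesis: str, reference: str, n: int = 1) -> dict:
    """Set-based n-gram overlap: recall, precision, and F1."""
    hyp, ref = ngram_set(hypothesis, n), ngram_set(reference, n)
    overlap = len(hyp & ref)
    p = overlap / len(hyp) if hyp else 0.0
    r = overlap / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return {"r": r, "p": p, "f": f}

# Test Case 5: a negated statement still overlaps heavily with its opposite.
negation = rouge_n("block b1 is not on block b2",
                   "block b1 is on top of block b2")
# Test Case 2: swapping object names leaves the unigram score perfect.
swapped_1 = rouge_n("block b1 is on block b2", "block b2 is on block b1")
swapped_2 = rouge_n("block b1 is on block b2", "block b2 is on block b1", n=2)
print(negation, swapped_1, swapped_2)
```

Because the score depends only on shared surface n-grams, two sentences with opposite meanings can score higher than a correct paraphrase.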

Appendix F Dataset Creation

F.1 Domain, grid-visit-all example

(define (domain grid-visit-all)
  (:requirements :typing)
  (:types place - object)
  (:predicates (connected ?x ?y - place)
               (at-robot ?x - place)
               (visited ?x - place))
  (:action move
    :parameters (?curpos ?nextpos - place)
    :precondition (and
      (at-robot ?curpos)
      (connected ?curpos ?nextpos))
    :effect (and
      (at-robot ?nextpos)
      (not (at-robot ?curpos))
      (visited ?nextpos))))

Going line by line: (:requirements :typing) indicates that the objects in the domain are typed. (:types place - object) defines the types of objects we are working with; in this case, there is only one type, place. (:predicates ...) defines the predicates of the problem, which take objects as arguments and can only be true or false. Finally, we have the actions; this example has a single action, but in general there can be many. Each action has :parameters, which declares typed variables; :precondition, which defines the set of predicates that must be true for the action to be executed; and :effect, which specifies the predicates that become true, or false when wrapped in not, after the action is executed.
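The precondition and effect semantics described above can be sketched in a few lines. This is an illustrative reading that assumes a state is a set of ground atoms and that effects are applied as delete and add sets; the place names loc-a and loc-b and the Python names are hypothetical and do not come from the benchmark code.

```python
from typing import FrozenSet, Optional, Tuple

Atom = Tuple[str, ...]
State = FrozenSet[Atom]

def move(state: State, curpos: str, nextpos: str) -> Optional[State]:
    """Apply the move action; return None if it is not executable."""
    preconditions = {("at-robot", curpos), ("connected", curpos, nextpos)}
    if not preconditions <= state:
        return None  # some precondition does not hold in this state
    add_effects = {("at-robot", nextpos), ("visited", nextpos)}
    delete_effects = {("at-robot", curpos)}  # from (not (at-robot ?curpos))
    return frozenset((state - delete_effects) | add_effects)

# Hypothetical two-place grid with the robot starting at loc-a.
initial: State = frozenset({
    ("connected", "loc-a", "loc-b"),
    ("connected", "loc-b", "loc-a"),
    ("at-robot", "loc-a"),
    ("visited", "loc-a"),
})
after = move(initial, "loc-a", "loc-b")    # executable
blocked = move(initial, "loc-b", "loc-a")  # robot is not at loc-b
print(sorted(after), blocked)
```

Atoms that the action does not mention, such as (visited loc-a) and the connected facts, carry over unchanged, which is the frame behaviour discussed in the introduction.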

F.2 Instance, grid-visit-all example

An instance is composed of objects, initial conditions, and goal conditions. The instance below is trivial because its goal already holds in the initial state.

(define (problem grid-1)
  (:domain grid-visit-all)
  (:objects loc-x1-y0 - place)
  (:init (at-robot loc-x1-y0)
         (visited loc-x1-y0))
  (:goal (and (visited loc-x1-y0))))

F.3 Blocksworld

Blocksworld involves a set of blocks, a table on which the blocks rest, and a hand that can manipulate one block at a time. The objective is to re-stack the blocks from an initial configuration into a desired configuration.

F.4 Logistics

Logistics is a transportation domain that involves transporting packages using two types of vehicles: trucks and airplanes. Package locations lie within cities, and each city has a unique airport. The objective is to move packages from their initial locations to their goal locations, which may be in different cities.

F.5 Depots

Depots is a transportation domain that combines Blocksworld and Logistics. The transportation aspects are the same as in the Logistics domain, but packages can also be stacked as in Blocksworld.

F.6 Driverlog

Driverlog is a transportation domain in which packages are transported by trucks. It involves packages, trucks, locations, and drivers.

F.7 Goldminer

Goldminer involves an agent, bombs, a laser, gold, and a mine organized as a grid. Each cell in the grid contains soft or hard rock. The agent can use bombs or a laser to penetrate the soft rock and uncover gold. The bomb cannot destroy gold, but the laser can. The objective is to collect the gold.

F.8 Gripper

Gripper is a transportation domain that involves a single agent with two grippers that can pick up and put down objects. The objective is to move objects from their initial locations to their goal locations.

F.9 Miconic

Miconic is a transportation domain that involves an elevator, floors, and passengers. The objective is to transport passengers from the initial to goal floors using an elevator.

F.10 Mystery

Mystery is a transportation domain with fuel restrictions. It involves locations, vehicles, fuel and fuel constraints, and cargo. The objective is to organize the transport of cargo under these constraints.

F.11 N-puzzle

A common variant of this domain is the familiar "15 puzzle", a sliding-tile puzzle with 15 tiles numbered 1-15 and one open slot. The objective is to rearrange the tiles into numerical order. N-puzzle is the generalization to N tiles.

F.12 Satellite

This domain involves the scheduling of satellites to gather information about space phenomena. It involves satellites, instruments, image modes, pointing directions, locations, instrument capabilities, calibration target functions, initial goal pointing directions, and image objectives.

F.13 Spanner

This domain involves an agent, locations, spanners (tools), and nuts to be tightened. The objective is to pick up spanners and tighten the nuts. The caveat is that each spanner can be used to tighten only one nut.

F.14 Visit-All

This very simple domain involves an agent that has to visit all cells in an $N\times N$ grid.

F.15 Zenotravel

Zenotravel is a transportation domain that involves airplanes, locations, fuel levels, and passengers. The objective is to transport passengers from their initial locations to their goal locations without running out of fuel.

F.16 Additional Statistics

| Category | test-true_false_answer | test-free_answer | train-true_false_answer | train-free_answer |
|---|---|---|---|---|
| action_executability | 240 | 119 | 2360 | 2481 |
| effects | 957 | 108 | 8243 | 1842 |
| fluent_tracking | 1600 | 130 | 15801 | 5870 |
| hallucination | 240 | 130 | 7660 | 13426 |
| numerical_reasoning | 240 | 120 | 12760 | 6380 |
| object_tracking | 1598 | 120 | 8696 | 1180 |
| state_tracking | 153 | 101 | 1147 | 1849 |

Table 17: Dataset split

F.17 Domains

| Domain | # fluents | # actions | exec (mean ± std) | effects-wo-ram (mean ± std) | effects-wi-ram (mean ± std) | # object types | objects (mean ± std) | actions from a state (mean ± std) | log(state space size) (mean ± std) | Domain type | IPC Year | State Complexity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Blocksworld | 5 | 4 | 2.25 ± 0.83 | 4.5 ± 0.5 | | 1 | 8.1 ± 0.83 | 132.6 ± 26.67 | 92.46 ± 3.96 | | 2000 | $O(2^{N^{2}+2N})$ |
| Depots | 6 | 5 | 3.4 ± 1.36 | 4 ± 1.67 | | 6 | 25.5 ± 2.25 | 4361.8 ± 1052.02 | 158.68 ± 4.57 | | 2002 | $O(2^{4N^{2}+2N})$ |
| DriverLog | 5 | 6 | 2.33 ± 0.47 | 2.33 ± 0.47 | | 4 | 19.1 ± 2.02 | 1120.8 ± 503.46 | 131.37 ± 9.19 | | 2002 | $O(2^{5N^{2}-N})$ |
| GoldMiner | 12 | 7 | 3 ± 0.53 | 2.86 ± 0.83 | | 1 | 15.3 ± 4.29 | 772.8 ± 451.06 | 123.57 ± 10.11 | | 2018 | $O(2^{N^{2}+6N})$ |
| Grippers | 4 | 3 | 2 ± 0.82 | 2.67 ± 0.47 | | 4 | 15.8 ± 1.47 | 245.8 ± 161.38 | 101.14 ± 10.99 | | 1998 | $O(2^{N^{3}+3N^{2}})$ |
| Logistics | 3 | 6 | 2 ± 0.58 | 2 ± 0 | | 5 | 18 ± 2.57 | 623.0 ± 312.62 | 119.21 ± 11.73 | | 1998 | $O(2^{3N^{2}})$ |
| Miconic | 8 | 4 | 2.25 ± 0.43 | 1.75 ± 0.43 | | 2 | 17.1 ± 1.58 | 279.4 ± 58.04 | 106.6 ± 4.08 | | 2000 | $O(2^{3N^{2}+4N})$ |
| Mystery | 7 | 3 | 4 ± 0 | 4 ± 0 | | 5 | 24.2 ± 1.33 | 413.2 ± 154.92 | 113.05 ± 7.48 | | 1998 | $O(2^{7N^{2}-3N})$ |
| NPuzzle | 3 | 1 | 3 ± 0 | 4 ± 0 | | 2 | 17 ± 0 | 576.0 ± 0.0 | 120.77 ± 0.0 | | 2018 | $O(2^{2N^{2}})$ |
| Satellite | 8 | 5 | 2.8 ± 1.47 | 1.8 ± 0.75 | | 4 | 24.7 ± 3.85 | 1234.8 ± 490.17 | 133.48 ± 8.65 | | 2002 | $O(2^{5N^{2}+3N})$ |
| Spanner | 6 | 3 | 3 ± 1.41 | 2.33 ± 0.47 | | 4 | 22.3 ± 0.46 | 455.6 ± 23.83 | 116.29 ± 0.97 | | 2011 | $O(2^{3N^{2}+2N})$ |
| Visitall | 3 | 1 | 2 ± 0 | 3 ± 0 | | 1 | 23.9 ± 3.01 | 556.4 ± 148.19 | 119.5 ± 4.73 | | 2014 | $O(2^{N^{2}+N})$ |
| ZenoTravel | 4 | 5 | 2.8 ± 0.75 | 2.8 ± 0.98 | | 4 | 20.1 ± 1.04 | 4340.0 ± 1282.95 | 158.03 ± 7.03 | | 2002 | $O(2^{5N^{2}-N})$ |

Table 18: Description of the dataset. Note: $N$ denotes the number of objects of each object type. For example, in the Miconic domain there are $N$ passengers and $N$ floors.
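The per-action statistics in Table 18 can be sanity-checked on small domains. The sketch below assumes that the exec column reports the number of preconditions per action and that the effects-wo-ram column reports the number of effects per action, each as a mean with a population standard deviation. The table does not state this explicitly. The Visitall counts come from the move action in Appendix F.1. The Blocksworld counts assume the standard four-action encoding (pick-up, put-down, stack, unstack), which is not listed in this appendix.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def mean_std(counts: List[int]) -> Tuple[float, float]:
    """Mean and population standard deviation, rounded to two decimals."""
    return round(mean(counts), 2), round(pstdev(counts), 2)


# Visitall: a single action with 2 preconditions and 3 effects (Appendix F.1).
visitall_exec = mean_std([2])
visitall_effects = mean_std([3])

# Blocksworld (assumed counts): pick-up, put-down, stack, unstack.
blocksworld_exec = mean_std([3, 1, 2, 3])
blocksworld_effects = mean_std([4, 4, 5, 5])
```

Under these assumptions the computed values agree with the Visitall row (2 ± 0 and 3 ± 0) and the Blocksworld row (2.25 ± 0.83 and 4.5 ± 0.5).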

F.18 Planning Description

Simple planning involves finding a sequence of actions that takes the world from a given initial state to a state that satisfies the goal conditions. Predicates (as defined in PDDL; also referred to as "fluents" in the reasoning about actions community) are properties of the world, such as $on(a,b)$, meaning $a$ is on $b$, and a state of the world describes which fluents are true and which are false in that world. A (planning) domain description $D$ describes the fluents in that domain, the actions in that domain, and how the actions affect the fluents. In formal planning, the language PDDL is often used to describe domains. Given a domain description $D$, the semantics of the description language defines a transition function $\Phi_{D}:\mathit{states}\times\mathit{actions}\rightarrow\mathit{states}$, where $\Phi_{D}(s,a)=s^{\prime}$ means that executing the action $a$ in the state $s$ results in the state $s^{\prime}$. If $\Phi_{D}(s,a)$ is undefined, then $a$ is said to be inexecutable in $s$. A simple goal is a classical formula whose truth value can be evaluated with respect to a state. A planning instance consists of the triplet $(D,s_{0},g)$, where $D$ is a domain description, $s_{0}$ is an initial state, and $g$ is a goal. Given a planning instance $(D,s_{0},g)$, a plan is any sequence of actions $a_{1},\dots,a_{n}$ such that the state $\Phi_{D}(\Phi_{D}(\ldots\Phi_{D}(s_{0},a_{1})\ldots,a_{n-1}),a_{n})$ is defined and $g$ evaluates to true with respect to that state. The plan generation task that we are concerned with is the task where $D$, $s_{0}$, and $g$, described in natural language, are together given as input, and the model has to generate a plan $a_{1},\dots,a_{n}$.
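The definitions above translate directly into a plan checker. The following is a minimal sketch under simplifying assumptions: states are sets of true ground atoms, actions are ground STRIPS-style triples, and the goal is a conjunction of positive atoms rather than an arbitrary classical formula. The names Action, phi, and is_plan are illustrative and do not come from the benchmark code.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional, Tuple

Atom = Tuple[str, ...]
State = FrozenSet[Atom]


@dataclass(frozen=True)
class Action:
    name: str
    preconditions: FrozenSet[Atom]
    add_effects: FrozenSet[Atom]
    del_effects: FrozenSet[Atom]


def phi(state: State, action: Action) -> Optional[State]:
    """Transition function: None models 'undefined' (the action is inexecutable)."""
    if not action.preconditions <= state:
        return None
    return (state - action.del_effects) | action.add_effects


def is_plan(s0: State, actions: Iterable[Action], goal: FrozenSet[Atom]) -> bool:
    """True iff every action is executable in turn and the goal holds at the end."""
    state: Optional[State] = s0
    for action in actions:
        state = phi(state, action)
        if state is None:
            return False
    return goal <= state


# Toy instance with hypothetical cells a and b.
move_a_b = Action(
    name="move a b",
    preconditions=frozenset({("at-robot", "a"), ("connected", "a", "b")}),
    add_effects=frozenset({("at-robot", "b"), ("visited", "b")}),
    del_effects=frozenset({("at-robot", "a")}),
)
start: State = frozenset({("at-robot", "a"), ("visited", "a"), ("connected", "a", "b")})
target = frozenset({("visited", "a"), ("visited", "b")})
```

Here the single-step sequence containing move_a_b is a plan, while repeating the action is not, because the second application is inexecutable.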

F.19 PDDL

"The Planning Domain Definition Language (PDDL) is a formal knowledge representation language designed to express planning models … a de-facto standard input language for many planning systems, although it is not the only modeling language for planning. Several variants of PDDL have emerged that capture planning problem of different natures and complexities, with a focus on deterministic problems." (Haslum et al., 2019). One can use PDDL to describe a planning domain and a planning problem instance (objects, initial conditions, and goal conditions). An example of the Visit-All domain and a problem instance described in PDDL can be seen in Appendices F.1 and F.2. This domain example belongs to a subset of PDDL called "STRIPS" (Stanford Research Institute Problem Solver) (Fikes and Nilsson, 1971). In addition to being STRIPS, this domain description is "typed": objects involved in a problem have type and subtype definitions. In this study, we restrict our attention to typed STRIPS domains and instances. These comprise domains with object types, predicates over typed objects, and actions with preconditions and effects, together with problem instances that specify typed objects and initial and goal states.

Appendix G Calculating State Space

In this section, we calculate the state-space complexity of the domains in ActionReasoningBench. In traditional AI planning, a higher state-space complexity indicates a harder domain (Rintanen, 2004). We therefore examine whether LLMs also struggle more in these "harder" domains. Fig. 11 plots accuracy against the state-space complexity of the domains. The plot indicates that LLMs behave differently from traditional AI solvers.

G.1 Blocksworld

The fluents in a state are defined by the following predicates:

• on(b1,b2)
• ontable(b)
• clear(b)
• holding(b)
• handempty

The state-space complexity, where $N$ is the number of objects, is

$$O(2^{N^{2}+2N}) \tag{1}$$
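The exponent in the expression above counts the ground fluents that can independently be true or false. The appendix does not spell out the counting convention, so the sketch below uses one that reproduces the exponents reported for Blocksworld and Visitall, and it should be read as an assumption: a predicate with $k$ arguments contributes $N^{k}$ atoms, a binary predicate whose two arguments share a type excludes the $N$ atoms of the form $p(x,x)$, and nullary predicates contribute only a constant and are dropped.

```python
from typing import Dict, Sequence


def exponent_terms(predicates: Dict[str, Sequence[str]]) -> Dict[int, int]:
    """Map each power of N to its coefficient in the exponent polynomial."""
    coeffs: Dict[int, int] = {}
    for arg_types in predicates.values():
        arity = len(arg_types)
        if arity == 0:
            continue  # nullary fluents add a constant, ignored in the O(.) bound
        coeffs[arity] = coeffs.get(arity, 0) + 1
        if arity == 2 and arg_types[0] == arg_types[1]:
            coeffs[1] = coeffs.get(1, 0) - 1  # assumed: p(x, x) is excluded
    return {power: c for power, c in coeffs.items() if c != 0}


def exponent(coeffs: Dict[int, int], n: int) -> int:
    """Evaluate the exponent polynomial for a concrete N."""
    return sum(c * n ** power for power, c in coeffs.items())


blocksworld = {
    "on": ("block", "block"),
    "ontable": ("block",),
    "clear": ("block",),
    "holding": ("block",),
    "handempty": (),
}
visitall = {
    "connected": ("place", "place"),
    "at-robot": ("place",),
    "visited": ("place",),
}
```

With this convention, Blocksworld yields $N^{2}+2N$ and Visitall yields $N^{2}+N$, matching the state-complexity column of Table 18 for these two domains.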

G.2 Depots

• (at ?x - locatable ?y - place)
• (on ?x - crate ?y - surface)
• (in ?x - crate ?y - truck)
• (lifting ?x - hoist ?y - crate)
• (available ?x - hoist)
• (clear ?x - surface)

$$O(2^{4N^{2}+2N}) \tag{2}$$

G.3 Driverlog

• (at ?obj - locatable ?loc - location)
• (in ?obj1 - obj ?obj - truck)
• (driving ?d - driver ?v - truck)
• (link ?x ?y - location)
• (path ?x ?y - location)
• (empty ?v - truck)

$$O(2^{5N^{2}-N}) \tag{3}$$

G.4 Goldminer

• (connected ?x - LOC ?y - LOC)
• (robot-at ?x - LOC)
• (bomb-at ?x - LOC)
• (laser-at ?x - LOC)
• (soft-rock-at ?x - LOC)
• (hard-rock-at ?x - LOC)
• (gold-at ?x - LOC)
• (clear ?x - LOC)
• (arm-empty)
• (holds-bomb)
• (holds-laser)
• (holds-gold)

$$O(2^{N^{2}+6N}) \tag{4}$$

G.5 Grippers

• (carry ?r - robot ?o - object ?g - gripper)
• (at-robby ?r - robot ?x - room)
• (at ?o - object ?x - room)
• (free ?r - robot ?g - gripper)

$$O(2^{N^{3}+3N^{2}}) \tag{5}$$

G.6 Logistics

• (in-city ?loc - place ?city - city)
• (at ?obj - physobj ?loc - place)
• (in ?pkg - package ?veh - vehicle)

$$O(2^{3N^{2}}) \tag{6}$$

G.7 Miconic

• (origin ?person - passenger ?floor - floor)
• (destin ?person - passenger ?floor - floor)
• (above ?floor1 - floor ?floor2 - floor)
• (boarded ?person - passenger)
• (not-boarded ?person - passenger)
• (served ?person - passenger)
• (not-served ?person - passenger)
• (lift-at ?floor - floor)

$$O(2^{3N^{2}+4N}) \tag{7}$$

G.8 Mystery

• (at ?v - movable ?l - location)
• (has-fuel ?l - location ?f - fuel)
• (in ?c - cargo ?v - vehicle)
• (has-space ?v - vehicle ?s - space)
• (conn ?l1 ?l2 - location)
• (fuel-neighbor ?f1 ?f2 - fuel)
• (space-neighbor ?s1 ?s2 - space)

$$O(2^{7N^{2}-3N}) \tag{8}$$

G.9 N-Puzzle

• (at ?tile - tile ?position - position)
• (neighbor ?p1 - position ?p2 - position)
• (empty ?position - position)

$$O(2^{2N^{2}}) \tag{9}$$

G.10 Satellite

• (on-board ?i - instrument ?s - satellite)
• (supports ?i - instrument ?m - mode)
• (pointing ?s - satellite ?d - direction)
• (have-image ?d - direction ?m - mode)
• (calibration-target ?i - instrument ?d - direction)
• (power-avail ?s - satellite)
• (power-on ?i - instrument)
• (calibrated ?i - instrument)

$$O(2^{5N^{2}+3N}) \tag{10}$$

G.11 Spanner

• (at ?m - locatable ?l - location)
• (carrying ?m - man ?s - spanner)
• (link ?l1 - location ?l2 - location)
• (useable ?s - spanner)
• (tightened ?n - nut)
• (loose ?n - nut)

$$O(2^{3N^{2}+2N}) \tag{11}$$

G.12 Visitall

• (connected ?x ?y - place)
• (at-robot ?x - place)
• (visited ?x - place)

$$O(2^{N^{2}+N}) \tag{12}$$

G.13 Zenotravel

• (at ?x - (either person aircraft) ?c - city)
• (in ?p - person ?a - aircraft)
• (fuel-level ?a - aircraft ?l - level)
• (next ?l1 ?l2 - flevel)

$$O(2^{5N^{2}-N}) \tag{13}$$

novyuecllio a qbyyxzqvdh is only possible if that qbyyxzqvdh is clear, wtuwjwbuja the gcbwvwyvkv, and the egpbpdtalq is empty. ovyuecllio the qbyyxzqvdh leads to the qbyyxzqvdh being casqqrrojp. xskgihccqt the qbyyxzqvdh can only be executed if the qbyyxzqvdh is being casqqrrojp. xskgihccqt the qbyyxzqvdh causes the qbyyxzqvdh to be wtuwjwbuja the gcbwvwyvkv. A qbyyxzqvdh can be oscckwdtoh wtuwjwbuja the second qbyyxzqvdh if it is being casqqrrojp and the second qbyyxzqvdh is clear. By oscckwdtoh the first qbyyxzqvdh wtuwjwbuja the second, it causes the first qbyyxzqvdh to be wtuwjwbuja top of the second qbyyxzqvdh. The qbyyxzqvdh can also be wxqdwukszo from the top of the second qbyyxzqvdh only if the egpbpdtalq is empty and the first qbyyxzqvdh is clear and wtuwjwbuja top of the second qbyyxzqvdh.

Wxqdwukszo the first qbyyxzqvdh from the second causes first qbyyxzqvdh to be casqqrrojp A qbyyxzqvdh is said to be clear if it is not being casqqrrojp and there are no qbyyxzqvdhs that are wtuwjwbuja top of it. The egpbpdtalq is said to be empty if and only if it is not casqqrrojp any qbyyxzqvdh. The qbyyxzqvdh can only be at one place at a time.

A state is a set of valid properties. Properties may or may not involve negations. Properties of the state can be of 4 types: base, derived, persistent, and static. Base properties of the state are properties that don’t depend wtuwjwbuja other properties. In this domain, they are: a qbyyxzqvdh is wtuwjwbuja the gcbwvwyvkv and a qbyyxzqvdh is not wtuwjwbuja the gcbwvwyvkv. Derived properties of the state are properties that depend wtuwjwbuja other properties. In this domain, they are: a qbyyxzqvdh is clear, egpbpdtalq is empty, a qbyyxzqvdh is not clear and egpbpdtalq is not empty. Self constraint properties of the state are properties that depend wtuwjwbuja themselves. In this domain, they are: a qbyyxzqvdh is being casqqrrojp, a qbyyxzqvdh is wtuwjwbuja another qbyyxzqvdh, a qbyyxzqvdh is not being casqqrrojp and a qbyyxzqvdh is not wtuwjwbuja another qbyyxzqvdh. Static properties of the state are properties that don’t change under any action. There are no static properties of the state in this domain.

H.4 With Obfuscation Without Ramifications

[DOMAIN DESCRIPTION]

novyuecllio a qbyyxzqvdh is only possible if that qbyyxzqvdh is clear, wtuwjwbuja the gcbwvwyvkv, and the egpbpdtalq is empty. By ovyuecllio that qbyyxzqvdh, it makes that qbyyxzqvdh not present wtuwjwbuja the gcbwvwyvkv and not clear. It also leads to the qbyyxzqvdh being casqqrrojp and makes the egpbpdtalq not empty. xskgihccqt the qbyyxzqvdh can only be executed if the qbyyxzqvdh is being casqqrrojp. xskgihccqt the qbyyxzqvdh causes that qbyyxzqvdh to be clear and wtuwjwbuja the gcbwvwyvkv. It also causes the egpbpdtalq to be not casqqrrojp the qbyyxzqvdh and makes the egpbpdtalq empty. A qbyyxzqvdh can be oscckwdtoh wtuwjwbuja the second qbyyxzqvdh if it is being casqqrrojp and the second qbyyxzqvdh is clear. By oscckwdtoh the first qbyyxzqvdh wtuwjwbuja the second, it causes the first qbyyxzqvdh to clear and wtuwjwbuja top of the second qbyyxzqvdh. Meanwhile, the second qbyyxzqvdh is not clear, and the egpbpdtalq becomes empty as it is not casqqrrojp the qbyyxzqvdh. The qbyyxzqvdh can also be wxqdwukszo from the top of the second qbyyxzqvdh only if the egpbpdtalq is empty and the first qbyyxzqvdh is clear and wtuwjwbuja top of the second qbyyxzqvdh. Wxqdwukszo the first qbyyxzqvdh from the second causes the second qbyyxzqvdh to be clear. The first qbyyxzqvdh is now being casqqrrojp, not clear, and not wtuwjwbuja top of the second qbyyxzqvdh. Furthermore, the egpbpdtalq is not empty.

A state is a set of valid properties. Properties may or may not involve negations. Properties of the state can be of 4 types: base, derived, persistent, and static. Base properties of the state are properties that don’t depend wtuwjwbuja other properties. In this domain, they are: a qbyyxzqvdh is wtuwjwbuja the gcbwvwyvkv and a qbyyxzqvdh is not wtuwjwbuja the gcbwvwyvkv. Derived properties of the state are properties that depend wtuwjwbuja other properties. In this domain, they are: a qbyyxzqvdh is clear, egpbpdtalq is empty, a qbyyxzqvdh is not clear and egpbpdtalq is not empty. Self constraint properties of the state are properties that depend wtuwjwbuja themselves. In this domain, they are: a qbyyxzqvdh is being casqqrrojp, a qbyyxzqvdh is wtuwjwbuja another qbyyxzqvdh, a qbyyxzqvdh is not being casqqrrojp and a qbyyxzqvdh is not wtuwjwbuja another qbyyxzqvdh. Static properties of the state are properties that don’t change under any action. There are no static properties of the state in this domain.
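The obfuscated descriptions replace domain-specific words with random strings while leaving the sentence structure intact. The exact procedure is not given in this appendix, so the sketch below is only one plausible way to produce such text. It assumes whole-word, case-insensitive substitution with a fixed mapping, which is consistent with the replacement string for "on" also appearing in phrases such as "depend on" while words like "location" stay untouched. The function names and the string length are illustrative.

```python
import random
import re
import string
from typing import Dict, Iterable


def make_mapping(words: Iterable[str], seed: int = 0, length: int = 10) -> Dict[str, str]:
    """Assign each (lowercase) word a random lowercase string, reproducibly."""
    rng = random.Random(seed)
    return {
        word: "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        for word in words
    }


def obfuscate(text: str, mapping: Dict[str, str]) -> str:
    """Replace every whole-word occurrence of a mapped word, keeping capitalization."""
    alternatives = "|".join(re.escape(w) for w in sorted(mapping, key=len, reverse=True))
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)

    def replace(match: "re.Match[str]") -> str:
        word = match.group(0)
        new = mapping[word.lower()]
        return new.capitalize() if word[0].isupper() else new

    return pattern.sub(replace, text)


# Fixed mapping copied from two strings visible in the obfuscated text above.
example_mapping = {"block": "qbyyxzqvdh", "on": "wtuwjwbuja"}
example = obfuscate("Block b1 is on block b2 at this location", example_mapping)
```

Using one mapping for both the domain description and the questions keeps the obfuscated vocabulary consistent across a prompt.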

Appendix I Examples of Questions

I.1 Object Tracking

I.1.1 True/False Question, Action-Sequence Length 1

[DOMAIN_DESCRIPTION]:

{domain description}

[INITIAL_CONDITIONS]:

Block b1 is on the table, block b2 is clear, block b2 is on the table, block b3 is clear, block b3 is placed on top of block b7, block b4 is placed on top of block b1, block b5 is clear, block b5 is placed on top of block b4, block b6 is located on the table, block b7 is on top of block b6 and hand is empty.

[Question]:

Given the initial condition, the following actions are performed: from top of block b7, block b3 is unstacked to reach the current state. In this state, is it True or False that the following properties of the state are correct for b1: block b1 is located on the table?

[Answer]:

True

I.1.2 Free Answer Question, Action-Sequence Length 1

I.4.1 True/False Question, Action-Sequence Length 1

[DOMAIN_DESCRIPTION]

{domain description}

[INITIAL_CONDITIONS]

Block b1 is placed on top of block b4, block b2 is clear, block b2 is on block b6, block b3 is clear, block b3 is on top of block b5, block b4 is located on the table, block b5 is on top of block b7, block b6 is located on the table, block b7 is placed on top of block b1 and hand is empty.

[QUESTION]

Given the initial condition, the following actions are planned to be performed: block b5 is unstacked from top of block b6. Is it possible to execute it, True or False?

[ANSWER]

False
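The answer to this example can be checked mechanically. The sketch below encodes the initial conditions above as ground atoms and tests the precondition of unstack, assuming the usual Blocksworld rule that the hand is empty and the top block is clear and on the bottom block. The atom names (on, ontable, clear, handempty) are an assumed encoding of the natural-language conditions.

```python
from typing import FrozenSet, Tuple

Atom = Tuple[str, ...]

# Initial conditions of the example, written as ground atoms.
initial_state: FrozenSet[Atom] = frozenset({
    ("on", "b1", "b4"), ("clear", "b2"), ("on", "b2", "b6"),
    ("clear", "b3"), ("on", "b3", "b5"), ("ontable", "b4"),
    ("on", "b5", "b7"), ("ontable", "b6"), ("on", "b7", "b1"),
    ("handempty",),
})


def can_unstack(state: FrozenSet[Atom], top: str, bottom: str) -> bool:
    """Unstack is executable iff the hand is empty and top is clear and on bottom."""
    required = {("handempty",), ("clear", top), ("on", top, bottom)}
    return required <= state


answer = can_unstack(initial_state, "b5", "b6")
```

Block b5 is on b7 rather than b6 and is not clear, so the check fails, in agreement with the answer False.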

I.4.2 Free Answer Question, Action-Sequence Length 1

I.8.1 True/False Question, Action-Sequence Length 1

[DOMAIN_DESCRIPTION]

{domain description}

[INITIAL_CONDITIONS]

Block b1 is on the table, block b2 is clear, block b2 is on top of block b4, block b3 is on block b8, block b4 is on the table, block b5 is on the table, block b6 is clear, block b6 is placed on top of block b3, block b7 is clear, block b7 is placed on top of block b1, block b8 is placed on top of block b5 and hand is not holding anything.

[QUESTION]

Given the initial condition, the following actions are planned to be performed: block b2 is unstacked from block b4 to reach the current state. Are the following properties of the state true for b7 before the first inexecutable action in the sequence? Block b1 is not on top of block b7, block b2 is on block b7, block b3 is not on top of block b7, block b4 is not placed on top of block b7, block b5 is not placed on top of block b7, block b6 is not on top of block b7, block b7 is not being held by the hand, block b7 is not on block b3, block b7 is not on block b8, block b7 is not on top of block b1, block b7 is not on top of block b2, block b7 is not on top of block b5, block b7 is not placed on top of block b6, block b7 is placed on top of block b4 and block b8 is not on top of block b7.

[ANSWER]

True

I.8.2 Free Answer Question, Action-Sequence Length 1

[DOMAIN DESCRIPTION]

{domain description}

[INITIAL CONDITIONS]

[QUESTION]

Given the initial condition, the following actions are planned to be performed: block b2 is unstacked from block b4 to reach the current state. Some of the actions may not be executable. What is the state before the first infeasible action in the sequence? Write None if there are none.

[ANSWER]:

None
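Composite questions of this kind ask for the state reached just before the first inexecutable action. A minimal sketch of that computation follows. It takes the transition function as a parameter, so it does not depend on any particular domain. The toy step function and the ground unstack action are illustrative assumptions, not benchmark code.

```python
from typing import Callable, FrozenSet, Optional, Sequence, Tuple

Atom = Tuple[str, ...]
State = FrozenSet[Atom]
GroundAction = Tuple[FrozenSet[Atom], FrozenSet[Atom], FrozenSet[Atom]]  # (pre, add, delete)


def toy_step(state: State, action: GroundAction) -> Optional[State]:
    """STRIPS-style step; None means the action is not executable."""
    pre, add, delete = action
    if not pre <= state:
        return None
    return (state - delete) | add


def state_before_first_inexecutable(
    s0: State,
    actions: Sequence[GroundAction],
    step: Callable[[State, GroundAction], Optional[State]],
) -> Optional[Tuple[int, State]]:
    """Return (index, state) for the first inexecutable action, or None if all execute."""
    state = s0
    for index, action in enumerate(actions):
        next_state = step(state, action)
        if next_state is None:
            return index, state
        state = next_state
    return None


unstack_b2_b4: GroundAction = (
    frozenset({("on", "b2", "b4"), ("clear", "b2"), ("handempty",)}),
    frozenset({("holding", "b2"), ("clear", "b4")}),
    frozenset({("on", "b2", "b4"), ("clear", "b2"), ("handempty",)}),
)
s0: State = frozenset({("on", "b2", "b4"), ("clear", "b2"), ("handempty",)})
```

Returning None when every action is executable mirrors the instruction in the question to write None if there is no infeasible action.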

Appendix J Fluent Types Questions

All of the examples below are from the goldminer domain. Each fluent type is defined in Section 3.2. The fluent types for the goldminer domain are:

Base Fluents = {bomb_at, laser_at, soft_rock_at, hard_rock_at, holds_bomb, holds_laser, holds_gold}

Base Fluents with Constraints = {robot_at, gold_at}

Derived Fluents = {arm_empty, clear}

Static Properties = {connected}
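The grouping above can be stored as a lookup table, for example to select questions by fluent type. The following is a small sketch. The dictionary keys and the helper name fluent_type are illustrative choices, not identifiers from the benchmark.

```python
from typing import Dict, Set

# Goldminer fluent types as listed above.
GOLDMINER_FLUENT_TYPES: Dict[str, Set[str]] = {
    "base": {"bomb_at", "laser_at", "soft_rock_at", "hard_rock_at",
             "holds_bomb", "holds_laser", "holds_gold"},
    "base_with_constraints": {"robot_at", "gold_at"},
    "derived": {"arm_empty", "clear"},
    "static": {"connected"},
}


def fluent_type(predicate: str) -> str:
    """Return the fluent type of a goldminer predicate, or raise KeyError if unknown."""
    for type_name, predicates in GOLDMINER_FLUENT_TYPES.items():
        if predicate in predicates:
            return type_name
    raise KeyError(predicate)
```

The four groups are disjoint and together cover the 12 fluents reported for GoldMiner in Table 18.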

J.1 Base Fluents

J.1.1 Object Tracking

[QUESTION]

[QUESTION]

[ANSWER]: TRUE

Appendix K Prompts

K.1 Without Obfuscation With Ramifications

K.1.1 Few Shot 1

[DOMAIN DESCRIPTION]

A surface can be a pallet or a crate. A truck can be driven from one location to another only if it is present at the first location. Driving the truck makes the truck to be present at the final destination. A hoist can lift a crate from a surface only when the hoist and the crate are present in the same location, the hoist is available, the crate is present on the surface, and the crate is clear.Lifting causes the hoist to lift the crate. Dropping the crate is possible only if the hoist, crate, and surface are present in the same location, the hoist is lifting the crate, and the surface is clear. Dropping the crate causes it to be on top of the surface and be present at the location where it was dropped. It also causes the hoist to not lift the crate. Loading the crate on a truck can be executed only when the crate and truck are present in the same location and the hoist is lifting the crate. Loading the crate onto the truck causes the hoist to be not lifting the crate, and the crate to be in the truck. Unloading the crate from the truck can be done only if the hoist and truck are in the same place, the hoist is available, and the crate is present in the truck. Unloading the crate from the truck causes the crate to be not in the truck, and the hoist to be lifting the crate. A hoist is available if and only if it does not lift anything (any crate). A surface is clear if and only if no crates are on top of that surface. A crate is clear if and only if no hoist is lifting that crate. A truck can be only at one location. A crate can only be at one location. A crate can only be on top of one surface.

A state is a set of valid properties. Properties may or may not involve negations. Properties of the state can be of 4 types: base, derived, persistent, and static. Base properties of the state are properties that don’t depend on other properties. There are no base properties of the state in this domain. Derived properties of the state are properties that depend on other properties. In this domain, they are: a surface is clear, hoist is available, a surface is not clear and hoist is not available. Self constraint properties of the state are properties that depend on themselves. In this domain, they are: at, on, in, a hoist is lifting a crate, not at, not on, not in and a hoist is not lifting a crate. Static properties of the state are properties that don’t change under any action. There are no static properties of the state in this domain.

[EXAMPLE 1]

[INITIAL CONDITIONS]

Initially, Crate0 is located at depot0, crate0 is on pallet0, crate1 is located at depot0, crate1 is on top of crate0, crate2 can be found located at depot1, crate2 is clear, crate2 is on pallet1, crate3 is clear, crate3 is on crate1, crate4 is at distributor2, crate4 is clear, depot0 is where crate3 is located, depot0 is where truck0 is located, depot1 is where hoist1 is located, depot1 is where pallet1 is located, distributor0 is where hoist2 is located, distributor0 is where pallet2 is located, distributor1 is where hoist3 is located, distributor2 is where pallet4 is located, hoist0 is at depot0, hoist0 is available, hoist1 is available for work, hoist2 is available for work, hoist3 is available for work, hoist4 is available, hoist4 is located at distributor2, pallet0 can be found located at depot0, pallet2 is clear of any crates, pallet3 is at distributor1, pallet3 is clear of any crates, pallet4 has crate4 on it and truck1 is located at depot0.

[QUESTION]

Given the initial condition, the following actions are performed: at depot1, hoist1 lifts crate2 off pallet1 to reach the current state. In this state, is it True or False that the following properties of the state are correct for pallet4: pallet4 is not clear?

Just provide your answer as TRUE/FALSE.

[ANSWER]: True

Based on the above examples, answer the below question:

[INITIAL CONDITIONS]

Initially, Crate1 is at distributor2, crate1 is clear of any crates, crate1 is on crate0, crate2 is at depot0, crate2 is clear, crate2 is on pallet0, crate3 can be found located at depot2, crate3 is clear, crate3 is on pallet2, depot0 is where pallet0 is located, depot2 is where pallet2 is located, distributor0 is where hoist3 is located, distributor1 is where hoist4 is located, distributor2 is where crate0 is located, distributor3 is where pallet6 is located, hoist0 can be found located at depot0, hoist0 is accessible, hoist1 is at depot1, hoist1 is available for work, hoist2 can be found located at depot2, hoist2 is available for work, hoist3 is available for work, hoist4 is accessible, hoist5 can be found located at distributor2, hoist5 is available, hoist6 can be found located at distributor3, hoist6 is accessible, pallet1 can be found located at depot1, pallet1 is clear of any crates, pallet3 is clear, pallet3 is located at distributor0, pallet4 can be found located at distributor1, pallet4 is clear, pallet5 has crate0 on it, pallet5 is at distributor2, pallet6 is clear of any crates, truck0 is at distributor2, truck1 is at depot1 and truck2 can be found located at depot2.

[QUESTION]

Given the initial condition, the following actions are performed: from depot1, truck1 is driven to depot0 to reach the current state. In this state, is it True or False that the following properties of the state are correct for pallet4: pallet4 is clear of any crates?.

Just provide your answer as TRUE/FALSE.

[ANSWER]:

K.1.2 Few Shot 5

[DOMAIN DESCRIPTION]

{domain description}

[EXAMPLE 1]

[INITIAL CONDITIONS]

{initial conditions}

[QUESTION]

{question}

[ANSWER]:

{answer}

[EXAMPLE 5]

[INITIAL CONDITIONS]

{initial conditions}

[QUESTION]

{question}

[ANSWER]:

{answer}

Based on the above examples, answer the below question:

[INITIAL CONDITIONS]

[QUESTION]

Just provide your answer as TRUE/FALSE.

[ANSWER]:
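The few-shot template above can be assembled programmatically. The sketch below follows the section markers and their order as shown in the template. The function name, its parameters, and the blank-line separator are assumptions made for illustration.

```python
from typing import List, Tuple

Example = Tuple[str, str, str]  # (initial conditions, question, answer)


def build_prompt(
    domain_description: str,
    examples: List[Example],
    initial_conditions: str,
    question: str,
    instruction: str = "Just provide your answer as TRUE/FALSE.",
) -> str:
    """Assemble a few-shot prompt from a domain description, examples and a query."""
    parts = ["[DOMAIN DESCRIPTION]", domain_description]
    for number, (init, example_question, answer) in enumerate(examples, start=1):
        parts += [
            f"[EXAMPLE {number}]",
            "[INITIAL CONDITIONS]", init,
            "[QUESTION]", example_question,
            "[ANSWER]:", answer,
        ]
    parts += [
        "Based on the above examples, answer the below question:",
        "[INITIAL CONDITIONS]", initial_conditions,
        "[QUESTION]", question,
        instruction,
        "[ANSWER]:",
    ]
    return "\n\n".join(parts)


# Placeholder strings stand in for real benchmark content.
shots = [("{initial conditions}", "{question}", "{answer}")] * 5
prompt = build_prompt("{domain description}", shots, "{initial conditions}", "{question}")
```

Passing one example instead of five gives the one-shot variant. For free-answer questions the instruction argument would need different wording.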

K.2 Without Obfuscation Without Ramifications

K.2.1 Few Shot 1

[DOMAIN DESCRIPTION]

A surface can be a pallet or a crate. A truck can be driven from one location to another only if it is present at the first location. Driving the truck makes the truck to be present at the final destination and not at the starting location. A hoist can lift a crate from a surface only when the hoist and the crate are present in the same location, the hoist is available, the crate is present on the surface, and the crate is clear. Lifting causes the hoist to lift the crate, the crate to be not in that location, not clear, and not on top of the previous surface, which is now clear. Moreover, lifting the crate causes the hoist to be not available. Dropping the crate is possible only if the hoist, crate, and surface are present in the same location, the hoist is lifting the crate, and the surface is clear. Dropping the crate causes it to be on top of the surface, be present in the location where it was dropped, and be clear. It also causes the hoist to become available, and not lift the crate. The surface on which the crate was dropped becomes not clear. Loading the crate on a truck can be executed only when the crate and truck are present in the same location, and the hoist is lifting the crate. Loading the crate onto the truck causes the hoist to be available and not lifting the crate, and crate to be in the truck. Unloading the crate from the truck can be done only if the hoist and truck are in the same place, the hoist is available, and the crate is present in the truck. Unloading the crate from the truck causes the crate to be not in the truck, and the hoist to be lifting the crate and not available.

A state is a set of valid properties. Properties may or may not involve negations. Properties of the state can be of 4 types: base, derived, persistent, and static. Base properties of the state are properties that don’t depend on other properties. There are no base properties of the state in this domain. Derived properties of the state are properties that depend on other properties. In this domain, they are: a surface is clear, hoist is available, a surface is not clear and hoist is not available. Self constraint properties of the state are properties that depend on themselves. In this domain, they are: at, on, in, a hoist is lifting a crate, not at, not on, not in and a hoist is not lifting a crate. Static properties of the state are properties that don’t change under any action. There are no static properties of the state in this domain.

[EXAMPLE 1]

[INITIAL CONDITIONS]

[QUESTION]

Given the initial condition, the following actions are performed: crate2 is lifted from pallet1 at depot1 by hoist1 to reach the current state. In this state, is it True or False that the following properties of the state are correct for pallet2: pallet2 is not clear?

Just provide your answer as TRUE/FALSE.

[ANSWER]: False

Based on the above examples, answer the below question:

[INITIAL CONDITIONS]

[QUESTION]

Just provide your answer as TRUE/FALSE.

[ANSWER]:

K.2.2 Few Shot 5

[DOMAIN DESCRIPTION]

{domain description}

[EXAMPLE 1]

[INITIAL CONDITIONS]

{initial conditions}

[QUESTION]

{question}

[ANSWER]:

{answer}

[EXAMPLE 5]

[INITIAL CONDITIONS]

{initial conditions}

[QUESTION]

{question}

[ANSWER]:

{answer}

Based on the above examples, answer the below question:

[INITIAL CONDITIONS]

[QUESTION]

Just provide your answer as TRUE/FALSE.

[ANSWER]:

K.3 With Obfuscation With Ramifications

K.3.1 Few Shot 1

[DOMAIN DESCRIPTION]

A fshxjwxean can be a tzrwjuotxz or a pjrluufopq. A nblmdziyqf can be jzmscukkyy from one location to another only if it is present cmgnqlveog the first location. Jzmscukkyy the nblmdziyqf makes the nblmdziyqf to be present cmgnqlveog the final destination. A suhmddooyi can aeaygzpsjc a pjrluufopq from a fshxjwxean only when the suhmddooyi and the pjrluufopq are present qxkqlxlezx the same location, the suhmddooyi is available, the pjrluufopq is present fpqyitbzqq the fshxjwxean, and the pjrluufopq is clear.Lifting causes the suhmddooyi to aeaygzpsjc the pjrluufopq. Uckhudtpif the pjrluufopq is possible only if the suhmddooyi, pjrluufopq, and fshxjwxean are present qxkqlxlezx the same location, the suhmddooyi is aeaygzpsjc the pjrluufopq, and the fshxjwxean is clear. Uckhudtpif the pjrluufopq causes it to be fpqyitbzqq top of the fshxjwxean and be present cmgnqlveog the location where it was uckhudtpif. It also causes the suhmddooyi to not aeaygzpsjc the pjrluufopq. Gjqgfjtbnf the pjrluufopq fpqyitbzqq a nblmdziyqf can be executed only when the pjrluufopq and nblmdziyqf are present qxkqlxlezx the same location and the suhmddooyi is aeaygzpsjc the pjrluufopq. Gjqgfjtbnf the pjrluufopq onto the nblmdziyqf causes the suhmddooyi to be not aeaygzpsjc the pjrluufopq, and the pjrluufopq to be qxkqlxlezx the nblmdziyqf. Gpztfzvsux the pjrluufopq from the nblmdziyqf can be done only if the suhmddooyi and nblmdziyqf are qxkqlxlezx the same place, the suhmddooyi is available, and the pjrluufopq is present qxkqlxlezx the nblmdziyqf. Gpztfzvsux the pjrluufopq from the nblmdziyqf causes the pjrluufopq to be not qxkqlxlezx the nblmdziyqf, and the suhmddooyi to be aeaygzpsjc the pjrluufopq. A suhmddooyi is available if and only if it does not aeaygzpsjc anything (any crate). A fshxjwxean is clear if and only if no pjrluufopq are fpqyitbzqq top of that fshxjwxean. A pjrluufopq is clear if and only if no suhmddooyi is aeaygzpsjc that pjrluufopq. 
A nblmdziyqf can be only cmgnqlveog one location. A pjrluufopq can only be cmgnqlveog one location. A pjrluufopq can only be fpqyitbzqq top of one fshxjwxean.

A state is a set of valid properties. Properties may or may not involve negations. Properties of the state can be of 4 types: base, derived, persistent, and static. Base properties of the state are properties that don’t depend fpqyitbzqq other properties. There are no base properties of the state qxkqlxlezx this domain. Derived properties of the state are properties that depend fpqyitbzqq other properties. Qxkqlxlezx this domain, they are: a fshxjwxean is clear, suhmddooyi is available, a fshxjwxean is not clear and suhmddooyi is not available. Self constraint properties of the state are properties that depend fpqyitbzqq themselves. Qxkqlxlezx this domain, they are: cmgnqlveog, fpqyitbzqq, qxkqlxlezx, a suhmddooyi is aeaygzpsjc a pjrluufopq, not cmgnqlveog, not fpqyitbzqq, not qxkqlxlezx and a suhmddooyi is not aeaygzpsjc a pjrluufopq. Static properties of the state are properties that don’t change under any action. There are no static properties of the state qxkqlxlezx this domain.

[EXAMPLE 1]

[INITIAL CONDITIONS]

Initially, Pjrluufopq0 is located cmgnqlveog viwwhelzlg0, pjrluufopq0 is fpqyitbzqq tzrwjuotxz0, pjrluufopq1 is located cmgnqlveog viwwhelzlg0, pjrluufopq1 is fpqyitbzqq top of pjrluufopq0, pjrluufopq2 can be found located cmgnqlveog viwwhelzlg1, pjrluufopq2 is clear, pjrluufopq2 is fpqyitbzqq tzrwjuotxz1, pjrluufopq3 is clear, pjrluufopq3 is fpqyitbzqq pjrluufopq1, pjrluufopq4 is cmgnqlveog vjbtrtgdsx2, pjrluufopq4 is clear, viwwhelzlg0 is where pjrluufopq3 is located, viwwhelzlg0 is where nblmdziyqf0 is located, viwwhelzlg1 is where suhmddooyi1 is located, viwwhelzlg1 is where tzrwjuotxz1 is located, vjbtrtgdsx0 is where suhmddooyi2 is located, vjbtrtgdsx0 is where tzrwjuotxz2 is located, vjbtrtgdsx1 is where suhmddooyi3 is located, vjbtrtgdsx2 is where tzrwjuotxz4 is located, suhmddooyi0 is cmgnqlveog viwwhelzlg0, suhmddooyi0 is available, suhmddooyi1 is available for work, suhmddooyi2 is available for work, suhmddooyi3 is available for work, suhmddooyi4 is available, suhmddooyi4 is located cmgnqlveog vjbtrtgdsx2, tzrwjuotxz0 can be found located cmgnqlveog viwwhelzlg0, tzrwjuotxz2 is clear of any pjrluufopq, tzrwjuotxz3 is cmgnqlveog vjbtrtgdsx1, tzrwjuotxz3 is clear of any pjrluufopq, tzrwjuotxz4 has pjrluufopq4 fpqyitbzqq it and nblmdziyqf1 is located cmgnqlveog viwwhelzlg0.

[QUESTION]

Given the initial condition, the following actions are performed: cmgnqlveog viwwhelzlg1, suhmddooyi1 aeaygzpsjc pjrluufopq2 off tzrwjuotxz1 to reach the current state. Qxkqlxlezx this state, is it True or False that the following properties of the state are correct for tzrwjuotxz4: tzrwjuotxz4 is not clear?

Just provide your answer as TRUE/FALSE.

[ANSWER]: True

Based fpqyitbzqq the above examples, answer the below question:

[INITIAL CONDITIONS]

Initially, Pjrluufopq1 is cmgnqlveog vjbtrtgdsx2, pjrluufopq1 is clear of any pjrluufopq, pjrluufopq1 is fpqyitbzqq pjrluufopq0, pjrluufopq2 is cmgnqlveog viwwhelzlg0, pjrluufopq2 is clear, pjrluufopq2 is fpqyitbzqq tzrwjuotxz0, pjrluufopq3 can be found located cmgnqlveog viwwhelzlg2, pjrluufopq3 is clear, pjrluufopq3 is fpqyitbzqq tzrwjuotxz2, viwwhelzlg0 is where tzrwjuotxz0 is located, viwwhelzlg2 is where tzrwjuotxz2 is located, vjbtrtgdsx0 is where suhmddooyi3 is located, vjbtrtgdsx1 is where suhmddooyi4 is located, vjbtrtgdsx2 is where pjrluufopq0 is located, vjbtrtgdsx3 is where tzrwjuotxz6 is located, suhmddooyi0 can be found located cmgnqlveog viwwhelzlg0, suhmddooyi0 is accessible, suhmddooyi1 is cmgnqlveog viwwhelzlg1, suhmddooyi1 is available for work, suhmddooyi2 can be found located cmgnqlveog viwwhelzlg2, suhmddooyi2 is available for work, suhmddooyi3 is available for work, suhmddooyi4 is accessible, suhmddooyi5 can be found located cmgnqlveog vjbtrtgdsx2, suhmddooyi5 is available, suhmddooyi6 can be found located cmgnqlveog vjbtrtgdsx3, suhmddooyi6 is accessible, tzrwjuotxz1 can be found located cmgnqlveog viwwhelzlg1, tzrwjuotxz1 is clear of any pjrluufopq, tzrwjuotxz3 is clear, tzrwjuotxz3 is located cmgnqlveog vjbtrtgdsx0, tzrwjuotxz4 can be found located cmgnqlveog vjbtrtgdsx1, tzrwjuotxz4 is clear, tzrwjuotxz5 has pjrluufopq0 fpqyitbzqq it, tzrwjuotxz5 is cmgnqlveog vjbtrtgdsx2, tzrwjuotxz6 is clear of any pjrluufopq, nblmdziyqf0 is cmgnqlveog vjbtrtgdsx2, nblmdziyqf1 is cmgnqlveog viwwhelzlg1 and nblmdziyqf2 can be found located cmgnqlveog viwwhelzlg2.

[QUESTION]

Given the initial condition, the following actions are performed: from viwwhelzlg1, nblmdziyqf1 is jzmscukkyy to viwwhelzlg0 to reach the current state. Qxkqlxlezx this state, is it True or False that the following properties of the state are correct for tzrwjuotxz4: tzrwjuotxz4 is clear of any pjrluufopq?

Just provide your answer as TRUE/FALSE.

[ANSWER]:
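The ramification constraints in the domain description above ("is clear", "is available") can be read as properties derived from the underlying relations. The following is a minimal sketch of that reading, applied to [EXAMPLE 1]. The data layout, the function names, and the assumption that the lifting action removes the lifted item from its previous surface are illustrative choices, not the benchmark's implementation.

```python
# Illustrative sketch only; the set-of-pairs layout is an assumption.
# (item, thing_it_rests_on) facts taken from [EXAMPLE 1]'s initial conditions.
on = {
    ("pjrluufopq0", "tzrwjuotxz0"),
    ("pjrluufopq1", "pjrluufopq0"),
    ("pjrluufopq2", "tzrwjuotxz1"),
    ("pjrluufopq3", "pjrluufopq1"),
    ("pjrluufopq4", "tzrwjuotxz4"),
}
holding = set()  # (suhmddooyi, pjrluufopq) pairs; every suhmddooyi starts available


def surface_is_clear(surface, on_facts):
    """Derived property: a surface is clear iff nothing rests on top of it."""
    return all(below != surface for _, below in on_facts)


def is_available(holder, holding_facts):
    """Derived property: a holder is available iff it holds nothing."""
    return all(h != holder for h, _ in holding_facts)


def lift(holder, item, surface, on_facts, holding_facts):
    """Assumed direct effects of lifting: the item leaves its surface and is held."""
    return on_facts - {(item, surface)}, holding_facts | {(holder, item)}


on_after, holding_after = lift(
    "suhmddooyi1", "pjrluufopq2", "tzrwjuotxz1", on, holding
)

# The question in [EXAMPLE 1]: is "tzrwjuotxz4 is not clear" true in the new state?
answer = not surface_is_clear("tzrwjuotxz4", on_after)
```

Under these assumptions `answer` evaluates to `True`, matching the answer given in [EXAMPLE 1], because pjrluufopq4 still rests on tzrwjuotxz4 after the action.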

K.3.2 Few Shot 5

[DOMAIN DESCRIPTION]

{domain description}

[EXAMPLE 1]

[INITIAL CONDITIONS]

{initial conditions}

[QUESTION]

{question}

[ANSWER]:

{answer}

[EXAMPLE 5]

[INITIAL CONDITIONS]

{initial conditions}

[QUESTION]

{question}

[ANSWER]:

{answer}

Based fpqyitbzqq the above examples, answer the below question:

[INITIAL CONDITIONS]

[QUESTION]

Just provide your answer as TRUE/FALSE.

[ANSWER]:
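The template above can be filled programmatically. The sketch below shows one way to assemble such a prompt; the function name, the dictionary keys, and the blank-line separator are assumptions made for illustration, and only the bracketed section labels and the instruction lines are taken from the listing.

```python
# Instruction line copied verbatim from the obfuscated listing.
INSTRUCTION = "Based fpqyitbzqq the above examples, answer the below question:"
ANSWER_FORMAT = "Just provide your answer as TRUE/FALSE."


def build_few_shot_prompt(domain_description, examples, initial_conditions, question):
    """Assemble a few-shot prompt.

    `examples` is a list of dicts with the (hypothetical) keys
    "initial_conditions", "question", and "answer".
    """
    parts = ["[DOMAIN DESCRIPTION]", domain_description]
    for index, example in enumerate(examples, start=1):
        parts += [
            f"[EXAMPLE {index}]",
            "[INITIAL CONDITIONS]",
            example["initial_conditions"],
            "[QUESTION]",
            example["question"],
            "[ANSWER]:",
            example["answer"],
        ]
    # The target question comes last and leaves the answer slot open.
    parts += [
        INSTRUCTION,
        "[INITIAL CONDITIONS]",
        initial_conditions,
        "[QUESTION]",
        question,
        ANSWER_FORMAT,
        "[ANSWER]:",
    ]
    return "\n\n".join(parts)


# Placeholder content only; no benchmark data is reproduced here.
demo_examples = [
    {
        "initial_conditions": "{initial conditions}",
        "question": "{question}",
        "answer": "{answer}",
    }
    for _ in range(5)
]
demo_prompt = build_few_shot_prompt(
    "{domain description}", demo_examples, "{initial conditions}", "{question}"
)
```

Passing a single-element list for `examples` yields the Few Shot 1 layout, and five elements yield the Few Shot 5 layout.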

K.4 With Obfuscations, Without ramifications

K.4.1 Few Shot 1

[DOMAIN DESCRIPTION]

A fshxjwxean can be a tzrwjuotxz or a pjrluufopq. A nblmdziyqf can be jzmscukkyy from one location to another only if it is present cmgnqlveog the first location. Jzmscukkyy the nblmdziyqf makes the nblmdziyqf to be present cmgnqlveog the final destination and not cmgnqlveog the starting location. A suhmddooyi can aeaygzpsjc a pjrluufopq from a fshxjwxean only when the suhmddooyi and the pjrluufopq are present qxkqlxlezx the same location, the suhmddooyi is available, the pjrluufopq is present fpqyitbzqq the fshxjwxean, and the pjrluufopq is clear. Aeaygzpsjc causes the suhmddooyi to aeaygzpsjc the pjrluufopq, the pjrluufopq to be not qxkqlxlezx that location, not clear, and not fpqyitbzqq top of the previous fshxjwxean, which is now clear. Moreover, aeaygzpsjc the pjrluufopq causes the suhmddooyi to be not available. Uckhudtpif the pjrluufopq is possible only if the suhmddooyi, pjrluufopq, and fshxjwxean are present qxkqlxlezx the same location, the suhmddooyi is aeaygzpsjc the pjrluufopq, and the fshxjwxean is clear. Uckhudtpif the pjrluufopq causes it to be fpqyitbzqq top of the fshxjwxean, be present qxkqlxlezx the location where it was uckhudtpif, and be clear. It also causes the suhmddooyi to become available, and not aeaygzpsjc the pjrluufopq. The fshxjwxean fpqyitbzqq which the pjrluufopq was uckhudtpif becomes not clear. Gjqgfjtbnf the pjrluufopq fpqyitbzqq a nblmdziyqf can be executed only when the pjrluufopq and nblmdziyqf are present qxkqlxlezx the same location, and the suhmddooyi is aeaygzpsjc the pjrluufopq. Gjqgfjtbnf the pjrluufopq onto the nblmdziyqf causes the suhmddooyi to be available and not aeaygzpsjc the pjrluufopq, and pjrluufopq to be qxkqlxlezx the nblmdziyqf. Gpztfzvsux the pjrluufopq from the nblmdziyqf can be done only if the suhmddooyi and nblmdziyqf are qxkqlxlezx the same place, the suhmddooyi is available, and the pjrluufopq is present qxkqlxlezx the nblmdziyqf. 
Gpztfzvsux the pjrluufopq from the nblmdziyqf causes the pjrluufopq to be not qxkqlxlezx the nblmdziyqf, and the suhmddooyi to be aeaygzpsjc the pjrluufopq and not available.

A state is a set of valid properties. Properties may or may not involve negations. Properties of the state can be of 4 types: base, derived, persistent, and static. Base properties of the state are properties that don’t depend fpqyitbzqq other properties. There are no base properties of the state qxkqlxlezx this domain. Derived properties of the state are properties that depend fpqyitbzqq other properties. Qxkqlxlezx this domain, they are: a fshxjwxean is clear, suhmddooyi is available, a fshxjwxean is not clear and suhmddooyi is not available. Self constraint properties of the state are properties that depend fpqyitbzqq themselves. Qxkqlxlezx this domain, they are: cmgnqlveog, fpqyitbzqq, qxkqlxlezx, a suhmddooyi is aeaygzpsjc a pjrluufopq, not cmgnqlveog, not fpqyitbzqq, not qxkqlxlezx and a suhmddooyi is not aeaygzpsjc a pjrluufopq. Static properties of the state are properties that don’t change under any action. There are no static properties of the state qxkqlxlezx this domain.

[EXAMPLE 1]

[INITIAL CONDITIONS]

[QUESTION]

Given the initial condition, the following actions are performed: pjrluufopq2 is aeaygzpsjc from tzrwjuotxz1 cmgnqlveog viwwhelzlg1 by suhmddooyi1 to reach the current state. Qxkqlxlezx this state, is it True or False that the following properties of the state are correct for tzrwjuotxz2: tzrwjuotxz2 is not clear?

Just provide your answer as TRUE/FALSE.

[ANSWER]: False

Based fpqyitbzqq the above examples, answer the below question:

[INITIAL CONDITIONS]

[QUESTION]

Just provide your answer as TRUE/FALSE.

[ANSWER]:

K.4.2 Few Shot 5

[DOMAIN DESCRIPTION]

{domain description}

[EXAMPLE 1]

[INITIAL CONDITIONS]

{initial conditions}

[QUESTION]

{question}

[ANSWER]:

{answer}

[EXAMPLE 5]

[INITIAL CONDITIONS]

{initial conditions}

[QUESTION]

{question}

[ANSWER]:

{answer}

Based fpqyitbzqq the above examples, answer the below question:

[INITIAL CONDITIONS]

[QUESTION]

Just provide your answer as TRUE/FALSE.

[ANSWER]:
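The obfuscated prompts in this section replace domain words with ten-letter strings, reuse one string for every form of a word, keep numeric suffixes on object names, and keep sentence-initial capitalization. A minimal sketch of a substitution with those properties follows. It is an assumption about how such text could be produced, not the authors' procedure, and the vocabulary (`widget`, `moved`) is hypothetical.

```python
import random
import re
import string


def make_token(rng, length=10):
    """Return a random lowercase string of the given length."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def build_mapping(word_groups, seed=0):
    """Give every form within one group of related word forms the same token."""
    rng = random.Random(seed)
    mapping = {}
    for group in word_groups:
        token = make_token(rng)
        for form in group:
            mapping[form.lower()] = token
    return mapping


def obfuscate(text, mapping):
    """Replace whole words (optionally followed by digits) with their tokens."""
    if not mapping:
        return text
    # Longest forms first so that "widgets" is tried before "widget".
    forms = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(form) for form in forms) + r")(?=\d*\b)",
        re.IGNORECASE,
    )

    def swap(match):
        word = match.group(0)
        token = mapping[word.lower()]
        # Preserve capitalization of the first letter.
        return token.capitalize() if word[0].isupper() else token

    return pattern.sub(swap, text)


# Hypothetical vocabulary, used purely for illustration.
mapping = build_mapping([["widget", "widgets"], ["moved", "moving"]])
obfuscated = obfuscate("Moving widget0 makes the widget present.", mapping)
```

Words outside the mapping, such as "makes" and "present", pass through unchanged, which mirrors how ordinary connective text survives in the listings above.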