ActionReasoningBench: Reasoning about Actions with and without Ramification Constraints
Abstract
Reasoning about actions and change (RAC) has historically given rise to many early AI challenges, such as the frame problem, and has driven many AI disciplines, including non-monotonic and commonsense reasoning. The role of RAC remains important even now, particularly for tasks involving dynamic environments, interactive scenarios, and commonsense reasoning. Despite the progress of Large Language Models (LLMs) in various AI domains, their performance on RAC is underexplored. To address this gap, we introduce a new benchmark, ActionReasoningBench, encompassing 13 domains and rigorously evaluating LLMs across eight different areas of RAC: Object Tracking, Fluent Tracking, State Tracking, Action Executability, Effects of Actions, Numerical RAC, Hallucination Detection, and Composite Questions. Furthermore, we investigate the indirect effects of actions due to ramification constraints for every domain. Finally, we evaluate our benchmark using open-source and commercial state-of-the-art LLMs, including GPT-4o, Gemini-1.0-Pro, Llama2-7b-chat, Llama2-13b-chat, Llama3-8b-instruct, Gemma-2b-instruct, and Gemma-7b-instruct. Our findings indicate that these models face significant challenges across all categories included in our benchmark.
1 Introduction
Reasoning about actions and change (RAC) is a cornerstone of artificial intelligence, tracing back to foundational work in the early 1960s (McCarthy et al., 1963). In the early days, the primary focus was on developing logical systems to reason about actions and changes in the world. One of the significant challenges was to succinctly express the effects of an action on changeable properties of the world, known as fluents. For instance, consider the statement: “Moving an object from location X to location Y will result in the object being at location Y”. While direct effects on the affected fluents could be expressed, formalizing the impact on unaffected fluents posed a significant challenge, referred to as the “frame problem”. For example, “moving a block from the table to the chair does not affect other objects on the table”. This challenge became even more complex when the descriptions involved relationships between fluents, leading to indirect effects or ramifications of actions. An example of this is the constraint “a block can’t be at two places at the same time”, which implies that moving a block from position X to position Y will result in the block no longer being at X.
It took multiple decades of research to create a comprehensive logical formalization that adequately addressed these issues. This resulted in the labor-intensive development of numerous handcrafted rules and logics that detail the effects and preconditions of actions (Reiter, 2001). Considering the importance of RAC in several reasoning tasks, it is no surprise that in recent years, the natural language processing (NLP) community has shown an interest in this area (He et al., 2023) (Spiliopoulou et al., 2022) (Banerjee et al., 2020). Moreover, LLM-based agents that perform complex decision tasks also involve actions (Kohli and Sun, 2024) (Zhou et al., 2024) (Zhao et al., 2024).
Translating rules into natural language reduces the manual effort of writing every rule in logic programming. Despite the benefits, the inherent complexities of language and the requirement to follow long reasoning chains pose substantial challenges for LLMs (Wang et al., 2024) (Mu et al., 2023). LLMs have been tested extensively on several reasoning domains, including planning (Valmeekam et al., 2024a) (Valmeekam et al., 2024b), commonsense reasoning (Sakaguchi et al., 2021) (Talmor et al., 2019) (Zhang et al., 2018), mathematical reasoning (Imani et al., 2023) (Ahn et al., 2024), temporal reasoning (Tan et al., 2023) (Aghzal et al., 2023), and logical reasoning (Parmar et al., 2024a) (Luo et al., 2024). However, despite the crucial role of RAC, the performance of LLMs on RAC remains heavily under-explored. This study seeks to fill this gap by introducing a new benchmark, ActionReasoningBench, that pinpoints where LLMs struggle in RAC.
In our work, we utilize 13 domains from the International Planning Competition (IPC) (International Conference on Automated Planning and Scheduling, 1998) to create ActionReasoningBench, with action-sequences varying in length from 1 to 19. ActionReasoningBench is organized into 8 distinct categories: Object Tracking, Fluent Tracking, State Tracking, Action Executability, Effects of Actions, Numerical RAC, Hallucination Detection, and Composite Questions (category details are in Section 3.1). Each category evaluates a specific aspect of RAC. Additionally, we formulate ramification constraints for each domain, which introduce indirect effects of actions, thereby increasing the complexity (McIlraith, 2000) and enhancing the evaluation of LLMs' reasoning capabilities. Lastly, we create an obfuscated variant of ActionReasoningBench that replaces the English names of objects and properties with strings of randomly generated characters. This obfuscated variant assesses whether the LLMs understand the rules described in the context or simply rely on their pre-training parametric knowledge.
Highlights of our benchmark, ActionReasoningBench, along with the comparison to previous benchmarks on RAC, are presented in table 1. We evaluate five open-source LLMs, encompassing three different LLM families, as well as two leading proprietary LLMs, GPT-4o (Achiam et al., 2023) and Gemini-1.0-Pro (Team et al., 2023). We conduct a comprehensive analysis of different categories in RAC along with the impact of increasing model size and various few-shot settings to observe the variations in the performance. This multifaceted approach not only highlights the capacities and limitations of LLMs in handling complex RAC tasks but also sheds light on their potential utility (Sharma, 2019) (Sap et al., 2019) and applicability in practical LLM-based agents (Yao et al., 2022) (Aksitov et al., 2023). Our evaluation reveals that LLMs struggle to comprehend State Tracking, Effects of Actions, Numerical RAC, and Composite Questions, with their performance decreasing even further as the length of action sequence increases. Furthermore, LLMs exhibit difficulty in reasoning when queried about negative fluents. We also investigate the effect of ramification on these categories, analyzing the varied responses to ramification within each category. The detailed results of our findings are discussed in Section 5.
2 Related Works
Reasoning about Actions and Change
Our work builds on an existing body of RAC literature. Works by He et al. (2023) and Banerjee et al. (2020) focus on creating RAC datasets from the IPC and evaluating LLMs. Spiliopoulou et al. (2022) explore RAC within the scope of real-world physical attributes, intersecting with commonsense reasoning. RAC has also been investigated using multi-modal systems (Sampat et al., 2022a), (Sampat et al., 2022b).
Planning
Data Creation from Logic Programs
3 About ActionReasoningBench
3.1 Question Categories
The questions are classified into eight different categories as follows. Refer to Appendix I for examples of questions and prompts for each question category.
1. Object Tracking - This category contains questions on the status of a particular object that is present in the domain. For example, in the Blocksworld domain, an object-tracking question might be "Is block b3 on the table and not clear?"
2. Effects of Actions - This category contains questions that inquire about the effect of taking a given action. For example, in the Mystery domain, an effects-of-actions question can be "From the current state, the vehicle v0 moves from location l1 to l0, and has fuel-level f6 and f5, which properties of the state will be true now?"
3. Fluent Tracking - This category contains questions about the fluents, i.e. properties of the domain, that are true or false for a given object. For instance, in the NPuzzle domain, a fluent-tracking question might be "List all the neighbors of tile t_3."
4. State Tracking - This category encompasses and extends the Fluent Tracking category. It involves querying about all the fluents present in the domain that are true or false at a given moment. For instance, in the Goldminer domain, a state-tracking question can be "What are all the valid properties in this state?"
5. Action Executability - This category contains questions regarding the executability of a particular action in the given state. For instance, in the Miconic domain, an action-executability question can be "List all executable actions present in the current state."
6. Numerical RAC - Questions requiring a numerical response fall under this category. The questions can be from any of the 5 categories listed above. For example, in the Spanner domain, a Numerical-RAC question can be "What is the number of executable actions in the current state?"
7. Hallucination Detection - This category includes questions that require identifying objects, actions, or fluents (i.e. properties of the state) that are not mentioned in the domain description and are entirely fabricated. For example, in the Depots domain, a hallucination-detection question can be "Given the following properties of the state, which one is not defined. Write None if all are defined. Crate2 is not on pallet4, crate1 is in truck2, crate3 is next to truck3."
8. Composite Question - This category contains questions combining the above-mentioned categories. These questions require multiple reasoning steps as they combine up to 3 different categories. For example, in the Satellite domain, a composite question can be "What are the derived properties of the state for satellite0 before the first infeasible action in the sequence? Write None if there are none."
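To make the Action Executability and Effects of Actions categories concrete, the following is a minimal sketch of a STRIPS-style state transition, not the benchmark's actual implementation. The `Action` class, the tuple encoding of fluents, and the helper names are assumptions introduced for illustration only.

```python
from dataclasses import dataclass

# A state is a frozenset of ground fluents; each fluent is a tuple
# such as ("at", "b1", "table"). This encoding is an assumption.

@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset  # fluents that must hold before the action
    add_effects: frozenset    # fluents made true by the action
    del_effects: frozenset    # fluents made false by the action

def is_executable(state, action):
    """Action Executability: every precondition holds in the state."""
    return action.preconditions <= state

def apply_action(state, action):
    """Effects of Actions: fluents not mentioned stay unchanged."""
    if not is_executable(state, action):
        raise ValueError(f"{action.name} is not executable in this state")
    return (state - action.del_effects) | action.add_effects

def first_inexecutable(state, plan):
    """Index of the first inexecutable action in a sequence, or None."""
    for index, action in enumerate(plan):
        if not is_executable(state, action):
            return index
        state = apply_action(state, action)
    return None

state = frozenset({("at", "b1", "table"), ("at", "b2", "table")})
move = Action(
    name="move(b1, table, chair)",
    preconditions=frozenset({("at", "b1", "table")}),
    add_effects=frozenset({("at", "b1", "chair")}),
    del_effects=frozenset({("at", "b1", "table")}),
)
new_state = apply_action(state, move)
# b2 is untouched; repeating the move fails because b1 has left the table.
print(sorted(new_state))
print(first_inexecutable(state, [move, move]))  # 1
```

The helper `first_inexecutable` mirrors the kind of multi-step reasoning a composite question asks for, where the state must be tracked up to the first infeasible action.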
3.2 Question Subcategories: Fluents and Static Properties
We further divide the question categories, Object Tracking, Effects of Actions, and Fluent Tracking into 4 subcategories depending on the fluent type as these categories fundamentally pertain to questions regarding fluents.
1. Base Fluents - Fluents that are not dependent on any other fluent and can change due to an action are known as base fluents. For instance, bomb_at is a base fluent in the Goldminer domain, which defines whether a bomb is at a specific location or not.
2. Base Fluents with Constraints - These fluents are subject to a type of ramification constraint that depends only on the fluent itself rather than on other fluents. For example, the fluent at in the Depots domain depends only on itself, i.e., if a truck is at location l0, it can’t be at location l1. We include these constraints in the domain description, such as "A truck can be only at one location at a time."
3. Derived Fluents - Fluents that depend on other fluents, reflecting a level of dependency and interaction, are known as derived fluents. They also constitute a part of the ramification constraints. For example, the fluent free in the Grippers domain is a derived fluent as it directly depends on the fluent carry. The relationship is a consequence of the following constraint: A robot’s gripper is said to be free if the robot is not carrying any object with its gripper.
4. Static Properties - Properties that do not change throughout, irrespective of any action taken, are known as static properties. For example, the property connected in the Visitall domain represents whether two locations are connected or not. The actions can move the robot from one location to another, but the locations will always be connected regardless of any action.
For examples of questions highlighting the above-mentioned fluents, refer to Appendix J. Finally, we create questions with negative fluents for every fluent type, which allows us to evaluate the understanding of LLMs for negation present in RAC.
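The distinction between base and derived fluents can be sketched in a few lines. The snippet below is an illustrative assumption, not the paper's ASP encoding: it recomputes the derived fluent free from the base fluent carry in a Grippers-like setting, and lists negative fluents under a closed-world reading. The tuple layout `("carry", object, gripper)` and both function names are hypothetical.

```python
def derive_free(base_state, grippers):
    """Derived fluent: a gripper is free if it carries no object.

    base_state holds base fluents such as ("carry", "ball1", "left").
    """
    busy = {fluent[2] for fluent in base_state if fluent[0] == "carry"}
    return {("free", gripper) for gripper in grippers if gripper not in busy}

def negative_fluents(all_ground_fluents, state):
    """Negative fluents: ground fluents that do not hold in the state."""
    return set(all_ground_fluents) - set(state)

grippers = ["left", "right"]
holding = {("carry", "ball1", "left")}

# The indirect effect of picking up ball1 is that "left" stops being free.
print(derive_free(holding, grippers))         # {('free', 'right')}

# After dropping the ball there is no carry fluent, so both are free again.
print(sorted(derive_free(set(), grippers)))   # [('free', 'left'), ('free', 'right')]

universe = {("free", "left"), ("free", "right")}
print(negative_fluents(universe, derive_free(holding, grippers)))  # {('free', 'left')}
```

Because free is never set directly by an action in this sketch, answering a question about it requires first tracking carry, which is the extra reasoning step that derived fluents are meant to probe.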
3.3 Dataset Structure and Variations
Selected Domains
ActionReasoningBench (code and data available at https://github.com/izuminka/reasoning_about_actions) comprises 13 domains, handpicked from the IPC, which offers benchmarks for state-of-the-art planning systems to facilitate research in planning. We focused on collecting deterministic domains, where every action has deterministic preconditions and effects. Since the IPC contains many domains involving transportation, we divided the 13 domains into 7 transportation and 6 non-transportation domains. More details about the domains are present in Appendix F.17.
Domain Descriptions with and without Ramification Constraints
Ramification constraints describe the indirect effects of an action. Effective reasoning about these constraints is essential for a robust AI system capable of predicting and reasoning about the outcomes of actions. For instance, in the Goldminer domain, the robot’s arm is said to be empty if it is not holding a laser, stone, or gold. Examples of domains with and without ramification constraints are in Appendix H.1 and H.2, respectively.
Action-Sequence Lengths
We generate questions with action-sequences of length 1, 5, 10, 15, and 19 to verify the action-following capabilities of LLMs.
Obfuscations
Since the IPC data is publicly available and widely discussed online, the pre-training data of LLMs might overlap with these domains. To assess the reasoning capabilities of the models, we obfuscate the objects, fluents, and actions present in the domain by replacing their names with randomly generated strings. This forces the LLM to rely on the user’s description rather than its parametric knowledge. Examples of obfuscated prompts are presented in Appendix K.4.
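One simple way to realize such an obfuscation is sketched below. It is a hedged illustration under stated assumptions, not the benchmark's own code: the function names, the token length of 8, and the use of lowercase letters are all hypothetical choices.

```python
import random
import re
import string

def build_obfuscation_map(names, length=8, seed=0):
    """Assign each name a distinct random lowercase string (seeded)."""
    rng = random.Random(seed)
    mapping, used = {}, set()
    for name in names:
        token = "".join(rng.choices(string.ascii_lowercase, k=length))
        while token in used:  # redraw on the rare collision
            token = "".join(rng.choices(string.ascii_lowercase, k=length))
        used.add(token)
        mapping[name] = token
    return mapping

def obfuscate(text, mapping):
    """Replace whole-word occurrences of every mapped name in one pass."""
    if not mapping:
        return text
    # Longer names first, so "fuel-level" is not shadowed by "fuel".
    ordered = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, ordered)) + r")\b")
    return pattern.sub(lambda match: mapping[match.group(1)], text)

mapping = build_obfuscation_map(["b3", "table", "clear"])
prompt = "Is block b3 on the table and not clear?"
obfuscated = obfuscate(prompt, mapping)
print(obfuscated)
```

Replacing all names in a single regular-expression pass avoids accidentally rewriting a token that was itself produced by an earlier replacement, and the fixed seed keeps the mapping consistent between the domain description and the question.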
Answer Types
For all the categories mentioned in the previous section, we formulate two different types of questions based on the nature of the answer. First, a simple binary question, where the answer is either True or False. Second, we ask a subjective question, where the answer can be multiple objects, actions, or properties of the state. Examples of answer types are presented in Appendix I.
3.4 Data Creation
The question creation pipeline comprises four stages, as illustrated in Figure 1. The selected domain in the IPC is represented in the form of PDDL (refer to Appendix F.19 for examples). This PDDL is used to generate 10 initial and goal conditions. Subsequently, we utilize the PDDL solver and validator to obtain and validate the action-sequences necessary to reach the goal state. Next, we convert the PDDL domains and instances into ASP descriptions via templates. In the third stage, we employ ASP solvers to generate the action-state space and extract the fluents for each state along with all executable and inexecutable actions. Finally, the state-action data is converted into questions using a Python-based template. We introduce up to five natural language variants for every object, action, and fluent to enhance the linguistic diversity of the dataset. Additionally, we manually translate the domain descriptions from PDDL to natural language and incorporate ramification constraints into these domains.
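The final, template-based stage can be illustrated with a small sketch. The template table, the question wording, and the function names below are assumptions made for illustration; the paper's actual templates are in its repository and appendices.

```python
import random

# Hypothetical templates: fluent name -> (positive variants, negative variants).
TEMPLATES = {
    "on_table": (
        ["block {0} is on the table", "block {0} is located on the table"],
        ["block {0} is not on the table", "block {0} is not located on the table"],
    ),
    "clear": (
        ["block {0} is clear", "nothing is on top of block {0}"],
        ["block {0} is not clear", "something is on top of block {0}"],
    ),
}

def render(fluent, negated, rng):
    """Pick one natural-language variant for a (possibly negated) fluent."""
    name, *args = fluent
    positive, negative = TEMPLATES[name]
    return rng.choice(negative if negated else positive).format(*args)

def make_binary_question(state, fluent, negated, rng):
    """Build a True/False question and its ground-truth answer.

    The statement is true when the fluent holds and is asked positively,
    or when it does not hold and is asked in negated form.
    """
    statement = render(fluent, negated, rng)
    answer = (fluent in state) != negated
    return f"Is it True or False that {statement}?", answer

rng = random.Random(0)
state = {("on_table", "b3")}  # fluents extracted for one state

question, answer = make_binary_question(state, ("on_table", "b3"), False, rng)
print(question, answer)  # answer is True
question_neg, answer_neg = make_binary_question(state, ("clear", "b3"), True, rng)
print(question_neg, answer_neg)  # answer is True: b3 is not clear here
```

Because the ground truth comes from the solver-generated state rather than from the text, the answer stays correct whichever wording variant is sampled.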
3.5 Data Validation
The question-generation process utilizes traditional deterministic planners to accurately create the state space, ensuring action-sequences’ correctness and their effects. This state space is then transformed by a natural language converter, followed by manual validation by three independent annotators on a small subset of the data from every domain to ensure the data quality. Each question is scored on a scale from 1 to 5, assessing the soundness and comprehensibility of the natural language, resulting in an average score of 4.215. Additional details on data validation can be found in Appendix B. Furthermore, the domain description is reviewed by two domain experts to verify its accurate conversion from PDDL.
3.6 Data Splits
We divided the benchmark into two splits, one for training and the other for testing the models. Table 2 presents the breakdown of categories in the training and testing splits. Since multiple subcategories defined in section 3.2 are present in categories effects, fluent tracking, and object tracking, these categories dominate the training split of the dataset. For the test split, we ensured a balance in terms of action-sequence length, question categories, fluent subcategories, answer type (with equal proportions of true, false, and free-answer), and across all domains.
4 Experiments and Evaluation
We evaluate our benchmark, ActionReasoningBench, using 7 different LLMs with 4 different prompt settings. These evaluations encompass all combinations of varying action-sequence lengths, with and without ramification constraints and with and without obfuscation. Our evaluation uses 2 proprietary state-of-the-art LLMs, GPT-4o (Achiam et al., 2023) and Gemini-1.0-Pro (Team et al., 2023), alongside 5 open-source LLMs, Gemma-2b-instruct, Gemma-7b-instruct (Team et al., 2024), Llama2-7b-chat, Llama2-13b-chat (Touvron et al., 2023), and Llama3-8b-instruct (AI, 2024). Each LLM is tested with zero-shot, few-shot-1, few-shot-3, and few-shot-5 prompting techniques. The results are presented for binary (true/false) questions and free-form answers. We further fine-tuned Llama3-8b-instruct using the training data split described in Section 3.6. Given the limited context length of the open-source LLMs, we excluded examples exceeding the context length of 4096 tokens. Fine-tuning was performed separately for free-answer and binary questions. Refer to appendix D to see the fine-tuning details.
The evaluation of binary and free-form answers is conducted independently. For binary answers, we extracted the "true" and "false" keywords from the response and compared them against the ground truth. Since free-form answer evaluation can’t rely on exact string matching, human evaluation was employed for GPT-4o and Gemini-1.0-Pro. We further fine-tune RoBERTa (Liu et al., 2019) to evaluate all models. Further details regarding the evaluation process are provided in Appendix C.2 and E.1. We additionally report the standard error of the mean (SEM), defined as $\mathrm{SEM} = \sigma / \sqrt{n}$, where $\sigma$ is the standard deviation and $n$ is the number of samples, for all conducted experiments. All the experiments are done on 8 A100 GPUs.
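The binary evaluation and the SEM computation can be sketched as follows. This is a minimal illustration under assumptions: the handling of ambiguous responses (both or neither keyword present) and the use of the sample standard deviation are choices made here, since the text does not specify them.

```python
import math
import re
import statistics

def extract_binary(response):
    """Return True/False if exactly one of the keywords appears, else None."""
    found = {m.lower() for m in re.findall(r"\b(true|false)\b", response, re.IGNORECASE)}
    if len(found) == 1:
        return found.pop() == "true"
    return None  # assumption: ambiguous or missing answers count as wrong

def accuracy_with_sem(predictions, gold):
    """Mean accuracy and SEM = sigma / sqrt(n) over per-question scores."""
    scores = [1.0 if p == g else 0.0 for p, g in zip(predictions, gold)]
    n = len(scores)
    mean = statistics.fmean(scores)
    # Assumption: sigma is the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(n) if n > 1 else 0.0
    return mean, sem

responses = [
    "The answer is True.",
    "False, because b3 is not clear.",
    "It could be true or false.",
    "TRUE",
]
gold = [True, True, False, True]
predictions = [extract_binary(r) for r in responses]
mean, sem = accuracy_with_sem(predictions, gold)
print(predictions)       # [True, False, None, True]
print(mean)              # 0.5
print(round(sem, 4))     # 0.2887
```

With four questions and two correct answers, the per-question scores are 1, 0, 0, 1, giving a sample standard deviation of about 0.577 and hence an SEM of about 0.289.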
5 Results and Discussion
In this section, we highlight the results and analyses obtained with our benchmark, ActionReasoningBench, using the experimental setup defined in Section 4. Table 3 summarizes the performance of all LLMs on the binary questions (AVG denotes the average calculated across all categories for a given action-sequence length and model). Table 4 summarizes the results of the free-answer human evaluation. Zero-shot results, free-answer evaluation with the RoBERTa classifier, and results for obfuscation-ramification combinations are presented in Appendix C. We present additional observations in Appendix C.3.
Performance of Smaller Open-Source LLMs Approaches Random
Table 3 reveals that smaller models, specifically Gemma-2b and Llama2-7b, consistently exhibit poor performance across all action-sequence lengths, with results often approaching random chance levels. This trend highlights the challenges these LLMs encounter in comprehending and responding to the complexities of the tasks. Additionally, further analysis reveals that Llama2-13b’s performance is similarly close to random (50%) in the few-shot-1 setting. There is a slight improvement with few-shot-5, as seen in Table 5; however, it remains only marginally above random. This incremental improvement indicates some level of adaptation or learning within specific frameworks, but overall, Llama2-13b struggles to achieve meaningful accuracy. These findings suggest that smaller open-source models are incapable of effectively performing RAC.
Fine-Tuning Improves Some Categories but Deteriorates Others
After fine-tuning Llama3-8b, we observed improvements in categories object tracking and fluent tracking. Conversely, there was a decline in performance in the categories effects of actions and hallucination detection. These results indicate that merely fine-tuning LLMs with more data might not address all the challenges in RAC.
Obfuscation Lowers the Performance Across All Dimensions
Our analysis, comparing Table 3 (without obfuscations) with Table 10 (with obfuscations), reveals that introducing obfuscations leads to a decrease in performance for all models across all question categories, action-sequence lengths, fluent types, and domain types. This decrease highlights the models’ reliance on knowledge acquired through pre-training data rather than on comprehending the rules given in the prompt context.
Numerical Reasoning is Near Random for All Action-sequence Lengths
An analysis of Figure 2 reveals that numerical reasoning tasks consistently reach around 50% accuracy across both transportation and non-transportation domains. This level of accuracy indicates a fundamental limitation in the models’ ability to numerically reason about actions. This inability to perform numerical reasoning is not confined to RAC but is also reflected in broader contexts (Ahn et al., 2024).
5.0.1 Increasing Few-Shot Examples Helps Domain Understanding but Not Reasoning
An analysis of Gemini-1.0-Pro, presented in Table 5, reveals specific performance trends across different evaluation categories. Notably, performance improvements surpassing the SEM are primarily observed within the effects and hallucination detection categories. The accuracy in these categories heavily relies on the model’s ability to understand the consequences of actions and accurately identify hallucinated fluents, objects, and actions. With additional examples, Gemini exhibits performance gains in these categories. Specifically, there is an increase of 10 percentage points in the effects category and 16 percentage points in hallucination detection. This indicates that the model’s understanding of domain-specific dynamics and its ability to detect non-existent entities improves considerably with increased exposure to relevant examples. Conversely, introducing additional examples does not yield similar benefits across all categories. This stagnation suggests a limitation in the model’s current architecture or training regimen, which fails to enhance its reasoning capabilities despite additional examples.
5.1 GPT-4o and Gemini-1.0-Pro Case Study
Due to the poor performance of open-source models, we focus on the two best-performing proprietary LLMs, GPT-4o and Gemini-1.0-Pro.
5.1.1 Observations on Action-Sequence Lengths
Performance Degrades with Increasing Action-Sequence Length (with and without obfuscations).
Action-Executability is Affected Most Drastically by Increasing Action-Sequences Length.
As shown in Fig 2, for proprietary LLMs, the action executability category exhibits the most significant decrease in performance, followed by the state-tracking category. Conversely, categories such as object tracking and effects of actions show minimal degradation as action-sequence length increases.
5.1.2 Observations on Fluent Types
[Table: results by fluent type for GPT-4o and Gemini-1.0-Pro. Columns per model: Baseline, Baseline + R., O. Baseline, O. Baseline + R. Rows: Base Fl, Base Fl + Cnstr., Derived Fl, Static Props, Positive Fluents, Negative Fluents.]
Base Fluents with Constraints Outperform Other Fluents
Among all the subcategories defined in section 3.2, base fluents with constraints consistently exhibit the best performance across all the categories and LLMs. This suggests that the models excel at understanding self-dependent properties rather than those that rely on other properties.
Performance Degradation with Increasing Action-Sequence Lengths
Figure 7 illustrates the performance trends across different fluent types as action-sequence length increases. While base fluents show the highest initial performance at length 1, this performance sharply declines as action-sequence length increases. In contrast, questions involving static properties consistently display the lowest performance across all action-sequence lengths. Meanwhile, questions with persistent fluents show stable performance, with no significant decrease observed as the action-sequence length increases.
Models Struggle with Negative Fluents
Table 6 shows that models perform better on questions involving positive fluents. A consistent performance gap of approximately 25 percentage points exists between positive and negative fluents across the base, static, and base with constraints subcategories. Performance on negative fluents is nearly random. For derived fluents, the gap between negative and positive fluents is negligible, with both achieving roughly 60% accuracy. These findings indicate that LLMs struggle with negated fluents. This aligns with the literature, which also reports weaker performance on tasks involving negation (Truong et al., 2023). Effective reasoning about negated fluents is crucial for accurately understanding state changes and determining the viability of actions. Failures in these areas, particularly in complex action-sequences, can lead to errors.
Models’ Inability to Recognize Null Effects of Actions on Static Properties
Figure 3 illustrates with statistical significance that model performance on static properties is hindered by their inability to comprehend the effects of actions (or lack thereof) on these properties. Specifically, models fail to recognize that certain state properties remain unchanged by any action, resulting in performance slightly below random chance levels. This misunderstanding causes considerable inaccuracies in tasks involving static properties, highlighting a fundamental limitation in current models’ ability to accurately interpret and predict states in dynamic contexts.
5.1.3 Observations on Free Answers vs True False
When comparing the performance of LLMs in Table 3 (binary questions) with that in Table 13 (free-answer questions), it is evident that the categories Fluent Tracking and Hallucination Detection, which performed well on binary questions, exhibit poor performance on free-answer questions. Conversely, the category Composite Questions shows improved performance when evaluated on free-answer questions. It can also be observed that the RoBERTa model struggles to classify the free-answer questions and is not able to evaluate them with human-level accuracy.
6 Conclusion
In this work, we introduce ActionReasoningBench, the first benchmark for evaluating LLMs across several aspects of RAC, including Object Tracking, Fluent Tracking, State Tracking, Action Executability, Effects of Actions, Numerical RAC, Hallucination Detection, and Composite Questions. We also introduce ramification constraints that account for the indirect effects of some actions. Using ActionReasoningBench, we find that only large models like GPT-4o and Gemini-1.0-Pro can perform some RAC tasks, achieving average performances of 42.5% and 32.5% on Free Answers, respectively. Furthermore, RAC performance degrades with increasing action-sequence length for the Action Executability category. Numerical reasoning performance is also poor across all LLMs and action-sequence lengths. Additionally, LLMs struggle with fluents, especially in understanding negative fluents and the effects of actions on static properties. Although ramification constraints can enhance performance on base fluents, they do not affect derived fluents. Finally, obfuscations lead to a decrease in performance across all dataset dimensions for all models. We hope that ActionReasoningBench will serve as a valuable benchmark for the research community, facilitating the assessment of LLMs in various aspects of RAC.
References
- (1) [2109.01653] CREAK: A Dataset for Commonsense Reasoning over Entity Knowledge. https://arxiv.longhoe.net/abs/2109.01653.
- (2) Differentiable Open-Ended Commonsense Reasoning - ACL Anthology. https://aclanthology.org/2021.naacl-main.366/.
- Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- Aghzal et al. (2023) Mohamed Aghzal, Erion Plaku, and Ziyu Yao. 2023. Can large language models be good path planners? a benchmark and investigation on spatial-temporal reasoning. arXiv preprint arXiv:2310.03249.
- Ahn et al. (2024) Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. 2024. Large language models for mathematical reasoning: Progresses and challenges. arXiv preprint arXiv:2402.00157.
- AI (2024) Meta AI. 2024. Introducing meta llama 3: The most capable openly available llm to date. https://ai.meta.com/blog/meta-llama-3/. Accessed: 2024-06-04.
- Aksitov et al. (2023) Renat Aksitov, Sobhan Miryoosefi, Zonglin Li, Daliang Li, Sheila Babayan, Kavya Kopparapu, Zachary Fisher, Ruiqi Guo, Sushant Prakash, Pranesh Srinivasan, et al. 2023. Rest meets react: Self-improvement for multi-step reasoning llm agent. arXiv preprint arXiv:2312.10003.
- Amini et al. (2019) Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms.
- Banerjee et al. (2020) Pratyay Banerjee, Chitta Baral, Man Luo, Arindam Mitra, Kuntal Pal, Tran C. Son, and Neeraj Varshney. 2020. Can Transformers Reason About Effects of Actions?
- Bisk et al. (2019) Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2019. PIQA: Reasoning about Physical Commonsense in Natural Language.
- Chen et al. (2022) Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, and William Yang Wang. 2022. FinQA: A Dataset of Numerical Reasoning over Financial Data.
- Clark et al. (2019) Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions.
- Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training Verifiers to Solve Math Word Problems.
- Fei et al. (2023) Zhiwei Fei, Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Songyang Zhang, Kai Chen, Zongwen Shen, and Jidong Ge. 2023. LawBench: Benchmarking Legal Knowledge of Large Language Models.
- Fikes and Nilsson (1971) Richard E Fikes and Nils J Nilsson. 1971. Strips: A new approach to the application of theorem proving to problem solving. Artificial intelligence, 2(3-4):189–208.
- Geva et al. (2021) Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. 2021. Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies.
- Guan et al. (2023) Lin Guan, Karthik Valmeekam, Sarath Sreedharan, and Subbarao Kambhampati. 2023. Leveraging pre-trained large language models to construct and utilize world models for model-based task planning. Advances in Neural Information Processing Systems, 36:79081–79094.
- Guha et al. (2023) Neel Guha, Julian Nyarko, Daniel Ho, Christopher Ré, Adam Chilton, Aditya K, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel Rockmore, Diego Zambrano, Dmitry Talisman, Enam Hoque, Faiz Surani, Frank Fagan, Galit Sarfaty, Gregory Dickinson, Haggai Porat, Jason Hegland, Jessica Wu, Joe Nudell, Joel Niklaus, John Nay, Jonathan Choi, Kevin Tobia, Margaret Hagan, Megan Ma, Michael Livermore, Nikon Rasumov-Rahe, Nils Holzenberger, Noam Kolt, Peter Henderson, Sean Rehaag, Sharad Goel, Shang Gao, Spencer Williams, Sunny Gandhi, Tom Zur, Varun Iyer, and Zehua Li. 2023. LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models. Advances in Neural Information Processing Systems, 36:44123–44279.
- Han et al. (2024) Simeng Han, Hailey Schoelkopf, Yilun Zhao, Zhenting Qi, Martin Riddell, Wenfei Zhou, James Coady, David Peng, Yujie Qiao, Luke Benson, Lucy Sun, Alex Wardle-Solano, Hannah Szabo, Ekaterina Zubova, Matthew Burtell, Jonathan Fan, Yixin Liu, Brian Wong, Malcolm Sailor, Ansong Ni, Linyong Nan, Jungo Kasai, Tao Yu, Rui Zhang, Alexander R. Fabbri, Wojciech Kryscinski, Semih Yavuz, Ye Liu, Xi Victoria Lin, Shafiq Joty, Yingbo Zhou, Caiming Xiong, Rex Ying, Arman Cohan, and Dragomir Radev. 2024. FOLIO: Natural Language Reasoning with First-Order Logic.
- Haslum et al. (2019) Patrik Haslum, Nir Lipovetzky, Daniele Magazzeni, Christian Muise, Ronald Brachman, Francesca Rossi, and Peter Stone. 2019. An introduction to the planning domain definition language, volume 13. Springer.
- He et al. (2023) Weinan He, Canming Huang, Zhanhao Xiao, and Yongmei Liu. 2023. Exploring the capacity of pretrained language models for reasoning about actions and change. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4629–4643.
- Huang and Chang (2022) Jie Huang and Kevin Chen-Chuan Chang. 2022. Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403.
- Huang et al. (2019) Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Cosmos QA: Machine Reading Comprehension with Contextual Commonsense Reasoning.
- Imani et al. (2023) Shima Imani, Liang Du, and Harsh Shrivastava. 2023. Mathprompter: Mathematical reasoning using large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), pages 37–42.
- International Conference on Automated Planning and Scheduling (1998) International Conference on Automated Planning and Scheduling. 1998. Icaps competitions. Accessed: 2024-04-25.
- Kohli and Sun (2024) Harsh Kohli and Huan Sun. 2024. Cleared for takeoff? compositional & conditional reasoning may be the achilles heel to (flight-booking) language agents. arXiv preprint arXiv:2404.04237.
- Koncel-Kedziorski et al. (2016) Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. 2016. MAWPS: A Math Word Problem Repository. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1152–1157, San Diego, California. Association for Computational Linguistics.
- Lin et al. (2021) Bill Yuchen Lin, Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Xiang Ren, and William W. Cohen. 2021. Differentiable Open-Ended Commonsense Reasoning.
- Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
- Ling et al. (2017) Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. Program Induction by Rationale Generation : Learning to Solve and Explain Algebraic Word Problems.
- Liu et al. (2023) Hanmeng Liu, Zhiyang Teng, Leyang Cui, Chaoli Zhang, Qiji Zhou, and Yue Zhang. 2023. LogiCoT: Logical Chain-of-Thought Instruction-Tuning.
- Liu et al. (2020) Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. 2020. LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning.
- Liu et al. (2022) Tianyu Liu, Yizhe Zhang, Chris Brockett, Yi Mao, Zhifang Sui, Weizhu Chen, and Bill Dolan. 2022. A Token-level Reference-free Hallucination Detection Benchmark for Free-form Text Generation.
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
- Lourie et al. (2021) Nicholas Lourie, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark.
- Luo et al. (2024) Man Luo, Shrinidhi Kumbhar, Ming Shen, Mihir Parmar, Neeraj Varshney, Pratyay Banerjee, Somak Aditya, and Chitta Baral. 2024. Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models.
- McCarthy et al. (1963) John McCarthy et al. 1963. Situations, actions, and causal laws. Comtex Scientific.
- McIlraith (2000) Sheila A McIlraith. 2000. Integrating actions and state constraints: A closed-form solution to the ramification problem (sometimes). Artificial Intelligence, 116(1-2):87–121.
- Miao et al. (2021) Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2021.
- Mihaylov et al. (2018) Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering.
- Mu et al. (2023) Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Dan Hendrycks, and David Wagner. 2023. Can llms follow simple rules? arXiv preprint arXiv:2311.04235.
- Parmar et al. (2024a) Mihir Parmar, Nisarg Patel, Neeraj Varshney, Mutsumi Nakamura, Man Luo, Santosh Mashetty, Arindam Mitra, and Chitta Baral. 2024a. Towards systematic evaluation of logical reasoning ability of large language models. arXiv preprint arXiv:2404.15522.
- Parmar et al. (2024b) Mihir Parmar, Nisarg Patel, Neeraj Varshney, Mutsumi Nakamura, Man Luo, Santosh Mashetty, Arindam Mitra, and Chitta Baral. 2024b. Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models.
- Patel et al. (2021) Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP Models really able to Solve Simple Math Word Problems?
- Reiter (2001) Raymond Reiter. 2001. Knowledge in action: logical foundations for specifying and implementing dynamical systems.
- Rintanen (2004) Jussi Rintanen. 2004. Complexity of planning with partial observability. In ICAPS, volume 4, pages 345–354.
- Sakaguchi et al. (2021) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106.
- Sampat et al. (2022a) Shailaja Keyur Sampat, Pratyay Banerjee, Yezhou Yang, and Chitta Baral. 2022a. Learning Action-Effect Dynamics for Hypothetical Vision-Language Reasoning Task.
- Sampat et al. (2022b) Shailaja Keyur Sampat, Maitreya Patel, Subhasish Das, Yezhou Yang, and Chitta Baral. 2022b. Reasoning about Actions over Visual and Linguistic Modalities: A Survey.
- Sap et al. (2019) Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A Smith, and Yejin Choi. 2019. Atomic: An atlas of machine commonsense for if-then reasoning. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 3027–3035.
- Sharma (2019) Arpit Sharma. 2019. Using answer set programming for commonsense reasoning in the winograd schema challenge. Theory and Practice of Logic Programming, 19(5-6):1021–1037.
- Spiliopoulou et al. (2022) Evangelia Spiliopoulou, Artidoro Pagnoni, Yonatan Bisk, and Eduard Hovy. 2022. Events realm: Event reasoning of entity states via language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1982–1997.
- Stein and Koller (2023) Katharina Stein and Alexander Koller. 2023. Autoplanbench: Automatically generating benchmarks for llm planners from pddl. arXiv preprint arXiv:2311.09830.
- Tafjord et al. (2021) Oyvind Tafjord, Bhavana Dalvi Mishra, and Peter Clark. 2021. ProofWriter: Generating Implications, Proofs, and Abductive Statements over Natural Language.
- Talmor et al. (2019) Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158.
- Tan et al. (2023) Qingyu Tan, Hwee Tou Ng, and Lidong Bing. 2023. Towards benchmarking and improving the temporal reasoning capability of large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14820–14835.
- Team et al. (2023) Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
- Team et al. (2024) Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. 2024. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295.
- Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Truong et al. (2023) Thinh Hung Truong, Timothy Baldwin, Karin Verspoor, and Trevor Cohn. 2023. Language models are not naysayers: an analysis of language models on negation benchmarks. In Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023), pages 101–114.
- Valmeekam et al. (2024a) Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. 2024a. Planbench: An extensible benchmark for evaluating large language models on planning and reasoning about change. Advances in Neural Information Processing Systems, 36.
- Valmeekam et al. (2024b) Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. 2024b. On the planning abilities of large language models-a critical investigation. Advances in Neural Information Processing Systems, 36.
- Wang et al. (2024) Siyuan Wang, Zhongyu Wei, Yejin Choi, and Xiang Ren. 2024. Can llms reason with rules? logic scaffolding for stress-testing and improving llms. arXiv preprint arXiv:2402.11442.
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837.
- Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering.
- Yao et al. (2022) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.
- Zhang et al. (2018) Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. Record: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint arXiv:1810.12885.
- Zhao et al. (2024) Haiteng Zhao, Chang Ma, Guoyin Wang, Jing Su, Lingpeng Kong, Jingjing Xu, Zhi-Hong Deng, and Hongxia Yang. 2024. Empowering large language model agents through action learning. arXiv preprint arXiv:2402.15809.
- Zhou et al. (2024) Qinhao Zhou, Zihan Zhang, Xiang Xiang, Ke Wang, Yuchuan Wu, and Yongbin Li. 2024. Enhancing the general agent capabilities of low-parameter llms through tuning and multi-branch reasoning. arXiv preprint arXiv:2403.19962.
Appendix
Limitations
While reasoning about actions is not tied to the English language, ActionReasoningBench is currently restricted to questions formulated in English. We evaluated a range of models, including several open-source LLMs and two state-of-the-art proprietary LLMs (GPT-4o and Gemini-1.0-Pro), but resource constraints prevented us from extending the assessment to additional models with different architectures or training methodologies.
Appendix A Additional Literature
A.1 LLMs for Reasoning
Reasoning is a fundamental cognitive process that allows individuals to analyze information, deduce conclusions, solve problems, and make decisions. It is crucial across various domains, from everyday decision-making to scientific research and policy formulation. Effective reasoning leads to better understanding, innovative solutions, and the ability to navigate complex situations, making it an essential skill for humans. This has driven a growing interest in emulating similar reasoning processes within artificial intelligence systems, a pursuit that has led to significant advancements in the development of LLMs. Given the prowess of LLMs in various natural language tasks, it is speculated that these models exhibit reasoning capabilities when they are sufficiently large (Huang and Chang, 2022; Wei et al., 2022). However, to accurately assess the reasoning capabilities of these LLMs, it is crucial to establish benchmarks and frameworks that can evaluate their ability to reason across diverse contexts.
A.2 Legal Reasoning
Legal reasoning is a complex process essential to the legal profession, involving the application of legal rules to specific facts to resolve disputes and make decisions. It requires a deep understanding of laws, statutes, and case precedents and the ability to interpret these in various factual contexts. Notable benchmarks for legal reasoning include works by Guha et al. (2023) and Fei et al. (2023).
A.3 Logical Reasoning
Logical reasoning is an essential cognitive skill that involves the analysis and evaluation of arguments based on the principles of logic. It enables the identification of strong and weak arguments, detection of fallacies, and construction of coherent reasoning. Fundamental across disciplines like mathematics, science, philosophy, and law, logical reasoning allows for drawing conclusions from premises, problem-solving, and informed decision-making. Representative benchmarks for logical reasoning include works by Luo et al. (2024), Han et al. (2024), Liu et al. (2023), Liu et al. (2020), Parmar et al. (2024b), and Tafjord et al. (2021).
A.4 Arithmetic Reasoning
Arithmetic reasoning is the process of using mathematical principles to analyze and solve numerical problems. This skill extends beyond simple calculation to involve understanding numerical relationships, patterns, and sequences. Essential in fields such as finance, engineering, and everyday activities like budgeting, arithmetic reasoning allows individuals to interpret data, forecast outcomes, and make informed decisions. Representative benchmarks include works by Amini et al. (2019), Chen et al. (2022), Cobbe et al. (2021), Koncel-Kedziorski et al. (2016), Ling et al. (2017), Liu et al. (2022), Miao et al. (2021), and Patel et al. (2021).
A.5 Commonsense Reasoning
Commonsense reasoning involves using practical knowledge gained from daily life to make intuitive judgments and decisions. It encompasses understanding societal norms and basic physical principles, allowing individuals to efficiently navigate social and practical tasks. This type of reasoning is essential for adapting to new environments, predicting outcomes, and managing everyday activities, forming a fundamental part of human cognition. Representative benchmarks include CREAK (arXiv:2109.01653) and works by Bisk et al. (2019), Clark et al. (2019), Clark et al. (2018), Geva et al. (2021), Huang et al. (2019), Lin et al. (2021), Lourie et al. (2021), Mihaylov et al. (2018), Talmor et al. (2019), and Yang et al. (2018).
A.6 Reasoning about Actions and Change
Reasoning about actions and their effects is essential for both humans and AI systems. For humans, this capability enables us to plan, make decisions, and navigate complex relationships within our environment. Similarly, in the field of AI, reasoning about actions is crucial for enabling intelligent systems to effectively achieve goals, manage uncertainty, and adapt to the dynamic, ever-changing world. This ability is foundational for both humans and AI to interact safely and efficiently with their surroundings. Relevant works by Valmeekam et al. (2024a), Valmeekam et al. (2024b), and Guan et al. (2023) explore reasoning about actions but focus more on the planning capabilities of Large Language Models. Banerjee et al. (2020) investigate the ability of Large Language Models to reason about actions over only four domains, and the synthetic data generated in their work addresses only three question types. He et al. (2023) test the ability of LLMs on RAC only on a variant of Blocksworld with four question types, of which two focus on reasoning about actions and their effects while the other two focus on planning.
Through our work, we provide an in-depth analysis of LLMs’ capability to reason about actions and their effects, with and without ramification constraints, which the works mentioned above do not offer.
Appendix B Data Verification
Three annotators were asked to score, on a scale from 1 to 5, how natural the sentences in the dataset read to them. The annotators received written instructions along with the dataset.
Table 7 shows the scores given by three annotators across all the domains present in ActionReasoningBench.
Domain | Annotator 1 Score | Annotator 2 Score | Annotator 3 Score | Average |
Blocksworld | 3.8 | 5.0 | 3.8 | 4.2 |
Depots | 4.6 | 3.6 | 4.0 | 4.1 |
Driverlog | 4.8 | 3.6 | 4.4 | 4.3 |
Goldminer | 5.0 | 4.0 | 4.6 | 4.5 |
Grippers | 4.6 | 4.0 | 4.6 | 4.4 |
Logistics | 5.0 | 3.4 | 4.4 | 4.3 |
Miconic | 5.0 | 4.4 | 4.0 | 4.5 |
Mystery | 4.0 | 4.0 | 3.6 | 3.9 |
NPuzzle | 3.0 | 3.6 | 4.6 | 3.7 |
Satellite | 4.4 | 4.6 | 4.4 | 4.5 |
Spanner | 5.0 | 5.0 | 3.8 | 4.6 |
Visitall | 3.6 | 3.8 | 4.0 | 3.8 |
Zenotravel | 4.6 | 4.0 | 3.8 | 4.1 |
Average | 4.4 | 4.1 | 4.2 | 4.2 |
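The averages in the table can be re-derived from the individual annotator scores. The following is a minimal sketch of that arithmetic; the names `scores`, `domain_avg`, `annotator_avg`, and `overall_avg` are illustrative, and rounding to one decimal place is an assumption that happens to match the reported values.

```python
# Naturalness scores copied from the table above: (annotator 1, 2, 3) per domain.
scores = {
    "Blocksworld": (3.8, 5.0, 3.8),
    "Depots": (4.6, 3.6, 4.0),
    "Driverlog": (4.8, 3.6, 4.4),
    "Goldminer": (5.0, 4.0, 4.6),
    "Grippers": (4.6, 4.0, 4.6),
    "Logistics": (5.0, 3.4, 4.4),
    "Miconic": (5.0, 4.4, 4.0),
    "Mystery": (4.0, 4.0, 3.6),
    "NPuzzle": (3.0, 3.6, 4.6),
    "Satellite": (4.4, 4.6, 4.4),
    "Spanner": (5.0, 5.0, 3.8),
    "Visitall": (3.6, 3.8, 4.0),
    "Zenotravel": (4.6, 4.0, 3.8),
}


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Row averages: one value per domain, averaged over the three annotators.
domain_avg = {domain: round(mean(row), 1) for domain, row in scores.items()}

# Column averages: one value per annotator, averaged over the 13 domains.
annotator_avg = [round(mean(row[i] for row in scores.values()), 1) for i in range(3)]

# Overall average over all 39 individual scores.
overall_avg = round(mean(v for row in scores.values() for v in row), 1)
```
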
Appendix C Additional Results
In this section, we provide further discussion and additional results on our benchmark.
C.1 Performance on Binary (True/False) Questions
[Tables: results on binary (True/False) questions for GPT-4o, Gemini Pro, LLaMa2 13b, LLaMa3 8b, LLaMa2 7b, and Gemma 7b, in zero-shot and few-shot settings. Rows are grouped by action-sequence length (1, 10, 19) and, within each length, by question category (Object Trk., Fluent Trk., State Trk., Action Exec., Effects, Num. Reas., Hallucination, Composite), followed by the average (AVG). Cell values are not available.]
C.2 Performance on Free-Answer questions evaluated using Fine-tuned RoBERTa Classifier
[Tables: results on free-answer questions for GPT-4o, Gemini Pro, LLaMa2 13b, LLaMa3 8b, LLaMa2 7b, and Gemma 7b, in zero-shot and few-shot settings. Rows are grouped by plan length (1, 10, 19) and, within each length, by question category (Object Trk., Fluent Trk., State Trk., Action Exec., Effects, Num. Reas., Hallucination, Composite), followed by the average (AVG). Cell values are not available.]
C.3 Additional Observations
C.3.1 Observations on Domains Variation
State Tracking and Action Executability are better on the Transportation Domains
Analysis of Figure 4 reveals statistically significant trends in the performance on state tracking and action executability questions within transportation domains. These question types consistently show better performance in transportation-related contexts, with differences that fall outside the standard error of the mean (SEM).
While other tendencies are also observable, such as improved performance in object tracking, fluent tracking, and numerical reasoning within transportation domains, these trends remain within the SEM. This suggests that the differences are not statistically significant although there is a slight preference for transportation settings in these categories. Thus, while transportation domains specifically enhance model capabilities in state tracking and action executability, the influence on other categories of questions is less pronounced and should be interpreted cautiously.
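The SEM-based comparison used in these observations can be sketched in a few lines. This is an illustration under stated assumptions, not the exact procedure behind Figure 4: it assumes that a difference "falls outside the SEM" when the mean ± SEM intervals of the two groups do not overlap, it uses the sample standard deviation, and the accuracy values are hypothetical placeholders rather than results from the benchmark.

```python
import math
import statistics


def mean_and_sem(values):
    """Return (mean, standard error of the mean) using the sample standard deviation."""
    n = len(values)
    if n < 2:
        raise ValueError("SEM needs at least two observations")
    return statistics.fmean(values), statistics.stdev(values) / math.sqrt(n)


def outside_sem(group_a, group_b):
    """True when the mean +/- SEM intervals of the two groups do not overlap.

    Assumed criterion: |mean_a - mean_b| > sem_a + sem_b.
    """
    mean_a, sem_a = mean_and_sem(group_a)
    mean_b, sem_b = mean_and_sem(group_b)
    return abs(mean_a - mean_b) > sem_a + sem_b


# Hypothetical per-domain accuracies (illustrative only, not results from the paper).
transportation = [0.72, 0.75, 0.70, 0.74]
non_transportation = [0.61, 0.66, 0.58, 0.63]

clear_gap = outside_sem(transportation, non_transportation)
```

Non-overlapping SEM intervals are a rough screening heuristic; a formal significance claim would normally rest on a hypothesis test or confidence intervals.
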
Hallucination Detection is better on the Non-Transportation Domains
Figure 4 highlights a statistically significant trend wherein hallucination detection tasks exhibit better performance in non-transportation domains. This notable improvement suggests that models may be better tuned or more responsive to the complexities and nuances in scenarios outside of transportation contexts when it comes to recognizing and correcting hallucinations in the data.
Conversely, while there is a slight indication that the effects of actions are also better understood in non-transportation domains, this performance difference remains within the SEM. Therefore, although there is a tendency for improved action effect recognition in non-transportation settings, this finding is not statistically robust and cannot be conclusively affirmed. This underscores the variability in model performance across different domains and highlights the need for cautious interpretation and further validation.
Little statistical difference for Ramifications and Domain Types
Analysis of Figure 8 shows that the performance on questions involving ramifications versus those without ramifications falls statistically within the standard error of the mean (SEM). This indicates that the presence or absence of ramifications does not significantly impact overall model performance. Although there is a slight tendency for base and derived fluents to perform better in scenarios involving ramifications, this difference also remains within the SEM boundaries.
In our examination of behavior by domains as depicted in Figure 5, a similar pattern emerges. Base and derived fluents tend towards better performance in transportation-related scenarios, whereas static fluents tend to perform better in non-transportation contexts. However, all these observations remain statistically within the standard error of the mean (SEM), indicating that these trends do not represent significant deviations. For persistent fluents, there is a noticeable, albeit slight, inclination towards better performance in transportation scenarios. Nonetheless, given the proximity of these results to the SEM threshold, these findings should be approached cautiously.
C.3.2 Fluents
C.4 Ramifications
Appendix D Fine-tuning Results and Plots
In this section, we describe the fine-tuning performed on the training split of ActionReasoningBench described in Section 3.6. We fine-tune Llama3-8b for 2 epochs with the AdamW optimizer and a batch size of 4. Figures 9 and 10 show the loss over time for binary and free-answer fine-tuning, respectively.
Appendix E Free Answers Evaluation Details
E.1 Metrics Tests
Despite the widespread use of ROUGE (Lin, 2004) in natural language generation tasks, it has several inherent limitations that undermine its reliability as a metric for our task. We highlight some examples below to support our claim.
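One such limitation can be illustrated with a simplified unigram-overlap score. The sketch below is a hand-written approximation of ROUGE-1 F1 (whitespace tokenization, no stemming), not the official ROUGE package, and the three sentences are hypothetical rather than drawn from the benchmark. It shows that a response contradicting the reference can score higher than a faithful paraphrase.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Simplified ROUGE-1 F1: clipped unigram overlap between two strings."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences, chosen only to illustrate the failure mode.
reference = "block b1 is on the table"
contradiction = "block b1 is not on the table"  # opposite meaning, one extra token
paraphrase = "b1 rests on top of the table"     # same meaning, different wording

score_contradiction = rouge1_f1(reference, contradiction)
score_paraphrase = rouge1_f1(reference, paraphrase)
```

Here the contradicting sentence shares six of its seven tokens with the reference, while the paraphrase shares only four, so lexical overlap rewards the wrong answer.
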
Appendix F Dataset Creation
F.1 Domain, grid-visit-all example
Going line by line: the :requirements declaration (here, :typing) indicates that objects involved in the domain are of different types. :types defines what types of objects we are working with; in this case, it is just a place. :predicates defines the predicates in the problem, which take objects as arguments. Predicates can only be true or false. Finally, we have the actions, each introduced with :action; in this example, there is a single action, but there can be many in general. Each action has :parameters, which define variables of various types; :precondition, which defines a set of predicates that need to be true in order to execute the action; and :effect, which is a set of predicates that will be true after the action is executed.
F.2 Instance, grid-visit-all example
An instance is composed of objects, initial conditions, and goal conditions. This is a trivial problem instance, as the initial and goal conditions are the same.
F.3 Blocksworld
Blocksworld involves a set of blocks, a hand that can manipulate blocks one at a time, and a table on which the blocks rest. The objective is to use the hand to re-stack an initial arrangement of blocks into a desired configuration.
F.4 Logistics
Logistics is a transportation domain that involves transporting packages using two types of mobiles: trucks and airplanes. Package locations lie within cities, and each city has a unique airport. The objective is to move packages from their initial locations to their goal locations, which can be in different cities.
F.5 Depots
Depots is a transportation domain that combines Blocksworld and Logistics. The transportation aspects are the same as in the Logistics domain, but packages can be stacked as in Blocksworld.
F.6 Driverlog
Driverlog is a transportation domain that involves transporting packages using trucks. It involves packages, trucks, locations, and drivers.
F.7 Goldminer
Goldminer involves an agent, bombs, a laser, gold, and a mine organized as a grid. Each cell in the grid represents soft or hard rock. The agent can use bombs or a laser to penetrate the soft rock to uncover gold. The bomb cannot destroy gold, but the laser can. The objective is to collect gold.
F.8 Gripper
Gripper is a transportation domain that involves a single agent with two "grippers" for picking up and putting down objects. The objective is to move objects from initial locations to goal locations.
F.9 Miconic
Miconic is a transportation domain that involves an elevator, floors, and passengers. The objective is to transport passengers from the initial to goal floors using an elevator.
F.10 Mystery
Mystery is a transportation domain with fuel restrictions. It involves locations, mobiles, fuel and fuel constraints, and cargo. The objective is to deliver the cargo to its goal locations within the fuel constraints.
F.11 N-puzzle
A common variation of this domain is the familiar "15 puzzle", a sliding-tile puzzle with 15 tiles numbered 1-15 and one open slot. The objective is to rearrange the tiles in numerical order. N-puzzle is a generalization to N tiles.
F.12 Satellite
This domain involves the scheduling of satellites to gather information about space phenomena. It involves satellites, instruments, image modes, pointing directions, locations, instrument capabilities, calibration target functions, initial and goal pointing directions, and image objectives.
F.13 Spanner
This domain involves an agent, locations, spanners (tools), and nuts to be tightened. The objective is to pick up spanners and tighten the nuts. The caveat is that each spanner can be used to tighten only one nut.
F.14 Visit-All
This very simple domain involves an agent that has to visit all cells in a grid.
F.15 Zenotravel
Zenotravel is a transportation domain that involves airplanes, locations, fuel levels, and passengers. The objective is to transport passengers from initial to goal locations and not run out of fuel.
F.16 Additional Statistics
Question category | test-true_false_answer | test-free_answer | train-true_false_answer | train-free_answer |
action_executability | 240 | 119 | 2360 | 2481 |
effects | 957 | 108 | 8243 | 1842 |
fluent_tracking | 1600 | 130 | 15801 | 5870 |
hallucination | 240 | 130 | 7660 | 13426 |
numerical_reasoning | 240 | 120 | 12760 | 6380 |
object_tracking | 1598 | 120 | 8696 | 1180 |
state_tracking | 153 | 101 | 1147 | 1849 |
F.17 Domains
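Column groups: Domains Info (# fluents through # object types), Instances Info (<objects> through <log(state space size)>), Other (Domain type, IPC Year, State Complexity). Entries of the form x ± y give the mean and standard deviation.
Domain | # fluents | # actions | <exec> ± std | <effects-wo-ram> ± std | <effects-wi-ram> ± std | # object types | <objects> ± std | <actions from a state> ± std | <log(state space size)> ± std | Domain type | IPC Year | State Complexity |
Blocksworld | 5 | 4 | 2.25 ± 0.83 | 4.5 ± 0.5 | - | 1 | 8.1 ± 0.83 | 132.6 ± 26.67 | 92.46 ± 3.96 | - | 2000 | - |
Depots | 6 | 5 | 3.4 ± 1.36 | 4 ± 1.67 | - | 6 | 25.5 ± 2.25 | 4361.8 ± 1052.02 | 158.68 ± 4.57 | - | 2002 | - |
DriverLog | 5 | 6 | 2.33 ± 0.47 | 2.33 ± 0.47 | - | 4 | 19.1 ± 2.02 | 1120.8 ± 503.46 | 131.37 ± 9.19 | - | 2002 | - |
GoldMiner | 12 | 7 | 3 ± 0.53 | 2.86 ± 0.83 | - | 1 | 15.3 ± 4.29 | 772.8 ± 451.06 | 123.57 ± 10.11 | - | 2018 | - |
Grippers | 4 | 3 | 2 ± 0.82 | 2.67 ± 0.47 | - | 4 | 15.8 ± 1.47 | 245.8 ± 161.38 | 101.14 ± 10.99 | - | 1998 | - |
Logistics | 3 | 6 | 2 ± 0.58 | 2 ± 0 | - | 5 | 18 ± 2.57 | 623.0 ± 312.62 | 119.21 ± 11.73 | - | 1998 | - |
Miconic | 8 | 4 | 2.25 ± 0.43 | 1.75 ± 0.43 | - | 2 | 17.1 ± 1.58 | 279.4 ± 58.04 | 106.6 ± 4.08 | - | 2000 | - |
Mystery | 7 | 3 | 4 ± 0 | 4 ± 0 | - | 5 | 24.2 ± 1.33 | 413.2 ± 154.92 | 113.05 ± 7.48 | - | 1998 | - |
NPuzzle | 3 | 1 | 3 ± 0 | 4 ± 0 | - | 2 | 17 ± 0 | 576.0 ± 0.0 | 120.77 ± 0.0 | - | 2018 | - |
Satellite | 8 | 5 | 2.8 ± 1.47 | 1.8 ± 0.75 | - | 4 | 24.7 ± 3.85 | 1234.8 ± 490.17 | 133.48 ± 8.65 | - | 2002 | - |
Spanner | 6 | 3 | 3 ± 1.41 | 2.33 ± 0.47 | - | 4 | 22.3 ± 0.46 | 455.6 ± 23.83 | 116.29 ± 0.97 | - | 2011 | - |
Visitall | 3 | 1 | 2 ± 0 | 3 ± 0 | - | 1 | 23.9 ± 3.01 | 556.4 ± 148.19 | 119.5 ± 4.73 | - | 2014 | - |
ZenoTravel | 4 | 5 | 2.8 ± 0.75 | 2.8 ± 0.98 | - | 4 | 20.1 ± 1.04 | 4340.0 ± 1282.95 | 158.03 ± 7.03 | - | 2002 | - |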
F.18 Planning Description
Simple planning involves finding a sequence of actions that will take the world from a given initial state to a state that satisfies the goal conditions. Predicates (as defined in PDDL; also referred to as "fluents" in the reasoning about actions community) are properties of the world, such as $on(a, b)$, meaning $a$ is on $b$, and a state of the world describes which fluents are true and which are not. A (planning) domain description describes the fluents in that domain, the actions in that domain, and how the actions impact the fluents. Formal planning work often uses the language PDDL to describe domains. Given a domain description $D$, the semantics of the description language defines a transition function $\Phi_D$, where $\Phi_D(a, s) = s'$ means that execution of the action $a$ in the state $s$ results in the state $s'$. If $\Phi_D(a, s)$ is undefined, then $a$ is said to be inexecutable in $s$. A simple goal is a classical formula whose truth value can be evaluated with respect to a state. A planning instance consists of the triplet $(D, I, G)$, where $D$ is a domain description, $I$ is an initial state, and $G$ is a goal. Given a planning instance $(D, I, G)$, a plan is any sequence of actions $a_1, \ldots, a_n$ where the state $\Phi_D(a_n, \ldots \Phi_D(a_1, I) \ldots)$ is defined and $G$ evaluates to true with respect to that state. The plan generation task that we are concerned with is the task where $D$, $I$, and $G$, described in natural language, are together given as input, and the model has to generate a plan $a_1, \ldots, a_n$.
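The transition function and the plan condition described above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions: actions are STRIPS-style (preconditions, add effects, delete effects), the goal is a conjunction of positive fluents rather than an arbitrary formula, and the names `Action`, `transition`, `is_plan`, and the two-cell instance are hypothetical.

```python
from typing import FrozenSet, NamedTuple, Optional, Sequence

State = FrozenSet[str]  # a state is the set of fluents that are true


class Action(NamedTuple):
    name: str
    preconditions: FrozenSet[str]
    add_effects: FrozenSet[str]
    delete_effects: FrozenSet[str]


def transition(action: Action, state: State) -> Optional[State]:
    """Return the successor state, or None when the action is inexecutable."""
    if not action.preconditions <= state:
        return None
    return (state - action.delete_effects) | action.add_effects


def is_plan(actions: Sequence[Action], initial: State, goal: FrozenSet[str]) -> bool:
    """Check that each action is executable in turn and the goal holds at the end."""
    state: Optional[State] = initial
    for action in actions:
        state = transition(action, state)
        if state is None:  # the transition is undefined from this state
            return False
    return goal <= state


# Hypothetical two-cell, Visit-All style instance.
move_c1_c2 = Action(
    name="move(c1, c2)",
    preconditions=frozenset({"at-robot(c1)", "connected(c1, c2)"}),
    add_effects=frozenset({"at-robot(c2)", "visited(c2)"}),
    delete_effects=frozenset({"at-robot(c1)"}),
)
initial_state = frozenset({"at-robot(c1)", "visited(c1)", "connected(c1, c2)"})
goal = frozenset({"visited(c1)", "visited(c2)"})
```

Here `is_plan([move_c1_c2], initial_state, goal)` holds, while repeating the move makes the second step inexecutable because the robot is no longer at `c1`.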
F.19 PDDL
"The Planning Domain Definition Language (PDDL) is a formal knowledge representation language designed to express planning models … a de-facto standard input language for many planning systems, although it is not the only modeling language for planning. Several variants of PDDL have emerged that capture planning problem of different natures and complexities, with a focus on deterministic problems." (Haslum et al., 2019). One can use PDDL to describe a planning domain and a planning problem instance (objects, initial conditions, and goal conditions). An example of the Visit-All domain and a problem instance described in PDDL can be seen in Appendix F. This particular domain example belongs to a subset of PDDL, a language called "STRIPS" (Stanford Research Institute Problem Solver) (Fikes and Nilsson, 1971). In addition to being STRIPS, this domain description is "typed": objects involved in a problem have type and subtype definitions. In this study, we restrict our attention to typed "STRIPS" domains and instances. These include typed objects, predicates that operate on the objects, actions with preconditions and effects, and a problem instance with typed objects and initial and goal states.
Appendix G Calculating State Space
In this section, we calculate the state-space complexity of the domains present in ActionReasoningBench. In traditional AI, a higher state-space complexity indicates a harder domain (Rintanen, 2004). We therefore examine whether LLMs also struggle more in these “harder” domains. Fig. 11 plots accuracy against the state-space complexity of the domains. The plot reveals that LLMs function differently than traditional AI solvers.
G.1 Blocksworld
The number of fluents in a state is defined by predicates:
• on(b1,b2)
• ontable(b)
• clear(b)
• holding(b)
• handempty
The complexity of a state is, where is the number of objects
(1) |
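One generic way to relate a predicate list such as the one above to a state-space size is to count ground fluents: each predicate contributes the product of the object counts of its parameter types, and a state assigns true or false to every ground fluent. The sketch below is offered under that assumption and is not claimed to be the expression used in this appendix. The function name and the three-block instance are hypothetical.

```python
import math
from typing import Dict, List, Tuple

# A predicate is a name plus the list of its parameter types.
Predicate = Tuple[str, List[str]]


def count_ground_fluents(
    predicates: List[Predicate], objects_per_type: Dict[str, int]
) -> int:
    """Count ground fluents: product of object counts over each predicate's parameters."""
    total = 0
    for _name, param_types in predicates:
        # math.prod of an empty sequence is 1, so a 0-ary predicate counts once.
        total += math.prod(objects_per_type[t] for t in param_types)
    return total


blocksworld: List[Predicate] = [
    ("on", ["block", "block"]),
    ("ontable", ["block"]),
    ("clear", ["block"]),
    ("holding", ["block"]),
    ("handempty", []),
]

# Hypothetical instance with three blocks: 9 + 3 + 3 + 3 + 1 ground fluents.
n_fluents = count_ground_fluents(blocksworld, {"block": 3})  # -> 19
naive_state_bound = 2 ** n_fluents  # every truth assignment, reachable or not
```

This bound is loose: it counts atoms such as a block placed on itself and ignores constraints between fluents, so the number of reachable states is much smaller.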
G.2 Depots
• (at ?x - locatable ?y - place)
• (on ?x - crate ?y - surface)
• (in ?x - crate ?y - truck)
• (lifting ?x - hoist ?y - crate)
• (available ?x - hoist)
• (clear ?x - surface)
(2) |
G.3 Driverlog
• (at ?obj - locatable ?loc - location)
• (in ?obj1 - obj ?obj - truck)
• (driving ?d - driver ?v - truck)
• (link ?x ?y - location)
• (path ?x ?y - location)
• (empty ?v - truck)
(3) |
G.4 Goldminer
• (connected ?x - LOC ?y - LOC)
• (robot-at ?x - LOC)
• (bomb-at ?x - LOC)
• (laser-at ?x - LOC)
• (soft-rock-at ?x - LOC)
• (hard-rock-at ?x - LOC)
• (gold-at ?x - LOC)
• (clear ?x - LOC)
• (arm-empty)
• (holds-bomb)
• (holds-laser)
• (holds-gold)
(4) |
G.5 Grippers
• (carry ?r - robot ?o - object ?g - gripper)
• (at-robby ?r - robot ?x - room)
• (at ?o - object ?x - room)
• (free ?r - robot ?g - gripper)
(5) |
G.6 Logistics
• (in-city ?loc - place ?city - city)
• (at ?obj - physobj ?loc - place)
• (in ?pkg - package ?veh - vehicle)
(6) |
G.7 Miconic
• (origin ?person - passenger ?floor - floor)
• (destin ?person - passenger ?floor - floor)
• (above ?floor1 - floor ?floor2 - floor)
• (boarded ?person - passenger)
• (not-boarded ?person - passenger)
• (served ?person - passenger)
• (not-served ?person - passenger)
• (lift-at ?floor - floor)
(7) |
G.8 Mystery
• (at ?v - movable ?l - location)
• (has-fuel ?l - location ?f - fuel)
• (in ?c - cargo ?v - vehicle)
• (has-space ?v - vehicle ?s - space)
• (conn ?l1 ?l2 - location)
• (fuel-neighbor ?f1 ?f2 - fuel)
• (space-neighbor ?s1 ?s2 - space)
(8) |
G.9 N-Puzzle
• (at ?tile - tile ?position - position)
• (neighbor ?p1 - position ?p2 - position)
• (empty ?position - position)
(9) |
G.10 Satellite
• (on-board ?i - instrument ?s - satellite)
• (supports ?i - instrument ?m - mode)
• (pointing ?s - satellite ?d - direction)
• (have-image ?d - direction ?m - mode)
• (calibration-target ?i - instrument ?d - direction)
• (power-avail ?s - satellite)
• (power-on ?i - instrument)
• (calibrated ?i - instrument)
(10) |
G.11 Spanner
• (at ?m - locatable ?l - location)
• (carrying ?m - man ?s - spanner)
• (link ?l1 - location ?l2 - location)
• (useable ?s - spanner)
• (tightened ?n - nut)
• (loose ?n - nut)
(11) |
G.12 Visitall
• (connected ?x ?y - place)
• (at-robot ?x - place)
• (visited ?x - place)
(12) |
G.13 Zenotravel
• (at ?x - (either person aircraft) ?c - city)
• (in ?p - person ?a - aircraft)
• (fuel-level ?a - aircraft ?l - level)
• (next ?l1 ?l2 - flevel)
(13) |
Appendix H Domain Descriptions
H.1 Without Obfuscation Without Ramifications
H.2 Without Obfuscation With Ramifications
H.3 With Obfuscation With Ramifications
H.4 With Obfuscation Without Ramifications
Appendix I Examples of Questions
I.1 Object Tracking
I.1.1 True/False Question, Action-Sequence Length 1
I.1.2 Free Answer Question, Action-Sequence Length 1
I.2 Fluent Tracking
I.2.1 True/False Question, Action-Sequence Length 1
I.2.2 Free Answer Question, Action-Sequence Length 1
I.3 State Tracking
I.3.1 True/False Question, Action-Sequence Length 1
I.3.2 Free Answer Question, Action-Sequence Length 1
I.4 Action Executability
I.4.1 True/False Question, Action-Sequence Length 1
I.4.2 Free Answer Question, Action-Sequence Length 1
I.5 Effects of Actions
I.5.1 True/False Question, Action-Sequence Length 1
I.5.2 Free Answer Question, Action-Sequence Length 1
I.6 Numerical RAC
I.6.1 True/False Question, Action-Sequence Length 1
I.6.2 Free Answer Question, Action-Sequence Length 1
I.7 Hallucinations
I.7.1 True/False Question, Action-Sequence Length 1
I.7.2 Free Answer Question, Action-Sequence Length 1
I.8 Composite Questions
I.8.1 True/False Question, Action-Sequence Length 1
I.8.2 Free Answer Question, Action-Sequence Length 1
Appendix J Fluent Types Questions
All of the examples below are from the goldminer domain. Each fluent type is defined in Section 3.2. The fluent types for the goldminer domain are:
Base Fluents = {bomb_at, laser_at, soft_rock_at, hard_rock_at, holds_bomb, holds_laser, holds_gold}
Base Fluents with Constraints = {robot_at, gold_at}
Derived Fluents = {arm_empty, clear}
Static Properties = {connected}
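The grouping above can be encoded as a small lookup table, which is convenient when tagging questions by fluent type. The sketch below is illustrative only: the function names are hypothetical, and the rule used for `arm_empty` (true exactly when none of the `holds_*` fluents is true) is an assumed definition of that derived fluent rather than one quoted from the domain file.

```python
from typing import Dict, FrozenSet, Set

FLUENT_TYPES: Dict[str, FrozenSet[str]] = {
    "base": frozenset({
        "bomb_at", "laser_at", "soft_rock_at", "hard_rock_at",
        "holds_bomb", "holds_laser", "holds_gold",
    }),
    "base_with_constraints": frozenset({"robot_at", "gold_at"}),
    "derived": frozenset({"arm_empty", "clear"}),
    "static": frozenset({"connected"}),
}


def fluent_type(predicate: str) -> str:
    """Return the fluent type that a goldminer predicate name belongs to."""
    for kind, names in FLUENT_TYPES.items():
        if predicate in names:
            return kind
    raise KeyError(f"unknown predicate: {predicate}")


def arm_empty(true_fluents: Set[str]) -> bool:
    """Assumed derivation: the arm is empty when the robot holds nothing."""
    held = {"holds_bomb", "holds_laser", "holds_gold"}
    return not (held & true_fluents)
```

For example, `fluent_type("clear")` returns `"derived"`, and `arm_empty({"holds_gold"})` is false under the assumed rule.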