-
Supporting Self-Reflection at Scale with Large Language Models: Insights from Randomized Field Experiments in Classrooms
Authors:
Harsh Kumar,
Ruiwei Xiao,
Benjamin Lawson,
Ilya Musabirov,
Jiakai Shi,
Xinyuan Wang,
Huayin Luo,
Joseph Jay Williams,
Anna Rafferty,
John Stamper,
Michael Liut
Abstract:
Self-reflection on learning experiences constitutes a fundamental cognitive process, essential for the consolidation of knowledge and the enhancement of learning efficacy. However, traditional methods to facilitate reflection often face challenges in personalization, immediacy of feedback, engagement, and scalability. Integration of Large Language Models (LLMs) into the reflection process could mi…
▽ More
Self-reflection on learning experiences constitutes a fundamental cognitive process, essential for the consolidation of knowledge and the enhancement of learning efficacy. However, traditional methods to facilitate reflection often face challenges in personalization, immediacy of feedback, engagement, and scalability. Integration of Large Language Models (LLMs) into the reflection process could mitigate these limitations. In this paper, we conducted two randomized field experiments in undergraduate computer science courses to investigate the potential of LLMs to help students engage in post-lesson reflection. In the first experiment (N=145), students completed a take-home assignment with the support of an LLM assistant; half of these students were then provided access to an LLM designed to facilitate self-reflection. The results indicated that the students assigned to LLM-guided reflection reported increased self-confidence and performed better on a subsequent exam two weeks later than their peers in the control condition. In the second experiment (N=112), we evaluated the impact of LLM-guided self-reflection against other scalable reflection methods, such as questionnaire-based activities and review of key lecture slides, after assignment. Our findings suggest that the students in the questionnaire and LLM-based reflection groups performed equally well and better than those who were only exposed to lecture slides, according to their scores on a proctored exam two weeks later on the same subject matter. These results underscore the utility of LLM-guided reflection and questionnaire-based activities in improving learning outcomes. Our work highlights that focusing solely on the accuracy of LLMs can overlook their potential to enhance metacognitive skills through practices such as self-reflection. We discuss the implications of our research for the Edtech community.
△ Less
Submitted 31 May, 2024;
originally announced June 2024.
-
An Automatic Question Usability Evaluation Toolkit
Authors:
Steven Moore,
Eamon Costello,
Huy A. Nguyen,
John Stamper
Abstract:
Evaluating multiple-choice questions (MCQs) involves either labor intensive human assessments or automated methods that prioritize readability, often overlooking deeper question design flaws. To address this issue, we introduce the Scalable Automatic Question Usability Evaluation Toolkit (SAQUET), an open-source tool that leverages the Item-Writing Flaws (IWF) rubric for a comprehensive and automa…
▽ More
Evaluating multiple-choice questions (MCQs) involves either labor intensive human assessments or automated methods that prioritize readability, often overlooking deeper question design flaws. To address this issue, we introduce the Scalable Automatic Question Usability Evaluation Toolkit (SAQUET), an open-source tool that leverages the Item-Writing Flaws (IWF) rubric for a comprehensive and automated quality evaluation of MCQs. By harnessing the latest in large language models such as GPT-4, advanced word embeddings, and Transformers designed to analyze textual complexity, SAQUET effectively pinpoints and assesses a wide array of flaws in MCQs. We first demonstrate the discrepancy between commonly used automated evaluation metrics and the human assessment of MCQ quality. Then we evaluate SAQUET on a diverse dataset of MCQs across the five domains of Chemistry, Statistics, Computer Science, Humanities, and Healthcare, showing how it effectively distinguishes between flawed and flawless questions, providing a level of analysis beyond what is achievable with traditional metrics. With an accuracy rate of over 94% in detecting the presence of flaws identified by human evaluators, our findings emphasize the limitations of existing evaluation methods and showcase potential in improving the quality of educational assessments.
△ Less
Submitted 30 May, 2024;
originally announced May 2024.
-
Automated Generation and Tagging of Knowledge Components from Multiple-Choice Questions
Authors:
Steven Moore,
Robin Schmucker,
Tom Mitchell,
John Stamper
Abstract:
Knowledge Components (KCs) linked to assessments enhance the measurement of student learning, enrich analytics, and facilitate adaptivity. However, generating and linking KCs to assessment items requires significant effort and domain-specific knowledge. To streamline this process for higher-education courses, we employed GPT-4 to generate KCs for multiple-choice questions (MCQs) in Chemistry and E…
▽ More
Knowledge Components (KCs) linked to assessments enhance the measurement of student learning, enrich analytics, and facilitate adaptivity. However, generating and linking KCs to assessment items requires significant effort and domain-specific knowledge. To streamline this process for higher-education courses, we employed GPT-4 to generate KCs for multiple-choice questions (MCQs) in Chemistry and E-Learning. We analyzed discrepancies between the KCs generated by the Large Language Model (LLM) and those made by humans through evaluation from three domain experts in each subject area. This evaluation aimed to determine whether, in instances of non-matching KCs, evaluators showed a preference for the LLM-generated KCs over their human-created counterparts. We also developed an ontology induction algorithm to cluster questions that assess similar KCs based on their content. Our most effective LLM strategy accurately matched KCs for 56% of Chemistry and 35% of E-Learning MCQs, with even higher success when considering the top five KC suggestions. Human evaluators favored LLM-generated KCs, choosing them over human-assigned ones approximately two-thirds of the time, a preference that was statistically significant across both domains. Our clustering algorithm successfully grouped questions by their underlying KCs without needing explicit labels or contextual information. This research advances the automation of KC generation and classification for assessment items, alleviating the need for student data or predefined KC labels.
△ Less
Submitted 30 May, 2024;
originally announced May 2024.
-
Enhancing LLM-Based Feedback: Insights from Intelligent Tutoring Systems and the Learning Sciences
Authors:
John Stamper,
Ruiwei Xiao,
Xinying Hou
Abstract:
The field of Artificial Intelligence in Education (AIED) focuses on the intersection of technology, education, and psychology, placing a strong emphasis on supporting learners' needs with compassion and understanding. The growing prominence of Large Language Models (LLMs) has led to the development of scalable solutions within educational settings, including generating different types of feedback…
▽ More
The field of Artificial Intelligence in Education (AIED) focuses on the intersection of technology, education, and psychology, placing a strong emphasis on supporting learners' needs with compassion and understanding. The growing prominence of Large Language Models (LLMs) has led to the development of scalable solutions within educational settings, including generating different types of feedback in Intelligent Tutoring Systems. However, the approach to utilizing these models often involves directly formulating prompts to solicit specific information, lacking a solid theoretical foundation for prompt construction and empirical assessments of their impact on learning. This work advocates careful and caring AIED research by going through previous research on feedback generation in ITS, with emphasis on the theoretical frameworks they utilized and the efficacy of the corresponding design in empirical evaluations, and then suggesting opportunities to apply these evidence-based principles to the design, experiment, and evaluation phases of LLM-based feedback generation. The main contributions of this paper include: an avocation of applying more cautious, theoretically grounded methods in feedback generation in the era of generative AI; and practical suggestions on theory and evidence-based feedback design for LLM-powered ITS.
△ Less
Submitted 11 May, 2024; v1 submitted 7 May, 2024;
originally announced May 2024.
-
Uncovering Name-Based Biases in Large Language Models Through Simulated Trust Game
Authors:
Yumou Wei,
Paulo F. Carvalho,
John Stamper
Abstract:
Gender and race inferred from an individual's name are a notable source of stereotypes and biases that subtly influence social interactions. Abundant evidence from human experiments has revealed the preferential treatment that one receives when one's name suggests a predominant gender or race. As large language models acquire more capabilities and begin to support everyday applications, it becomes…
▽ More
Gender and race inferred from an individual's name are a notable source of stereotypes and biases that subtly influence social interactions. Abundant evidence from human experiments has revealed the preferential treatment that one receives when one's name suggests a predominant gender or race. As large language models acquire more capabilities and begin to support everyday applications, it becomes crucial to examine whether they manifest similar biases when encountering names in a complex social interaction. In contrast to previous work that studies name-based biases in language models at a more fundamental level, such as word representations, we challenge three prominent models to predict the outcome of a modified Trust Game, a well-publicized paradigm for studying trust and reciprocity. To ensure the internal validity of our experiments, we have carefully curated a list of racially representative surnames to identify players in a Trust Game and rigorously verified the construct validity of our prompts. The results of our experiments show that our approach can detect name-based biases in both base and instruction-tuned models.
△ Less
Submitted 22 April, 2024;
originally announced April 2024.
-
Exploring How Multiple Levels of GPT-Generated Programming Hints Support or Disappoint Novices
Authors:
Ruiwei Xiao,
Xinying Hou,
John Stamper
Abstract:
Recent studies have integrated large language models (LLMs) into diverse educational contexts, including providing adaptive programming hints, a type of feedback focuses on hel** students move forward during problem-solving. However, most existing LLM-based hint systems are limited to one single hint type. To investigate whether and how different levels of hints can support students' problem-sol…
▽ More
Recent studies have integrated large language models (LLMs) into diverse educational contexts, including providing adaptive programming hints, a type of feedback focuses on hel** students move forward during problem-solving. However, most existing LLM-based hint systems are limited to one single hint type. To investigate whether and how different levels of hints can support students' problem-solving and learning, we conducted a think-aloud study with 12 novices using the LLM Hint Factory, a system providing four levels of hints from general natural language guidance to concrete code assistance, varying in format and granularity. We discovered that high-level natural language hints alone can be helpless or even misleading, especially when addressing next-step or syntax-related help requests. Adding lower-level hints, like code examples with in-line comments, can better support students. The findings open up future work on customizing help responses from content, format, and granularity levels to accurately identify and meet students' learning needs.
△ Less
Submitted 2 April, 2024;
originally announced April 2024.
-
Assessing the Quality of Multiple-Choice Questions Using GPT-4 and Rule-Based Methods
Authors:
Steven Moore,
Huy A. Nguyen,
Tianying Chen,
John Stamper
Abstract:
Multiple-choice questions with item-writing flaws can negatively impact student learning and skew analytics. These flaws are often present in student-generated questions, making it difficult to assess their quality and suitability for classroom usage. Existing methods for evaluating multiple-choice questions often focus on machine readability metrics, without considering their intended use within…
▽ More
Multiple-choice questions with item-writing flaws can negatively impact student learning and skew analytics. These flaws are often present in student-generated questions, making it difficult to assess their quality and suitability for classroom usage. Existing methods for evaluating multiple-choice questions often focus on machine readability metrics, without considering their intended use within course materials and their pedagogical implications. In this study, we compared the performance of a rule-based method we developed to a machine-learning based method utilizing GPT-4 for the task of automatically assessing multiple-choice questions based on 19 common item-writing flaws. By analyzing 200 student-generated questions from four different subject areas, we found that the rule-based method correctly detected 91% of the flaws identified by human annotators, as compared to 79% by GPT-4. We demonstrated the effectiveness of the two methods in identifying common item-writing flaws present in the student-generated questions across different subject areas. The rule-based method can accurately and efficiently evaluate multiple-choice questions from multiple domains, outperforming GPT-4 and going beyond existing metrics that do not account for the educational use of such questions. Finally, we discuss the potential for using these automated methods to improve the quality of questions based on the identified flaws.
△ Less
Submitted 16 July, 2023;
originally announced July 2023.
-
Learnersourcing in the Age of AI: Student, Educator and Machine Partnerships for Content Creation
Authors:
Hassan Khosravi,
Paul Denny,
Steven Moore,
John Stamper
Abstract:
Engaging students in creating novel content, also referred to as learnersourcing, is increasingly recognised as an effective approach to promoting higher-order learning, deeply engaging students with course material and develo** large repositories of content suitable for personalized learning. Despite these benefits, some common concerns and criticisms are associated with learnersourcing (e.g.,…
▽ More
Engaging students in creating novel content, also referred to as learnersourcing, is increasingly recognised as an effective approach to promoting higher-order learning, deeply engaging students with course material and develo** large repositories of content suitable for personalized learning. Despite these benefits, some common concerns and criticisms are associated with learnersourcing (e.g., the quality of resources created by students, challenges in incentivising engagement and lack of availability of reliable learnersourcing systems), which have limited its adoption. This paper presents a framework that considers the existing learnersourcing literature, the latest insights from the learning sciences and advances in AI to offer promising future directions for develo** learnersourcing systems. The framework is designed around important questions and human-AI partnerships relating to four key aspects: (1) creating novel content, (2) evaluating the quality of the created content, (3) utilising learnersourced contributions of students and (4) enabling instructors to support students in the learnersourcing process. We then present two comprehensive case studies that illustrate the application of the proposed framework in relation to two existing popular learnersourcing systems.
△ Less
Submitted 10 June, 2023;
originally announced June 2023.