-
Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs Ability to Mark Short Answer Questions in K-12 Education
Authors:
Owen Henkel,
Adam Boxer,
Libby Hills,
Bill Roberts
Abstract:
This paper reports on a series of experiments with a novel dataset evaluating how well Large Language Models (LLMs) can mark (i.e. grade) open text responses to short answer questions. Specifically, we explore how well different combinations of GPT version and prompt engineering strategies performed at marking real student answers to short answer questions across different domain areas (Science and History) and grade-levels (spanning ages 5-16) using a new, never-used-before dataset from Carousel, a quizzing platform. We found that GPT-4, with basic few-shot prompting, performed well (Kappa, 0.70) and, importantly, very close to human-level performance (0.75). This research builds on prior findings that GPT-4 could reliably score short answer reading comprehension questions at a performance-level very close to that of expert human raters. The proximity to human-level performance, across a variety of subjects and grade levels, suggests that LLMs could be a valuable tool for supporting low-stakes formative assessment tasks in K-12 education and has important implications for real-world education delivery.
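For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of Cohen's kappa between two sets of marks. It assumes the abstract's "Kappa" is the unweighted Cohen's kappa, which the abstract does not state; the function name and the toy marks are illustrative and are not taken from the paper.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical marks (1 = correct, 0 = incorrect), purely illustrative.
human_marks = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
model_marks = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
print(cohen_kappa(human_marks, model_marks))
```

Kappa discounts the agreement two raters would reach by chance, which is why it is usually lower than raw percentage agreement on the same marks.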
Submitted 5 May, 2024;
originally announced May 2024.
-
Can LLMs Grade Short-Answer Reading Comprehension Questions : An Empirical Study with a Novel Dataset
Authors:
Owen Henkel,
Libby Hills,
Bill Roberts,
Joshua McGrane
Abstract:
Open-ended questions, which require students to produce multi-word, nontrivial responses, are a popular tool for formative assessment as they provide more specific insights into what students do and don't know. However, grading open-ended questions can be time-consuming, leading teachers to resort to simpler question formats or conduct fewer formative assessments. There has been a longstanding interest in automated short-answer grading (ASAG), but previous approaches have been technically complex, limiting their use in formative assessment contexts. The newest generation of Large Language Models (LLMs) potentially makes grading short answer questions more feasible. This paper investigates the potential for the newest version of LLMs to be used in ASAG, specifically in the grading of short answer questions for formative assessments, in two ways. First, it introduces a novel dataset of short answer reading comprehension questions, drawn from a set of reading assessments conducted with over 150 students in Ghana. This dataset allows for the evaluation of LLMs in a new context, as they are predominantly designed and trained on data from high-income North American countries. Second, the paper empirically evaluates how well various configurations of generative LLMs grade student short answer responses compared to expert human raters. The findings show that GPT-4, with minimal prompt engineering, performed extremely well on grading the novel dataset (QWK 0.92, F1 0.89), reaching near parity with expert human raters. To our knowledge this work is the first to empirically evaluate the performance of generative LLMs on short answer reading comprehension questions using real student data, with low technical hurdles to attaining this performance. These findings suggest that generative LLMs could be used to grade formative literacy assessment tasks.
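The QWK figure above refers to quadratic weighted kappa, an agreement measure for ordinal scores that penalises large disagreements more than small ones. The sketch below shows the standard computation in plain Python; the function name, the three-point scale, and the toy scores are assumptions made for illustration and do not come from the paper.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_categories):
    """Quadratic weighted kappa for integer scores in range(num_categories)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    if num_categories < 2:
        raise ValueError("need at least two score categories")
    k = num_categories
    n = len(rater_a)
    # Confusion matrix of observed score pairs.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Disagreement weight grows with the squared distance between scores.
            weight = (i - j) ** 2 / (k - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    if weighted_expected == 0.0:
        # Every item received the same single score from both raters.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores on a 0-2 scale, purely illustrative.
human_scores = [0, 1, 2, 2, 1, 0]
model_scores = [0, 1, 2, 1, 1, 0]
print(quadratic_weighted_kappa(human_scores, model_scores, 3))
```

Because the weights are quadratic, a model that is off by one point on a few answers loses far less agreement than one that confuses the lowest and highest scores.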
Submitted 5 May, 2024; v1 submitted 26 October, 2023;
originally announced October 2023.
-
Using State-of-the-Art Speech Models to Evaluate Oral Reading Fluency in Ghana
Authors:
Owen Henkel,
Hannah Horne-Robinson,
Libby Hills,
Bill Roberts,
Joshua McGrane
Abstract:
This paper reports on a set of three recent experiments utilizing large-scale speech models to evaluate the oral reading fluency (ORF) of students in Ghana. While ORF is a well-established measure of foundational literacy, assessing it typically requires one-on-one sessions between a student and a trained evaluator, a process that is time-consuming and costly. Automating the evaluation of ORF could support better literacy instruction, particularly in education contexts where formative assessment is uncommon due to large class sizes and limited resources. To our knowledge, this research is among the first to examine the use of the most recent versions of large-scale speech models (Whisper V2, wav2vec2.0) for ORF assessment in the Global South.
We find that Whisper V2 produces transcriptions of Ghanaian students reading aloud with a Word Error Rate of 13.5. This is close to the model's average WER on adult speech (12.8) and would have been considered state-of-the-art for children's speech transcription only a few years ago. We also find that when these transcriptions are used to produce fully automated ORF scores, they closely align with scores generated by expert human graders, with a correlation coefficient of 0.96. Importantly, these results were achieved on a representative dataset (i.e., students with regional accents, recordings taken in actual classrooms), using a free and publicly available speech model out of the box (i.e., no fine-tuning). This suggests that using large-scale speech models to assess ORF may be feasible to implement and scale in lower-resource, linguistically diverse educational contexts.
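Word Error Rate, the transcription metric quoted above, is conventionally the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below computes it with a standard dynamic-programming edit distance; it returns a fraction, whereas the abstract's figures appear to be percentages. The function name and example sentences are illustrative assumptions, and the paper's own text normalisation steps are not described in the abstract.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")
    # previous_row[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row.append(min(
                previous_row[j] + 1,                       # deletion
                current_row[j - 1] + 1,                    # insertion
                previous_row[j - 1] + substitution_cost,   # match or substitution
            ))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# Hypothetical transcripts: one substitution and one deletion over six words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```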
Submitted 26 October, 2023;
originally announced October 2023.
-
The Rhythms of Transient Relationships: Allocating time between weekdays and weekends
Authors:
Valentín Vergara Hidd,
Mailun Zhang,
Simone Centellegher,
Sam G. B. Roberts,
Bruno Lepri,
Eduardo López
Abstract:
A fundamental question of any new relationship is, will it last? Transient relationships, recently defined by the authors, are an ideal type of social tie to explore this question: these relationships are characterized by distinguishable starting and ending temporal points, linking the question of tie longevity to the relationship's finite lifetime. In this study, we use mobile phone data sets from the UK and Italy to analyze the weekly allocation of time invested in maintaining transient relationships. We find that more relationships are created during weekdays, with a greater proportion of them receiving more contact during these days of the week in the long term. The smaller group of relationships that receive more phone calls during the weekend tend to remain active for more time. We uncover a sorting process by which some ties are moved from weekdays to weekends and vice versa, mostly in the first half of the relationship. This process also carries more information about the ultimate lifetime of a tie than the part of the week when the relationship started, which suggests an early evaluation period that leads to a decision on how to allocate time to different types of transient ties.
Submitted 28 August, 2023; v1 submitted 24 May, 2023;
originally announced May 2023.
-
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Authors:
Aarohi Srivastava,
Abhinav Rastogi,
Abhishek Rao,
Abu Awal Md Shoeb,
Abubakar Abid,
Adam Fisch,
Adam R. Brown,
Adam Santoro,
Aditya Gupta,
Adrià Garriga-Alonso,
Agnieszka Kluska,
Aitor Lewkowycz,
Akshat Agarwal,
Alethea Power,
Alex Ray,
Alex Warstadt,
Alexander W. Kocurek,
Ali Safaya,
Ali Tazarv,
Alice Xiang,
Alicia Parrish,
Allen Nie,
Aman Hussain,
Amanda Askell,
Amanda Dsouza
, et al. (426 additional authors not shown)
Abstract:
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
Submitted 12 June, 2023; v1 submitted 9 June, 2022;
originally announced June 2022.
-
Franchised Quantum Money
Authors:
Bhaskar Roberts,
Mark Zhandry
Abstract:
The construction of public key quantum money based on standard cryptographic assumptions is a longstanding open question. Here we introduce franchised quantum money, an alternative form of quantum money that is easier to construct. Franchised quantum money retains the features of a useful quantum money scheme, namely unforgeability and local verification: anyone can verify banknotes without communicating with the bank. In franchised quantum money, every user gets a unique secret verification key, and the scheme is secure against counterfeiting and sabotage, a new security notion that appears in the franchised model. Finally, we construct franchised quantum money and prove security assuming one-way functions.
Submitted 19 October, 2021;
originally announced October 2021.
-
Proximity in face-to-face interaction is associated with mobile phone communication
Authors:
Tobias Bornakke,
Talayeh Aledavood,
Jari Saramäki,
Sam G. B. Roberts
Abstract:
The frequency of mobile communication is often used as an indicator of the strength of a tie between two individuals, but how mobile communication relates to other forms of behaving close in social relationships is poorly understood. We used a unique multi-channel 10-month dataset from 510 participants to examine how the frequency of mobile communication was related to the frequency of face-to-face interaction, as measured by Bluetooth scans between the participants' mobile phones. The number of phone calls between a dyad was significantly related to the number of face-to-face interactions. Physical proximity during face-to-face interactions was the single strongest predictor of the number of phone calls. Overall, 36 percent of variance in phone calls could be explained by face-to-face interactions and the control variables. Our results suggest that the amount of mobile communication between a dyad is a useful but noisy measure of tie strength with some significant limitations.
Submitted 20 July, 2021; v1 submitted 19 July, 2021;
originally announced July 2021.
-
A Dataset for Lane Instance Segmentation in Urban Environments
Authors:
Brook Roberts,
Sebastian Kaltwang,
Sina Samangooei,
Mark Pender-Bare,
Konstantinos Tertikas,
John Redford
Abstract:
Autonomous vehicles require knowledge of the surrounding road layout, which can be predicted by state-of-the-art CNNs. This work addresses the current lack of data for determining lane instances, which are needed for various driving manoeuvres. The main issue is the time-consuming manual labelling process, typically applied per image. We notice that driving the car is itself a form of annotation. Therefore, we propose a semi-automated method that allows for efficient labelling of image sequences by utilising an estimated road plane in 3D based on where the car has driven and projecting labels from this plane into all images of the sequence. The average labelling time per image is reduced to 5 seconds and only an inexpensive dash-cam is required for data capture. We are releasing a dataset of 24,000 images and additionally show experimental semantic segmentation and instance segmentation results.
Submitted 2 August, 2018; v1 submitted 3 July, 2018;
originally announced July 2018.
-
Multichannel social signatures and persistent features of ego networks
Authors:
S. Heydari,
S. G. B. Roberts,
R. I. M. Dunbar,
J. Saramäki
Abstract:
The structure of egocentric networks reflects the way people balance their need for strong, emotionally intense relationships and a diversity of weaker ties. Egocentric network structure can be quantified with 'social signatures', which describe how people distribute their communication effort across the members (alters) of their personal networks. Social signatures based on call data have indicated that people mostly communicate with a few close alters; they also have persistent, distinct signatures. To examine if these results hold for other channels of communication, here we compare social signatures built from call and text message data, and develop a way of constructing mixed social signatures using both channels. We observe that all types of signatures display persistent individual differences that remain stable despite the turnover in individual alters. We also show that call, text, and mixed signatures resemble one another both at the population level and at the level of individuals. The consistency of social signatures across individuals for different channels of communication is surprising because the choice of channel appears to be alter-specific with no clear overall pattern, and ego networks constructed from calls and texts overlap only partially in terms of alters. These results demonstrate individuals vary in how they allocate their communication effort across their personal networks and this variation is persistent over time and across different channels of communication.
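As a rough illustration of the idea described above, a social signature can be sketched as the rank-ordered fractions of one person's communication events that go to each alter. This is a simplified reading of the abstract; the function name and the toy call logs are hypothetical, and the papers' exact definitions and comparison measures are not given here.

```python
from collections import Counter


def social_signature(alter_ids):
    """Fractions of an ego's contacts per alter, ordered from most to least contacted."""
    counts = Counter(alter_ids)
    total = sum(counts.values())
    if total == 0:
        return []
    # most_common() orders alters by descending contact count; alter identity is dropped.
    return [count / total for _, count in counts.most_common()]


# Hypothetical call logs for the same ego in two periods with entirely different alters.
period_one = ["a", "b", "a", "c", "a", "b"]
period_two = ["x", "y", "x", "z", "x", "y"]
print(social_signature(period_one))
print(social_signature(period_one) == social_signature(period_two))
```

Because the signature keeps only ranks and fractions, it can stay the same when every individual alter is replaced, which is the sense in which the abstract describes signatures as stable despite turnover.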
Submitted 7 June, 2018;
originally announced June 2018.
-
Channel-Specific Daily Patterns in Mobile Phone Communication
Authors:
Talayeh Aledavood,
Eduardo López,
Sam G. B. Roberts,
Felix Reed-Tsochas,
Esteban Moro,
Robin I. M. Dunbar,
Jari Saramäki
Abstract:
Humans follow circadian rhythms, visible in their activity levels as well as physiological and psychological factors. Such rhythms are also visible in electronic communication records, where the aggregated activity levels of e.g. mobile telephone calls or Wikipedia edits are known to follow their own daily patterns. Here, we study the daily communication patterns of 24 individuals over 18 months, and show that each individual has a different, persistent communication pattern. These patterns may differ for calls and text messages, which points towards calls and texts serving a different role in communication. For both calls and texts, evenings play a special role. There are also differences in the daily patterns of males and females both for calls and texts, both in how they communicate with individuals of the same gender vs. opposite gender, and also in how communication is allocated at social ties of different nature (kin ties vs. non-kin ties). Taken together, our results show that there is an unexpected richness to the daily communication patterns, from different types of ties being activated at different times of day to different roles of communication channels and gender differences.
Submitted 16 July, 2015;
originally announced July 2015.
-
Daily rhythms in mobile telephone communication
Authors:
Talayeh Aledavood,
Eduardo López,
Sam G. B. Roberts,
Felix Reed-Tsochas,
Esteban Moro,
Robin I. M. Dunbar,
Jari Saramäki
Abstract:
Circadian rhythms are known to be important drivers of human activity and the recent availability of electronic records of human behaviour has provided fine-grained data of temporal patterns of activity on a large scale. Further, questionnaire studies have identified important individual differences in circadian rhythms, with people broadly categorised into morning-like or evening-like individuals. However, little is known about the social aspects of these circadian rhythms, or how they vary across individuals. In this study we use a unique 18-month dataset that combines mobile phone calls and questionnaire data to examine individual differences in the daily rhythms of mobile phone activity. We demonstrate clear individual differences in daily patterns of phone calls, and show that these individual differences are persistent despite a high degree of turnover in the individuals' social networks. Further, women's calls were longer than men's calls, especially during the evening and at night, and these calls were typically focused on a small number of emotionally intense relationships. These results demonstrate that individual differences in circadian rhythms are not just related to broad patterns of morningness and eveningness, but have a strong social component, in directing phone calls to specific individuals at specific times of day.
Submitted 24 February, 2015;
originally announced February 2015.
-
Ranking and Tradeoffs in Sponsored Search Auctions
Authors:
Ben Roberts,
Dinan Gunawardena,
Ian A. Kash,
Peter Key
Abstract:
In a sponsored search auction, decisions about how to rank ads impose tradeoffs between objectives such as revenue and welfare. In this paper, we examine how these tradeoffs should be made. We begin by arguing that the most natural solution concept to evaluate these tradeoffs is the lowest symmetric Nash equilibrium (SNE). As part of this argument, we generalise the well known connection between the lowest SNE and the VCG outcome. We then propose a new ranking algorithm, loosely based on the revenue-optimal auction, that uses a reserve price to order the ads (not just to filter them) and give conditions under which it raises more revenue than simply applying that reserve price. Finally, we conduct extensive simulations examining the tradeoffs enabled by different ranking algorithms and show that our proposed algorithm enables superior operating points by a variety of metrics.
Submitted 29 April, 2013;
originally announced April 2013.
-
Time as a limited resource: Communication Strategy in Mobile Phone Networks
Authors:
Giovanna Miritello,
Esteban Moro,
Rubén Lara,
Rocío Martínez-López,
Sam G. B. Roberts,
Robin I. M. Dunbar
Abstract:
We used a large database of 9 billion calls from 20 million mobile users to examine the relationships between aggregated time spent on the phone, personal network size, tie strength and the way in which users distributed their limited time across their network (disparity). Compared to those with smaller networks, those with large networks did not devote proportionally more time to communication and had on average weaker ties (as measured by time spent communicating). Further, there were not substantially different levels of disparity between individuals, in that mobile users tend to distribute their time very unevenly across their network, with a large proportion of calls going to a small number of individuals. Together, these results suggest that there are time constraints which limit tie strength in large personal networks, and that even high levels of mobile communication do not fundamentally alter the disparity of time allocation across networks.
Submitted 11 January, 2013;
originally announced January 2013.
-
The persistence of social signatures in human communication
Authors:
J. Saramaki,
E. A. Leicht,
E. Lopez,
S. G. B. Roberts,
F. Reed-Tsochas,
R. I. M. Dunbar
Abstract:
The social network maintained by a focal individual, or ego, is intrinsically dynamic and typically exhibits some turnover in membership over time as personal circumstances change. However, the consequences of such changes on the distribution of an ego's network ties are not well understood. Here we use a unique 18-month data set that combines mobile phone calls and survey data to track changes in the ego networks and communication patterns of students making the transition from school to university or work. Our analysis reveals that individuals display a distinctive and robust social signature, captured by how interactions are distributed across different alters. Notably, for a given ego, these social signatures tend to persist over time, despite considerable turnover in the identity of alters in the ego network. Thus as new network members are added, some old network members are either replaced or receive fewer calls, preserving the overall distribution of calls across network members. This is likely to reflect the consequences of finite resources such as the time available for communication, the cognitive and emotional effort required to sustain close relationships, and the ability to make emotional investments.
Submitted 16 December, 2013; v1 submitted 25 April, 2012;
originally announced April 2012.