Neural Automated Writing Evaluation with Corrective Feedback

Izia Xiaoxiao Wang¹∗ Xihan Wu²∗ Edith Coates³ Min Zeng² Jiexin Kuang²
Siliang Liu² Mengyang Qiu⁴ Jungyeul Park²
¹Linguistik Zentrum Zürich, Universität Zürich, Schweiz
²Department of Linguistics, The University of British Columbia, Canada
³Department of Mathematics, The University of British Columbia, Canada
⁴Department of Psychology, Trent University, Canada
Equally contributed authors.

Abstract

The use of technology in second language learning and teaching has become ubiquitous. For the assessment of writing specifically, automated writing evaluation (AWE) and grammatical error correction (GEC) have become immensely popular and effective methods for enhancing writing proficiency and delivering instant, individualized feedback to learners. By leveraging natural language processing (NLP) and machine learning algorithms, AWE and GEC systems have been developed separately to provide language learners with automated corrective feedback and with more accurate and unbiased scoring than would otherwise depend on individual examiners. In this paper, we propose an integrated system for automated writing evaluation with corrective feedback as a means of bridging the gap between AWE and GEC results for second language learners. This system enables language learners to simulate essay writing tests: a student writes and submits an essay, and the system returns an assessment of the writing along with suggested grammatical error corrections. Given that automated scoring and grammatical correction are more efficient and cost-effective than human grading, this integrated system would also alleviate the burden of manually correcting innumerable essays. A link to the demonstration video: https://youtu.be/CoqmBqqcXio

1 Introduction

Second language (L2) instructors and learners both believe that instructors are not fulfilling their duties if they fail to offer sufficient feedback on the essay submissions (Cho, 2018). However, providing and revising feedback can be incredibly time-consuming and laborious tasks for both language instructors and learners. Automated writing evaluation and grammatical error correction tools would offer potential solutions for alleviating the workload for language instructors and reducing the overhead of trying to understand the instructor’s feedback for language learners. These systems help to meet the need for more efficient practices in the digital age, providing instantaneous scoring with corrective feedback.

AWE is defined as “the process of evaluating and scoring written prose via computer programs” (Shermis and Burstein, 2003). In essence, such a system predicts continuous values for a holistic score or a set of trait scores, thereby providing a comprehensive assessment of writing quality. AWE systems need to score essays across multiple source prompts (i.e., cross-prompt scoring) rather than for a single prompt (prompt-specific scoring). In practical usage, a set of scores across different rubrics is needed much more than a single holistic assessment of overall writing quality. Therefore, we argue that an effective AWE system should incorporate cross-prompt rubric scoring to deliver constructive results.

GEC is the automation of detecting and correcting grammatical errors in the text. As the number of second language learners of English rises, and the demand for timely feedback to better facilitate their learning intensifies, GEC has gained increasing popularity and attention in both academia and industry in recent years. Currently, a common approach is to treat GEC as a translation task where incorrect sentences are translated into correct sentences.

In this paper, we demonstrate a system that integrates AWE and GEC and is designed to facilitate the language learning process and promote better learning outcomes. While various approaches leverage the efficiency of GEC or AWE systems to offer language learners solutions that seem instantaneous but are of only momentary benefit, our objective is to introduce a robust and sensible approach to transform traditional language education. Our approach empowers language learners to "write it right" by enabling simulated examination situations.

2 Previous Work

AWE

Jacobs et al. (1981) first proposed the prototype for a set of components that in conjunction capture the communicative effectiveness of compositions holistically. It was later developed and refined by Connor-Linton and Polio (2014). Moreover, a variation of the same holistic evaluation of essays is implicitly used in developing the different attributes in the ASAP++ data set (Mathias and Bhattacharyya, 2018) for English. Currently, neural models dominate AWE systems. Ke and Ng (2019), Ramesh and Sanampudi (2021), and Uto (2021) have summarized recent neural models well. For automatic essay scoring, there are two main model types. Firstly, in RNN-based models, the RNN output is passed through a mean-over-time layer that aggregates the input into a fixed-length vector, followed by a linear layer that produces the scalar score (Taghipour and Ng, 2016), or alternatively, a simple BiLSTM followed by a linear layer is used for predicting essay scores (Alikaniotis et al., 2016). Secondly, transformer-based models, e.g., BERT with BiLSTM with attention (Nadeem et al., 2019) or BERT concatenated with handcrafted features (Uto et al., 2020), can be used to predict the score. Fine-tuning BERT using multiple losses, including regression loss and reranking loss, for constraining automated essay scores has been shown to produce state-of-the-art results (Yang et al., 2020).

GEC

Similar to AWE, neural models have also dominated GEC systems; Wang et al. (2021) present a comprehensive review. In general, recent state-of-the-art systems can be divided into two categories: sequence-to-sequence NMT-based approaches and sequence tagging edit-based approaches. For NMT-based approaches, a transformer-based encoder-decoder architecture is typically used, where source (or incorrect) sentences are encoded into a hidden state with a multi-head self-attention layer (Vaswani et al., 2017). Unlike a typical translation task, GEC only requires changing a few words in a sentence. Incorporating a copying mechanism into seq2seq (that is, copying unchanged words from original sentences) has been shown to significantly boost GEC performance (Zhao et al., 2019). This unique characteristic of GEC (i.e., the high overlap between the source and target sentences) has also led to the development of edit-based approaches (Omelianchuk et al., 2020). Edit-based models also use a transformer-based architecture. However, instead of predicting complete sentences, these models are trained to predict a series of editing operations (e.g., delete, append, and replace), which has dramatically improved the speed of inference while maintaining high performance. Given the lack of sufficient parallel training data in GEC, data augmentation via artificial error generation has been shown to boost performance in both NMT-based and edit-based approaches (Stahlberg and Kumar, 2021). In addition, incorporating large pre-trained masked language models, such as BERT and its variants (Devlin et al., 2019), in pre-training and/or fine-tuning stages has yielded better results (Kaneko et al., 2020).
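To make the edit-based idea concrete, the following is a minimal sketch of how a sequence of per-token editing operations could be applied to a source sentence. The tag names (KEEP, DELETE, REPLACE_x, APPEND_x) are a hypothetical scheme chosen for illustration and are not the tag inventory of any particular system; in a real model the tags would be predicted, whereas here they are written by hand.

```python
from typing import List


def apply_edits(tokens: List[str], tags: List[str]) -> List[str]:
    """Apply one hypothetical edit tag per source token and return the new tokens."""
    if len(tokens) != len(tags):
        raise ValueError("exactly one tag is required per token")
    output: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "KEEP":
            output.append(token)                       # copy the token unchanged
        elif tag == "DELETE":
            continue                                   # drop the token
        elif tag.startswith("REPLACE_"):
            output.append(tag[len("REPLACE_"):])       # substitute a new token
        elif tag.startswith("APPEND_"):
            output.append(token)                       # keep the token ...
            output.append(tag[len("APPEND_"):])        # ... and add one after it
        else:
            raise ValueError(f"unknown tag: {tag}")
    return output


# Hand-written tags for an illustrative learner sentence.
source = "I gess almost people cannot speaking English .".split()
tags = ["KEEP", "REPLACE_guess", "REPLACE_most", "KEEP",
        "KEEP", "REPLACE_speak", "KEEP", "KEEP"]
corrected = apply_edits(source, tags)
print(" ".join(corrected))
```

Because most tokens receive KEEP, the model only has to decide on a small number of local operations, which is the property that makes tagging faster than generating the full target sentence.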

Only a few previous studies have attempted to integrate pre-neural AWE and GEC systems, and they have focused on how language learners perceive the feedback (Ranalli, 2018; O’Neill and Russell, 2019; Zhang, 2020; Ariyanto et al., 2021; Reynolds et al., 2021). Whereas empirical studies have explored students’ perceptions, beliefs, and preferences towards feedback, the influence of such feedback on their writing performance has remained unknown (Truscott, 1999; Ferris, 1999; Chandler, 2003; Ferris, 2004, 2014). Hence, this paper concentrates on the integration of the AWE and GEC systems and its impact on language education. The combination of neural AWE and GEC systems is introduced here for the first time and achieves results that are close to the state of the art (SOTA).

3 System Description

The flowchart in Figure 1 shows the learner’s interactions with the system, from the initial step of submitting the writing to receiving the AWE and GEC results. From the language learner’s perspective, the procedure breaks down into the following steps:

(i) After completing their writing, language learners submit it to the system.
(ii) Taking the submitted writing as input, the system runs both the AWE and GEC components on it.
(iii) For GEC, the output presents in-line grammatical error corrections alongside the original writing.
(iv) For AWE, the output offers one overall score and eight rubric scores, which together deliver the complete set of evaluation results from the AWE system.
(v) At the output interface, the language learner receives the combined feedback from the two integrated outputs generated by AWE and GEC.
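The steps above can be sketched as a small pipeline. This is only an illustrative outline under stated assumptions: the names `Feedback`, `evaluate_essay`, `dummy_scorer`, and `dummy_corrector` are hypothetical, and the two stand-in functions merely mimic the shape of the AWE and GEC outputs rather than the actual models.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Feedback:
    """Combined output shown to the learner (step v)."""
    scores: Dict[str, float]   # AWE output: overall score plus rubric scores
    corrected_text: str        # GEC output: text with corrections applied


def evaluate_essay(
    essay: str,
    score_fn: Callable[[str], Dict[str, float]],
    correct_fn: Callable[[str], str],
) -> Feedback:
    """Run both components on the same submitted essay (steps ii to iv)."""
    if not essay.strip():
        raise ValueError("the submitted essay is empty")
    return Feedback(scores=score_fn(essay), corrected_text=correct_fn(essay))


# Hypothetical stand-ins for the trained AWE and GEC models.
def dummy_scorer(essay: str) -> Dict[str, float]:
    return {"overall": 50.0}


def dummy_corrector(essay: str) -> str:
    return essay.replace("gess", "guess")


feedback = evaluate_essay("I gess so.", dummy_scorer, dummy_corrector)
print(feedback.scores, feedback.corrected_text)
```

Passing the two components in as functions keeps them independent, which mirrors the fact that the AWE and GEC models are trained separately and only combined at the output interface.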

Figure 1: System workflow of integrated AWE and GEC for language learners. A user flow can be applied to simulate examination situations: the language learners receive instant objective scoring results and corrective feedback.

(1) *I gess almost people cannot speaking English.
(2) I guess most people cannot speak English.

Figure 2 shows an interface screenshot of the integrated AWE and GEC results for the sentence from a language learner’s writing in (1), with its grammatical errors corrected in (2). AWE results are reported as absolute values between 0 and 100, regardless of the learner’s proficiency level. If required, these AWE scores can also be normalized to the learner’s proficiency level using min-max normalization.
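As a rough illustration of the min-max normalization mentioned above, the sketch below rescales an absolute score relative to a score band for a proficiency level. The function name and the band limits (40 and 80) are assumptions made for this example; the text does not specify which bounds are used for each level.

```python
def min_max_normalize(score: float, low: float, high: float, scale: float = 100.0) -> float:
    """Rescale `score` from the band [low, high] to [0, scale].

    `low` and `high` are hypothetical bounds for a proficiency level;
    scores outside the band are clipped to its edges.
    """
    if high <= low:
        raise ValueError("high must be greater than low")
    clipped = min(max(score, low), high)
    return (clipped - low) / (high - low) * scale


# An absolute score of 60 within an assumed band of 40 to 80.
print(min_max_normalize(60.0, 40.0, 80.0))
```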

AWE

For AWE models,¹ we use the Automated Student Assessment Prize (ASAP)² dataset and its extension ASAP++ (Mathias and Bhattacharyya, 2018) for English. ASAP was introduced as part of a Kaggle competition in 2012. It has since become widely used in prompt-specific (Alikaniotis et al., 2016; Dong et al., 2017) and cross-prompt (Cummins et al., 2016; Ridley et al., 2021; Jiang et al., 2023; Chen and Li, 2023; Do et al., 2023) automated scoring systems. Since only two of the eight sets in ASAP have scores for individual essay attributes, ASAP++ annotates rubric scores for the other six sets. Together, the current ASAP and ASAP++ datasets contain eight essay sets, one per prompt, each with three to five human-rated rubric scores in addition to the overall score. The rubrics are content, organization, word choice, sentence fluency, conventions, prompt adherence, language, and narrativity. ASAP provides only overall scores for Prompts 1–6, and both overall and rubric scores for Prompts 7 and 8 (content, organization, conventions for Prompt 7, and content, organization, word choice, sentence fluency, conventions for Prompt 8). ASAP++ provides rubric scores for Prompts 1–6 (content, organization, word choice, sentence fluency, conventions for Prompts 1–2, and content, prompt adherence, language, narrativity for Prompts 3–6) to supplement the original overall scores from ASAP.

¹ Since we are working with L2 English learners, we prefer to use AWE instead of AES (automatic essay scoring), which is designated for L1 English learners.
² https://www.kaggle.com/competitions/asap-aes
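The rubric coverage described above can be written down as a small lookup table. The sketch below simply transcribes the prompt-to-rubric assignment from this paragraph; the variable names are illustrative, and the rubric labels are plain strings rather than the column names of any released file.

```python
# Rubrics per prompt, as listed in the paragraph above (overall score not included).
FIVE_TRAITS = ["content", "organization", "word choice", "sentence fluency", "conventions"]
SOURCE_DEPENDENT = ["content", "prompt adherence", "language", "narrativity"]

ASAP_RUBRICS = {
    1: FIVE_TRAITS,                                   # from ASAP++
    2: FIVE_TRAITS,                                   # from ASAP++
    3: SOURCE_DEPENDENT,                              # from ASAP++
    4: SOURCE_DEPENDENT,                              # from ASAP++
    5: SOURCE_DEPENDENT,                              # from ASAP++
    6: SOURCE_DEPENDENT,                              # from ASAP++
    7: ["content", "organization", "conventions"],    # from ASAP
    8: FIVE_TRAITS,                                   # from ASAP
}

# The union of all rubrics gives the rubric scores reported next to the overall score.
all_rubrics = sorted({rubric for rubrics in ASAP_RUBRICS.values() for rubric in rubrics})
print(len(all_rubrics), all_rubrics)
```

Taking the union over all prompts yields eight distinct rubrics, which matches the eight rubric scores the system reports alongside the overall score.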

For the current system, we use a denoised version of the ASAP dataset, obtained by applying simple prompting-based text denoising. The original ASAP dataset employed a named entity recognition approach using Stanford CoreNLP (Manning et al., 2014) and a range of pattern matching rules to eliminate personally identifying information from the essays. Consequently, entities within the text are identified and replaced with strings beginning with the ‘@’ symbol, such as @PERSON1. Furthermore, more than 5% of sentences exhibit encoding issues where UTF-8 symbols are not correctly displayed, in addition to the presence of non-word entities. We classify sentences containing encoding issues and non-word entities as noise and apply a denoising process to them. Our hypothesis is that enhancing text quality through denoising will yield improved linear regression results in AES. For this process, we employ two prompts with gpt-3.5-turbo-instruct sequentially: one to address encoding errors and another to replace non-word entities with arbitrary entity names. It is noteworthy that gpt-3.5-turbo-instruct tends to correct grammatical errors in the original text during sentence generation. To restore the original words, we utilize the .m2 annotation generated by ERRANT (Bryant et al., 2017). This annotation delineates the modifications made to the original text, allowing us to retain only the symbols with corrected encoding and the replaced non-word entities. After text denoising, we use roberta-base (Liu et al., 2019) for the linear regression task. Detailed prompt-by-prompt results are presented in Table 1, which reports the denoised text (Cleaned) alongside the original text (Original) to ensure a fair comparison. During training, we utilize the default values of the Trainer class³ with the RoBERTa base model. We normalize the scores to the range from 0 to 1 and multiply the results by 100 to calculate QWK (quadratic weighted kappa) using the standard evaluation script provided by ets.org.

³ https://huggingface.co/docs/transformers/main_classes/trainer
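The restoration step can be sketched as follows, assuming the usual M2 layout in which an `S` line holds the source tokens and each `A` line holds a token span, an error type, and a correction separated by `|||`. The helper names and the noise test (`is_noise_token`, which flags `@` placeholders and the Unicode replacement character) are hypothetical simplifications; this is not the authors' script, and the M2 block below is hand-written for illustration.

```python
from typing import List, Tuple

Edit = Tuple[int, int, str]  # (start, end, correction) over source token positions


def parse_m2_block(block: str) -> Tuple[List[str], List[Edit]]:
    """Read the source tokens and the edits from one M2 block."""
    tokens: List[str] = []
    edits: List[Edit] = []
    for line in block.strip().splitlines():
        if line.startswith("S "):
            tokens = line[2:].split()
        elif line.startswith("A "):
            fields = line[2:].split("|||")
            start, end = (int(x) for x in fields[0].split())
            if start < 0:            # "no edit" annotations use the span -1 -1
                continue
            correction = "" if fields[2] == "-NONE-" else fields[2]
            edits.append((start, end, correction))
    return tokens, edits


def is_noise_token(token: str) -> bool:
    """Hypothetical noise test: anonymization placeholder or broken encoding."""
    return token.startswith("@") or "\ufffd" in token


def restore(tokens: List[str], edits: List[Edit]) -> List[str]:
    """Apply only the edits that touch noise; all other source words stay as written."""
    output = list(tokens)
    # Work from right to left so earlier token positions remain valid.
    for start, end, correction in sorted(edits, reverse=True):
        if any(is_noise_token(t) for t in tokens[start:end]):
            output[start:end] = correction.split()
    return output


m2_block = """
S @PERSON1 gess the answer .
A 0 1|||R:NOUN|||Alice|||REQUIRED|||-NONE-|||0
A 1 2|||R:SPELL|||guess|||REQUIRED|||-NONE-|||0
"""
source_tokens, m2_edits = parse_m2_block(m2_block)
restored = restore(source_tokens, m2_edits)
print(" ".join(restored))
```

In this example the placeholder is replaced by an entity name while the learner's spelling error is left untouched, so the essay's original errors remain available for scoring.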

	Prompt 1	Prompt 2	Prompt 3	Prompt 4	Prompt 5	Prompt 6	Prompt 7	Prompt 8	Average
Original	0.6187	0.6308	0.6962	0.6626	0.7158	0.5924	0.4622	0.4586	0.6047
Cleaned	0.7344	0.5485	0.6978	0.6540	0.7416	0.6156	0.4637	0.4589	0.6143

Table 1: Prompt-by-prompt QWK results. Each column names the prompt held out as the test set; for example, in the Prompt 1 column, Prompt 1 is used as the test set, Prompts 2-7 as the training set, and Prompt 8 as the dev set.
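For readers unfamiliar with the metric in Table 1, the following is a from-scratch sketch of quadratic weighted kappa (QWK) on integer ratings. It follows the standard definition and is not the evaluation script used for the reported numbers; the function and argument names are illustrative.

```python
from typing import Sequence


def quadratic_weighted_kappa(
    rater_a: Sequence[int], rater_b: Sequence[int], min_rating: int, max_rating: int
) -> float:
    """QWK between two equal-length lists of integer ratings in [min_rating, max_rating]."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    n = max_rating - min_rating + 1
    if n < 2:
        raise ValueError("at least two rating categories are required")

    # Observed agreement matrix: counts of (rating by A, rating by B).
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    total = len(rater_a)
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2          # quadratic disagreement weight
            expected = hist_a[i] * hist_b[j] / total      # count expected by chance
            numerator += weight * observed[i][j]
            denominator += weight * expected
    if denominator == 0.0:   # both raters used one identical category throughout
        return 1.0
    return 1.0 - numerator / denominator


print(quadratic_weighted_kappa([1, 2, 3], [1, 2, 3], 1, 3))
print(quadratic_weighted_kappa([1, 2, 3], [3, 2, 1], 1, 3))
```

A value of 1 indicates perfect agreement, 0 indicates chance-level agreement, and negative values indicate systematic disagreement. Model outputs in the 0 to 1 range would first be multiplied by 100 and rounded to integers before being passed to such a function.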

GEC

We build seq2seq GEC models using fairseq (Ott et al., 2019). We explore ensemble models that incorporate different pre-trained large language models (LLMs), including BERT and BART. The key difference between the two LLMs is in their pre-training objectives. BERT is trained using a masked language modelling objective, where a certain percentage of the input tokens are masked, and the model is trained to predict the original tokens. In contrast, BART is trained using a denoising auto-encoding objective, where the model is trained to reconstruct the original input text from a corrupted version of the input. Our current training setup is largely based on Kaneko et al. (2020). We use four GEC datasets (FCE, NUCLE, W&I+LOCNESS, and Lang-8) provided by the BEA 2019 Shared Task for training (Bryant et al., 2019).⁴ All the incorrect sentences are spell-checked and added back in order to augment the original training set (1,157,324 sentence pairs), resulting in 1,718,693 sentence pairs in total. 6,575 sentence pairs from the FCE and W&I+LOCNESS development sets are used for validation. Our best model (using BERT) achieves an F_0.5 score of 65.29 on the BEA 2019 test set. We have also been exploring various ways of fine-tuning LLMs, as well as approaches such as the edit-based GECToR (Omelianchuk et al., 2020) and T5 (Rothe et al., 2021), which currently provide SOTA results for the GEC task. These pre-trained models are prepared for seamless integration into our system. Prompting GPT for the GEC task has also been investigated, but it has not obtained state-of-the-art results, a finding that still holds in recent work (Loem et al., 2023).

⁴ https://www.cl.cam.ac.uk/research/nl/bea2019st
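For reference, the F_0.5 score reported above is the standard weighted harmonic mean of precision P and recall R over proposed edits, with β = 0.5 so that precision counts twice as much as recall. The symbols P, R, and β are the usual ones and do not appear in the text above.

```latex
F_{\beta} = \frac{(1 + \beta^{2})\, P\, R}{\beta^{2} P + R},
\qquad
F_{0.5} = \frac{1.25\, P\, R}{0.25\, P + R}
```

Weighting precision more heavily reflects the view that, for learners, an incorrect correction is more harmful than a missed one.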

4 Discussion and Future Perspectives

Features and metrics

In the integration, we have examined several automatic metrics as features to describe the attributes of the training dataset. Specifically, these attributes are represented in terms of complexity, fluency, and accuracy features. First, complexity features use quantitative measures such as the numbers of words and sentences in the text and their mean lengths. The length of the written text has been considered an essential feature in learner corpora, mostly for identifying proficiency levels. Complexity features can also measure syntactic complexity in L2 writing (Polio, 1997; Ortega, 2003; Lu, 2010). We also consider syntactic complexity measures proposed for L1 writing, which include Yngve’s depth algorithm (Yngve, 1960), Frazier’s local non-terminal numbers (Frazier, 1985), and the D-level scale (Rosenberg and Abbeduto, 1987; Covington et al., 2006). Second, fluency, understood as the ability of language learners to apply their knowledge of grammar to produce intelligible speech and writing, also plays an important role in language production. For fluency, we considered the two metrics defined in Asano et al. (2017) and Ge et al. (2018) as fluency features of AWE. Third, accuracy represents the ability to produce correct sentences using correct grammar and vocabulary. However, it requires linguistic annotation such as grammatical error categories and error correction. Currently, the AWE and GEC models have been trained independently. We plan to introduce GEC results as accuracy features for the AWE system in future work.
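The surface-level complexity features mentioned above (counts and mean lengths) can be approximated with a few lines of code. This is a minimal sketch: the function name, the regular-expression sentence splitter, and the word pattern are assumptions for illustration, and a real feature extractor would likely rely on a proper tokenizer.

```python
import re
from typing import Dict


def complexity_features(text: str) -> Dict[str, float]:
    """Count words and sentences and compute their mean lengths (naive splitting)."""
    # Split after sentence-final punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Treat runs of letters and apostrophes as words.
    words = re.findall(r"[A-Za-z']+", text)
    num_sentences = len(sentences)
    num_words = len(words)
    return {
        "num_words": num_words,
        "num_sentences": num_sentences,
        "mean_sentence_length": num_words / num_sentences if num_sentences else 0.0,
        "mean_word_length": sum(len(w) for w in words) / num_words if num_words else 0.0,
    }


features = complexity_features("I guess most people cannot speak English. It is hard.")
print(features)
```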

Multilingual aspects for GEC

The current system is demonstrated with GEC results for English, but the GEC model is being extended to a multilingual environment, including Chinese and Korean, using multilingual Transformers such as mBART (Liu et al., 2020). For Chinese, our current GEC training set is from the NLPTEA 2014-2016 CGED Shared Tasks (Lee et al., 2016).⁵ It contains 587,050 sentence pairs from writing samples of L2 learners of Chinese (i.e., the writing section of the computer-based Test of Chinese as a Foreign Language (TOCFL) and of the Hanyu Shuiping Kaoshi (HSK, ‘Chinese Proficiency Test’)). Chinese GEC sentences are segmented into words using pkuseg in SpaCy.⁶ For Korean, the grammatical error correction dataset from the National Institute of Korean Language⁷ contains 57,320 sentence pairs for training, drawn from writing samples of L2 learners of Korean. Korean GEC sentences are preprocessed into sequences of morphemes, the de facto format for Korean machine translation. With ERRANT (Bryant et al., 2017) being adapted to Chinese (Hinson et al., 2020), Korean (Yoon et al., 2023), and many other languages, it is now possible to streamline grammatical error annotation and evaluation across languages.

⁵ https://sites.google.com/view/nlptea2018/shared-task
⁶ https://github.com/explosion/spacy-pkuseg
⁷ https://www.korean.go.kr

Multilingual aspects for AWE and universal rubrics

As with multilingual GEC, we also adopt a multilingual perspective for our AWE system. Although the current AWE is demonstrated on English inputs, it is designed to be compatible with multilingual input data. This is made possible by two properties of the assessment process: (1) the AWE training data share the same underlying scoring scheme regardless of the language being used, and (2) the evaluation standards in the scoring scheme assess the communicative effectiveness of the writers’ compositions rather than any language-specific traits. Both the multilingual scoring scheme and the evaluation standards are derived by comparing the language-specific AWE rubrics proposed in the literature and then extracting the shared analytic traits to serve as the basis for the evaluation rubrics of the multilingual AWE training dataset. Following the same mechanism, our AWE has the potential to expand easily to a new language if that language’s proposed AWE evaluation rubrics can be accommodated into our existing ones. In fact, for the Korean AWE system, Lim et al. (2023) already implemented a holistic scoring system based on the learner corpus from L2 learners of Korean (Park and Lee, 2016). We are currently developing a multilingual rubric scoring system based on the evaluation rubrics from ASAP++ (Mathias and Bhattacharyya, 2018) for English, the ACEA dataset’s multi-traits (He et al., 2022) for Chinese, and the rubrics used in the recently released Korean essay evaluation dataset for Korean.⁸

⁸ Released in October 2022, https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=545

Perceived fairness and accuracy of GEC and AWE:
	- On a scale of 1-5, how fair do you think the grammar feedback on this writing to be? (1= not fair at all, 5= very fair)
	- On a scale of 1-5, how accurate and trustworthy do you think the grammar correction on this writing to be? (1= not accurate/trustworthy at all, 5= very accurate/trustworthy)
	- On a scale of 1-5, how fair do you think the grading and evaluation of this writing to be? (1= not fair at all, 5= very fair)
	- On a scale of 1-5, how accurate and trustworthy do you think the grading and evaluation of this writing to be? (1= not accurate/trustworthy at all, 5= very accurate/trustworthy)

Perceived clarity and helpfulness:
	- On a scale of 1-5, how clear do you think the grading and feedback on this writing to be? (1= not clear at all, 5= very clear)
	- On a scale of 1-5, how helpful do you perceive the feedback on this writing to be in improving this student’s writing skills? (1= not helpful at all, 5= very helpful)
	- On a scale of 1-5, how helpful do you think rubrics with criteria and scores to be in understanding the grading and feedback on this writing assignment? (1= not helpful at all, 5= very helpful)

Mental effort:
	- On a scale of 1-5, how difficult do you find it to perceive the grading and feedback on this writing to be?
	- On a scale of 1-5, how much mental effort does this task (differ for stages 1 and 2) require from you to perceive the feedback?

Attitudinal responses:
	- On a scale of 1-5, how comfortable do you feel about perceiving/using this system to receive the grading and feedback on this writing?
	- On a scale of 1-5, how effective do you feel about perceiving/using this system to receive the grading and feedback on this writing?
