-
How Well Can You Articulate that Idea? Insights from Automated Formative Assessment
Authors:
Mahsa Sheikhi Karizaki,
Dana Gnesdilow,
Sadhana Puntambekar,
Rebecca J. Passonneau
Abstract:
Automated methods are becoming increasingly integrated into studies of formative feedback on students' science explanation writing. Most of this work, however, addresses students' responses to short answer questions. We investigate automated feedback on students' science explanation essays, where students must articulate multiple ideas. Feedback is based on a rubric that identifies the main ideas…
▽ More
Automated methods are becoming increasingly integrated into studies of formative feedback on students' science explanation writing. Most of this work, however, addresses students' responses to short answer questions. We investigate automated feedback on students' science explanation essays, where students must articulate multiple ideas. Feedback is based on a rubric that identifies the main ideas students are prompted to include in explanatory essays about the physics of energy and mass, given their experiments with a simulated roller coaster. We have found that students generally improve on revised versions of their essays. Here, however, we focus on two factors that affect the accuracy of the automated feedback. First, we find that the main ideas in the rubric differ with respect to how much freedom they afford in explanations of the idea, thus explanation of a natural law is relatively constrained. Students have more freedom in how they explain complex relations they observe in their roller coasters, such as transfer of different forms of energy. Second, by tracing the automated decision process, we can diagnose when a student's statement lacks sufficient clarity for the automated tool to associate it more strongly with one of the main ideas above all others. This in turn provides an opportunity for teachers and peers to help students reflect on how to state their ideas more clearly.
△ Less
Submitted 17 April, 2024;
originally announced April 2024.
-
VerAs: Verify then Assess STEM Lab Reports
Authors:
Berk Atil,
Mahsa Sheikhi Karizaki,
Rebecca J. Passonneau
Abstract:
With an increasing focus in STEM education on critical thinking skills, science writing plays an ever more important role in curricula that stress inquiry skills. A recently published dataset of two sets of college level lab reports from an inquiry-based physics curriculum relies on analytic assessment rubrics that utilize multiple dimensions, specifying subject matter knowledge and general compon…
▽ More
With an increasing focus in STEM education on critical thinking skills, science writing plays an ever more important role in curricula that stress inquiry skills. A recently published dataset of two sets of college level lab reports from an inquiry-based physics curriculum relies on analytic assessment rubrics that utilize multiple dimensions, specifying subject matter knowledge and general components of good explanations. Each analytic dimension is assessed on a 6-point scale, to provide detailed feedback to students that can help them improve their science writing skills. Manual assessment can be slow, and difficult to calibrate for consistency across all students in large classes. While much work exists on automated assessment of open-ended questions in STEM subjects, there has been far less work on long-form writing such as lab reports. We present an end-to-end neural architecture that has separate verifier and assessment modules, inspired by approaches to Open Domain Question Answering (OpenQA). VerAs first verifies whether a report contains any content relevant to a given rubric dimension, and if so, assesses the relevant sentences. On the lab reports, VerAs outperforms multiple baselines based on OpenQA systems or Automated Essay Scoring (AES). VerAs also performs well on an analytic rubric for middle school physics essays.
△ Less
Submitted 25 April, 2024; v1 submitted 7 February, 2024;
originally announced February 2024.
-
The Sentiment Problem: A Critical Survey towards Deconstructing Sentiment Analysis
Authors:
Pranav Narayanan Venkit,
Mukund Srinath,
Sanjana Gautam,
Saranya Venkatraman,
Vipul Gupta,
Rebecca J. Passonneau,
Shomir Wilson
Abstract:
We conduct an inquiry into the sociotechnical aspects of sentiment analysis (SA) by critically examining 189 peer-reviewed papers on their applications, models, and datasets. Our investigation stems from the recognition that SA has become an integral component of diverse sociotechnical systems, exerting influence on both social and technical users. By delving into sociological and technological li…
▽ More
We conduct an inquiry into the sociotechnical aspects of sentiment analysis (SA) by critically examining 189 peer-reviewed papers on their applications, models, and datasets. Our investigation stems from the recognition that SA has become an integral component of diverse sociotechnical systems, exerting influence on both social and technical users. By delving into sociological and technological literature on sentiment, we unveil distinct conceptualizations of this term in domains such as finance, government, and medicine. Our study exposes a lack of explicit definitions and frameworks for characterizing sentiment, resulting in potential challenges and biases. To tackle this issue, we propose an ethics sheet encompassing critical inquiries to guide practitioners in ensuring equitable utilization of SA. Our findings underscore the significance of adopting an interdisciplinary approach to defining sentiment in SA and offer a pragmatic solution for its implementation.
△ Less
Submitted 18 October, 2023;
originally announced October 2023.
-
CALM : A Multi-task Benchmark for Comprehensive Assessment of Language Model Bias
Authors:
Vipul Gupta,
Pranav Narayanan Venkit,
Hugo Laurençon,
Shomir Wilson,
Rebecca J. Passonneau
Abstract:
As language models (LMs) become increasingly powerful and widely used, it is important to quantify them for sociodemographic bias with potential for harm. Prior measures of bias are sensitive to perturbations in the templates designed to compare performance across social groups, due to factors such as low diversity or limited number of templates. Also, most previous work considers only one NLP tas…
▽ More
As language models (LMs) become increasingly powerful and widely used, it is important to quantify them for sociodemographic bias with potential for harm. Prior measures of bias are sensitive to perturbations in the templates designed to compare performance across social groups, due to factors such as low diversity or limited number of templates. Also, most previous work considers only one NLP task. We introduce Comprehensive Assessment of Language Models (CALM) for robust measurement of two types of universally relevant sociodemographic bias, gender and race. CALM integrates sixteen datasets for question-answering, sentiment analysis and natural language inference. Examples from each dataset are filtered to produce 224 templates with high diversity (e.g., length, vocabulary). We assemble 50 highly frequent person names for each of seven distinct demographic groups to generate 78,400 prompts covering the three NLP tasks. Our empirical evaluation shows that CALM bias scores are more robust and far less sensitive than previous bias measurements to perturbations in the templates, such as synonym substitution, or to random subset selection of templates. We apply CALM to 20 large language models, and find that for 2 language model series, larger parameter models tend to be more biased than smaller ones. The T0 series is the least biased model families, of the 20 LLMs investigated here. The code is available at https://github.com/vipulgupta1011/CALM.
△ Less
Submitted 23 January, 2024; v1 submitted 23 August, 2023;
originally announced August 2023.
-
Sociodemographic Bias in Language Models: A Survey and Forward Path
Authors:
Vipul Gupta,
Pranav Narayanan Venkit,
Shomir Wilson,
Rebecca J. Passonneau
Abstract:
This paper presents a comprehensive survey of work on sociodemographic bias in language models (LMs). Sociodemographic biases embedded within language models can have harmful effects when deployed in real-world settings. We systematically organize the existing literature into three main areas: types of bias, quantifying bias, and debiasing techniques. We also track the evolution of investigations…
▽ More
This paper presents a comprehensive survey of work on sociodemographic bias in language models (LMs). Sociodemographic biases embedded within language models can have harmful effects when deployed in real-world settings. We systematically organize the existing literature into three main areas: types of bias, quantifying bias, and debiasing techniques. We also track the evolution of investigations of LM bias over the past decade. We identify current trends, limitations, and potential future directions in bias research. To guide future research towards more effective and reliable solutions, we present a checklist of open questions. We also recommend using interdisciplinary approaches to combine works on LM bias with an understanding of the potential harms.
△ Less
Submitted 1 March, 2024; v1 submitted 13 June, 2023;
originally announced June 2023.
-
CONTaiNER: Few-Shot Named Entity Recognition via Contrastive Learning
Authors:
Sarkar Snigdha Sarathi Das,
Arzoo Katiyar,
Rebecca J. Passonneau,
Rui Zhang
Abstract:
Named Entity Recognition (NER) in Few-Shot setting is imperative for entity tagging in low resource domains. Existing approaches only learn class-specific semantic features and intermediate representations from source domains. This affects generalizability to unseen target domains, resulting in suboptimal performances. To this end, we present CONTaiNER, a novel contrastive learning technique that…
▽ More
Named Entity Recognition (NER) in Few-Shot setting is imperative for entity tagging in low resource domains. Existing approaches only learn class-specific semantic features and intermediate representations from source domains. This affects generalizability to unseen target domains, resulting in suboptimal performances. To this end, we present CONTaiNER, a novel contrastive learning technique that optimizes the inter-token distribution distance for Few-Shot NER. Instead of optimizing class-specific attributes, CONTaiNER optimizes a generalized objective of differentiating between token categories based on their Gaussian-distributed embeddings. This effectively alleviates overfitting issues originating from training domains. Our experiments in several traditional test domains (OntoNotes, CoNLL'03, WNUT '17, GUM) and a new large scale Few-Shot NER dataset (Few-NERD) demonstrate that on average, CONTaiNER outperforms previous methods by 3%-13% absolute F1 points while showing consistent performance trends, even in challenging scenarios where previous approaches could not achieve appreciable performance.
△ Less
Submitted 28 March, 2022; v1 submitted 15 September, 2021;
originally announced September 2021.
-
ABCD: A Graph Framework to Convert Complex Sentences to a Covering Set of Simple Sentences
Authors:
Yanjun Gao,
Ting-hao Huang,
Rebecca J. Passonneau
Abstract:
Atomic clauses are fundamental text units for understanding complex sentences. Identifying the atomic sentences within complex sentences is important for applications such as summarization, argument mining, discourse analysis, discourse parsing, and question answering. Previous work mainly relies on rule-based methods dependent on parsing. We propose a new task to decompose each complex sentence i…
▽ More
Atomic clauses are fundamental text units for understanding complex sentences. Identifying the atomic sentences within complex sentences is important for applications such as summarization, argument mining, discourse analysis, discourse parsing, and question answering. Previous work mainly relies on rule-based methods dependent on parsing. We propose a new task to decompose each complex sentence into simple sentences derived from the tensed clauses in the source, and a novel problem formulation as a graph edit task. Our neural model learns to Accept, Break, Copy or Drop elements of a graph that combines word adjacency and grammatical dependencies. The full processing pipeline includes modules for graph construction, graph editing, and sentence generation from the output graph. We introduce DeSSE, a new dataset designed to train and evaluate complex sentence decomposition, and MinWiki, a subset of MinWikiSplit. ABCD achieves comparable performance as two parsing baselines on MinWiki. On DeSSE, which has a more even balance of complex sentence types, our model achieves higher accuracy on the number of atomic sentences than an encoder-decoder baseline. Results include a detailed error analysis.
△ Less
Submitted 22 June, 2021;
originally announced June 2021.
-
OmniGraph: Rich Representation and Graph Kernel Learning
Authors:
Boyi Xie,
Rebecca J. Passonneau
Abstract:
OmniGraph, a novel representation to support a range of NLP classification tasks, integrates lexical items, syntactic dependencies and frame semantic parses into graphs. Feature engineering is folded into the learning through convolution graph kernel learning to explore different extents of the graph. A high-dimensional space of features includes individual nodes as well as complex subgraphs. In e…
▽ More
OmniGraph, a novel representation to support a range of NLP classification tasks, integrates lexical items, syntactic dependencies and frame semantic parses into graphs. Feature engineering is folded into the learning through convolution graph kernel learning to explore different extents of the graph. A high-dimensional space of features includes individual nodes as well as complex subgraphs. In experiments on a text-forecasting problem that predicts stock price change from news for company mentions, OmniGraph beats several benchmarks based on bag-of-words, syntactic dependencies, and semantic trees. The highly expressive features OmniGraph discovers provide insights into the semantics across distinct market sectors. To demonstrate the method's generality, we also report its high performance results on a fine-grained sentiment corpus.
△ Less
Submitted 10 October, 2015;
originally announced October 2015.
-
Abstractive Multi-Document Summarization via Phrase Selection and Merging
Authors:
Lidong Bing,
Piji Li,
Yi Liao,
Wai Lam,
Weiwei Guo,
Rebecca J. Passonneau
Abstract:
We propose an abstraction-based multi-document summarization framework that can construct new sentences by exploring more fine-grained syntactic units than sentences, namely, noun/verb phrases. Different from existing abstraction-based approaches, our method first constructs a pool of concepts and facts represented by phrases from the input documents. Then new sentences are generated by selecting…
▽ More
We propose an abstraction-based multi-document summarization framework that can construct new sentences by exploring more fine-grained syntactic units than sentences, namely, noun/verb phrases. Different from existing abstraction-based approaches, our method first constructs a pool of concepts and facts represented by phrases from the input documents. Then new sentences are generated by selecting and merging informative phrases to maximize the salience of phrases and meanwhile satisfy the sentence construction constraints. We employ integer linear optimization for conducting phrase selection and merging simultaneously in order to achieve the global optimal solution for a summary. Experimental results on the benchmark data set TAC 2011 show that our framework outperforms the state-of-the-art models under automated pyramid evaluation metric, and achieves reasonably well results on manual linguistic quality evaluation.
△ Less
Submitted 5 June, 2015; v1 submitted 4 June, 2015;
originally announced June 2015.
-
Applying Reliability Metrics to Co-Reference Annotation
Authors:
Rebecca J. Passonneau
Abstract:
Studies of the contextual and linguistic factors that constrain discourse phenomena such as reference are coming to depend increasingly on annotated language corpora. In preparing the corpora, it is important to evaluate the reliability of the annotation, but methods for doing so have not been readily available. In this report, I present a method for computing reliability of coreference annotati…
▽ More
Studies of the contextual and linguistic factors that constrain discourse phenomena such as reference are coming to depend increasingly on annotated language corpora. In preparing the corpora, it is important to evaluate the reliability of the annotation, but methods for doing so have not been readily available. In this report, I present a method for computing reliability of coreference annotation. First I review a method for applying the information retrieval metrics of recall and precision to coreference annotation proposed by Marc Vilain and his collaborators. I show how this method makes it possible to construct contingency tables for computing Cohen's Kappa, a familiar reliability metric. By comparing recall and precision to reliability on the same data sets, I also show that recall and precision can be misleadingly high. Because Kappa factors out chance agreement among coders, it is a preferable measure for develo** annotated corpora where no pre-existing target annotation exists.
△ Less
Submitted 10 June, 1997;
originally announced June 1997.
-
Integrating Gricean and Attentional Constraints
Authors:
Rebecca J. Passonneau
Abstract:
This paper concerns how to generate and understand discourse anaphoric noun phrases. I present the results of an analysis of all discourse anaphoric noun phrases (N=1,233) in a corpus of ten narrative monologues, where the choice between a definite pronoun or phrasal NP conforms largely to Gricean constraints on informativeness. I discuss Dale and Reiter's [To appear] recent model and show how i…
▽ More
This paper concerns how to generate and understand discourse anaphoric noun phrases. I present the results of an analysis of all discourse anaphoric noun phrases (N=1,233) in a corpus of ten narrative monologues, where the choice between a definite pronoun or phrasal NP conforms largely to Gricean constraints on informativeness. I discuss Dale and Reiter's [To appear] recent model and show how it can be augmented for understanding as well as generating the range of data presented here. I argue that integrating centering [Grosz et al., 1983] [Kameyama, 1985] with this model can be applied uniformly to discourse anaphoric pronouns and phrasal NPs. I conclude with a hypothesis for addressing the interaction between local and global discourse processing.
△ Less
Submitted 19 May, 1995;
originally announced May 1995.
-
Combining Multiple Knowledge Sources for Discourse Segmentation
Authors:
Diane J. Litman,
Rebecca J. Passonneau
Abstract:
We predict discourse segment boundaries from linguistic features of utterances, using a corpus of spoken narratives as data. We present two methods for develo** segmentation algorithms from training data: hand tuning and machine learning. When multiple types of features are used, results approach human performance on an independent test set (both methods), and using cross-validation (machine l…
▽ More
We predict discourse segment boundaries from linguistic features of utterances, using a corpus of spoken narratives as data. We present two methods for develo** segmentation algorithms from training data: hand tuning and machine learning. When multiple types of features are used, results approach human performance on an independent test set (both methods), and using cross-validation (machine learning).
△ Less
Submitted 10 May, 1995;
originally announced May 1995.
-
Intention-based Segmentation: Human Reliability and Correlation with Linguistic Cues
Authors:
Rebecca J. Passonneau,
Diane J. Litman
Abstract:
Certain spans of utterances in a discourse, referred to here as segments, are widely assumed to form coherent units. Further, the segmental structure of discourse has been claimed to constrain and be constrained by many phenomena. However, there is weak consensus on the nature of segments and the criteria for recognizing or generating them. We present quantitative results of a two part study usi…
▽ More
Certain spans of utterances in a discourse, referred to here as segments, are widely assumed to form coherent units. Further, the segmental structure of discourse has been claimed to constrain and be constrained by many phenomena. However, there is weak consensus on the nature of segments and the criteria for recognizing or generating them. We present quantitative results of a two part study using a corpus of spontaneous, narrative monologues. The first part evaluates the statistical reliability of human segmentation of our corpus, where speaker intention is the segmentation criterion. We then use the subjects' segmentations to evaluate the correlation of discourse segmentation with three linguistic cues (referential noun phrases, cue words, and pauses), using information retrieval metrics.
△ Less
Submitted 9 May, 1994;
originally announced May 1994.