-
The Dangers of Underclaiming: Reasons for Caution When Reporting How NLP Systems Fail
Authors:
Samuel R. Bowman
Abstract:
Researchers in NLP often frame and discuss research results in ways that serve to deemphasize the field's successes, often in response to the field's widespread hype. Though well-meaning, this has yielded many misleading or false claims about the limits of our best technology. This is a problem, and it may be more serious than it looks: It harms our credibility in ways that can make it harder to mitigate present-day harms, like those involving biased systems for content moderation or resume screening. It also limits our ability to prepare for the potentially enormous impacts of more distant future advances. This paper urges researchers to be careful about these claims and suggests some research directions and communication strategies that will make it easier to avoid or rebut them.
Submitted 10 March, 2022; v1 submitted 15 October, 2021;
originally announced October 2021.
-
BBQ: A Hand-Built Bias Benchmark for Question Answering
Authors:
Alicia Parrish,
Angelica Chen,
Nikita Nangia,
Vishakh Padmakumar,
Jason Phang,
Jana Thompson,
Phu Mon Htut,
Samuel R. Bowman
Abstract:
It is well documented that NLP models learn social biases, but little work has been done on how these biases manifest in model outputs for applied tasks like question answering (QA). We introduce the Bias Benchmark for QA (BBQ), a dataset of question sets constructed by the authors that highlight attested social biases against people belonging to protected classes along nine social dimensions relevant for U.S. English-speaking contexts. Our task evaluates model responses at two levels: (i) given an under-informative context, we test how strongly responses reflect social biases, and (ii) given an adequately informative context, we test whether the model's biases override a correct answer choice. We find that models often rely on stereotypes when the context is under-informative, meaning the model's outputs consistently reproduce harmful biases in this setting. Though models are more accurate when the context provides an informative answer, they still rely on stereotypes and average up to 3.4 percentage points higher accuracy when the correct answer aligns with a social bias than when it conflicts, with this difference widening to over 5 points on examples targeting gender for most models tested.
Submitted 15 March, 2022; v1 submitted 15 October, 2021;
originally announced October 2021.
-
Fine-Tuned Transformers Show Clusters of Similar Representations Across Layers
Authors:
Jason Phang,
Haokun Liu,
Samuel R. Bowman
Abstract:
Despite the success of fine-tuning pretrained language encoders like BERT for downstream natural language understanding (NLU) tasks, it is still poorly understood how neural networks change after fine-tuning. In this work, we use centered kernel alignment (CKA), a method for comparing learned representations, to measure the similarity of representations in task-tuned models across layers. In experiments across twelve NLU tasks, we discover a consistent block diagonal structure in the similarity of representations within fine-tuned RoBERTa and ALBERT models, with strong similarity within clusters of earlier and later layers, but not between them. The similarity of later layer representations implies that later layers only marginally contribute to task performance, and we verify in experiments that the top few layers of fine-tuned Transformers can be discarded without hurting performance, even with no further tuning.
Submitted 20 September, 2021; v1 submitted 17 September, 2021;
originally announced September 2021.
-
NOPE: A Corpus of Naturally-Occurring Presuppositions in English
Authors:
Alicia Parrish,
Sebastian Schuster,
Alex Warstadt,
Omar Agha,
Soo-Hwan Lee,
Zhuoye Zhao,
Samuel R. Bowman,
Tal Linzen
Abstract:
Understanding language requires grasping not only the overtly stated content, but also making inferences about things that were left unsaid. These inferences include presuppositions, a phenomenon by which a listener learns about new information through reasoning about what a speaker takes as given. Presuppositions require complex understanding of the lexical and syntactic properties that trigger them as well as the broader conversational context. In this work, we introduce the Naturally-Occurring Presuppositions in English (NOPE) Corpus to investigate the context-sensitivity of 10 different types of presupposition triggers and to evaluate machine learning models' ability to predict human inferences. We find that most of the triggers we investigate exhibit moderate variability. We further find that transformer-based models draw correct inferences in simple cases involving presuppositions, but they fail to capture the minority of exceptional cases in which human judgments reveal complex interactions between context and triggers.
Submitted 14 September, 2021;
originally announced September 2021.
-
Fast, high precision autofocus on a motorised microscope: automating blood sample imaging on the OpenFlexure Microscope
Authors:
Joe Knapper,
Joel T. Collins,
Julian Stirling,
Samuel McDermott,
William Wadsworth,
Richard Bowman
Abstract:
The OpenFlexure Microscope is a 3D printed, low-cost microscope capable of automated image acquisition through the use of a motorised translation stage and a Raspberry Pi imaging system. This automation has applications in research and healthcare, including in supporting the diagnosis of malaria in low resource settings. The plasmodium parasites which cause malaria require high magnification imaging, which has a shallow depth of field, necessitating the development of an accurate and precise autofocus procedure. We present methods of identifying the focal plane of the microscope, and procedures for reliably acquiring a stack of focused images on a system affected by backlash and drift. We also present and assess a method to verify the success of autofocus during the scan. The speed, reliability and precision of each method are evaluated, and the limitations are discussed in terms of the end users' requirements.
Submitted 15 September, 2021; v1 submitted 14 September, 2021;
originally announced September 2021.
-
Comparing Test Sets with Item Response Theory
Authors:
Clara Vania,
Phu Mon Htut,
William Huang,
Dhara Mungra,
Richard Yuanzhe Pang,
Jason Phang,
Haokun Liu,
Kyunghyun Cho,
Samuel R. Bowman
Abstract:
Recent years have seen numerous NLP datasets introduced to evaluate the performance of fine-tuned models on natural language understanding tasks. Recent results from large pretrained models, though, show that many of these datasets are largely saturated and unlikely to be able to detect further progress. What kind of datasets are still effective at discriminating among strong models, and what kind of datasets should we expect to be able to detect future improvements? To measure this uniformly across datasets, we draw on Item Response Theory and evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples. We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models, while SNLI, MNLI, and CommitmentBank seem to be saturated for current strong models. We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
Submitted 1 June, 2021;
originally announced June 2021.
-
What Ingredients Make for an Effective Crowdsourcing Protocol for Difficult NLU Data Collection Tasks?
Authors:
Nikita Nangia,
Saku Sugawara,
Harsh Trivedi,
Alex Warstadt,
Clara Vania,
Samuel R. Bowman
Abstract:
Crowdsourcing is widely used to create data for common natural language understanding tasks. Despite the importance of these datasets for measuring and refining model understanding of language, there has been little focus on the crowdsourcing methods used for collecting the datasets. In this paper, we compare the efficacy of interventions that have been proposed in prior work as ways of improving data quality. We use multiple-choice question answering as a testbed and run a randomized trial by assigning crowdworkers to write questions under one of four different data collection protocols. We find that asking workers to write explanations for their examples is an ineffective stand-alone strategy for boosting NLU example difficulty. However, we find that training crowdworkers, and then using an iterative process of collecting data, sending feedback, and qualifying workers based on expert judgments is an effective means of collecting challenging data. But using crowdsourced, instead of expert judgments, to qualify workers and send feedback does not prove to be effective. We observe that the data from the iterative protocol with expert assessments is more challenging by several measures. Notably, the human--model gap on the unanimous agreement portion of this data is, on average, twice as large as the gap for the baseline protocol data.
Submitted 1 June, 2021;
originally announced June 2021.
-
Does Putting a Linguist in the Loop Improve NLU Data Collection?
Authors:
Alicia Parrish,
William Huang,
Omar Agha,
Soo-Hwan Lee,
Nikita Nangia,
Alex Warstadt,
Karmanya Aggarwal,
Emily Allaway,
Tal Linzen,
Samuel R. Bowman
Abstract:
Many crowdsourced NLP datasets contain systematic gaps and biases that are identified only after data collection is complete. Identifying these issues from early data samples during crowdsourcing should make mitigation more efficient, especially when done iteratively. We take natural language inference as a test case and ask whether it is beneficial to put a linguist 'in the loop' during data collection to dynamically identify and address gaps in the data by introducing novel constraints on the task. We directly compare three data collection protocols: (i) a baseline protocol, (ii) a linguist-in-the-loop intervention with iteratively-updated constraints on the task, and (iii) an extension of linguist-in-the-loop that provides direct interaction between linguists and crowdworkers via a chatroom. The datasets collected with linguist involvement are more reliably challenging than baseline, without loss of quality. But we see no evidence that using this data in training leads to better out-of-domain model performance, and the addition of a chat platform has no measurable effect on the resulting dataset. We suggest integrating expert analysis during data collection so that the expert can dynamically address gaps and biases in the dataset.
Submitted 14 April, 2021;
originally announced April 2021.
-
What Will it Take to Fix Benchmarking in Natural Language Understanding?
Authors:
Samuel R. Bowman,
George E. Dahl
Abstract:
Evaluation for many natural language understanding (NLU) tasks is broken: Unreliable and biased systems score so highly on standard benchmarks that there is little room for researchers who develop better systems to demonstrate their improvements. The recent trend to abandon IID benchmarks in favor of adversarially-constructed, out-of-distribution test sets ensures that current models will perform poorly, but ultimately only obscures the abilities that we want our benchmarks to measure. In this position paper, we lay out four criteria that we argue NLU benchmarks should meet. We argue most current benchmarks fail at these criteria, and that adversarial data collection does not meaningfully address the causes of these failures. Instead, restoring a healthy evaluation ecosystem will require significant progress in the design of benchmark datasets, the reliability with which they are annotated, their size, and the ways they handle social bias.
Submitted 15 October, 2021; v1 submitted 5 April, 2021;
originally announced April 2021.
-
Searching for refractory plasmonic materials: the structural and optical properties of Au$_{3}$Zr intermetallic thin films
Authors:
Hugh Littlehailes,
William R. Hendren,
Stacey Drakeley,
Robert M. Bowman,
Fumin Huang
Abstract:
Optical properties of refractory intermetallic thin films of Au$_{3}$Zr were experimentally investigated for the first time, which show distinctive plasmonic properties in the visible and near infrared region. The films were fabricated through DC magnetron sputtering at various deposition temperatures ranging from room temperature to 427$^{o}$C and annealed at different vacuum levels. Both the structural and optical properties are found to be critically dependent on deposition temperature and anneal conditions. Films deposited between 205-320$^{o}$C are shown to exhibit lower negative permittivity and better thermal stability, which could be linked to specific crystalline orientations. The films are stable when annealed at 10$^{-8}$ Torr, but are partially oxidized when annealed at 10$^{-6}$ Torr, suggesting oxidization could be a restricting issue for high-temperature applications in ambient environment.
Submitted 23 March, 2021;
originally announced March 2021.
-
Modern Microscopy with the Web of Things: The OpenFlexure Microscope Software Stack
Authors:
Joel T. Collins,
Joe Knapper,
Julian Stirling,
Samuel McDermott,
Richard Bowman
Abstract:
Automated and computerised control of scientific instrumentation is almost ubiquitous in the modern laboratory. Most instrumentation is controlled over decades old communication busses or is accessed via proprietary system libraries. This limits which languages and operating systems can be used to control instruments, and poses a significant problem when interfacing multiple instruments into the same experiment. Here we present the OpenFlexure Microscope software stack as an example of how a scientific instrument can be controlled using existing, cross-platform, language-independent, industry-supported standards. We split the control code into client and server applications interfaced via a web API that conforms to the W3C Web of Things standard. This enables simple control of the microscope from multiple languages, provides a modern graphical control interface, and minimises duplicated code. Network control also makes the software stack more robust, allows multiple microscopes to be controlled by one computer, and facilitates sharing of equipment between local or remote users. Using a Web of Things approach in research laboratories has the potential to solve many of the key challenges of experiment integration, using technology that is already well established.
Submitted 28 September, 2021; v1 submitted 4 January, 2021;
originally announced January 2021.
-
Relaxed current matching requirements in highly luminescent perovskite tandem solar cells and their fundamental efficiency limits
Authors:
Alan R. Bowman,
Felix Lang,
Yu-Hsien Chiang,
Alberto Jiménez-Solano,
Kyle Frohna,
Giles E. Eperon,
Edoardo Ruggeri,
Mojtaba Abdi-Jalebi,
Miguel Anaya,
Bettina V. Lotsch,
Samuel D. Stranks
Abstract:
Here we use time-resolved and steady-state optical spectroscopy on state-of-the-art low- and high-bandgap perovskite films for tandems to quantify intrinsic recombination rates and absorption coefficients. We apply these data to calculate the limiting efficiency of perovskite-silicon and all-perovskite two-terminal tandems employing currently available bandgap materials as 42.0 % and 40.8 % respectively. By including luminescence coupling between sub-cells, i.e. the re-emission of photons from the high-bandgap sub-cell and their absorption in the low-bandgap sub-cell, we reveal the stringent need for current matching is relaxed when the high-bandgap sub-cell is a luminescent perovskite compared to calculations that do not consider luminescence coupling. We show luminescence coupling becomes important in all-perovskite tandems when charge carrier trapping rates are < 10$^{6}$ s$^{-1}$ (corresponding to carrier lifetimes longer than 1 $μ$s at low excitation densities) in the high-bandgap sub-cell, which is lowered to 10$^{5}$ s$^{-1}$ in the better-bandgap-matched perovskite-silicon cells. We demonstrate luminescence coupling endows greater flexibility in both sub-cell thicknesses, increased tolerance to different spectral conditions and a reduction in the total thickness of light absorbing layers. To maximally exploit luminescence coupling we reveal a key design rule for luminescent perovskite-based tandems: the high-bandgap sub-cell should always have the higher short-circuit current. Importantly, this can be achieved by reducing the bandgap or increasing the thickness in the high-bandgap sub-cell with minimal reduction in efficiency, thus allowing for wider, unstable bandgap compositions (>1.7 eV) to be avoided. Finally, we experimentally visualise luminescence coupling in an all-perovskite tandem device stack through cross-section luminescence images.
Submitted 1 December, 2020;
originally announced December 2020.
-
When Do You Need Billions of Words of Pretraining Data?
Authors:
Yian Zhang,
Alex Warstadt,
Haau-Sing Li,
Samuel R. Bowman
Abstract:
NLP is currently dominated by general-purpose pretrained language models like RoBERTa, which achieve strong performance on NLU tasks through pretraining on billions of words. But what exact knowledge or skills do Transformer LMs learn from large-scale pretraining that they cannot learn from less data? We adopt four probing methods---classifier probing, information-theoretic probing, unsupervised relative acceptability judgment, and fine-tuning on NLU tasks---and draw learning curves that track the growth of these different measures of linguistic ability with respect to pretraining data volume using the MiniBERTas, a group of RoBERTa models pretrained on 1M, 10M, 100M and 1B words. We find that LMs require only about 10M or 100M words to learn representations that reliably encode most syntactic and semantic features we test. A much larger quantity of data is needed in order to acquire enough commonsense knowledge and other skills required to master typical downstream NLU tasks. The results suggest that, while the ability to encode linguistic features is almost certainly necessary for language understanding, it is likely that other forms of knowledge are the major drivers of recent improvements in language understanding among large pretrained models.
Submitted 10 November, 2020;
originally announced November 2020.
-
Asking Crowdworkers to Write Entailment Examples: The Best of Bad Options
Authors:
Clara Vania,
Ruijie Chen,
Samuel R. Bowman
Abstract:
Large-scale natural language inference (NLI) datasets such as SNLI or MNLI have been created by asking crowdworkers to read a premise and write three new hypotheses, one for each possible semantic relationship (entailment, contradiction, and neutral). While this protocol has been used to create useful benchmark data, it remains unclear whether the writing-based annotation protocol is optimal for any purpose, since it has not been evaluated directly. Furthermore, there is ample evidence that crowdworker writing can introduce artifacts in the data. We investigate two alternative protocols which automatically create candidate (premise, hypothesis) pairs for annotators to label. Using these protocols and a writing-based baseline, we collect several new English NLI datasets of over 3k examples each, each using a fixed amount of annotator time, but a varying number of examples to fit that time budget. Our experiments on NLI and transfer learning show negative results: None of the alternative protocols outperforms the baseline in evaluations of generalization within NLI or on transfer to outside target tasks. We conclude that crowdworker writing is still the best known option for entailment data, highlighting the need for further data collection work to focus on improving writing-based annotation processes.
Submitted 12 October, 2020;
originally announced October 2020.
-
Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually)
Authors:
Alex Warstadt,
Yian Zhang,
Haau-Sing Li,
Haokun Liu,
Samuel R. Bowman
Abstract:
One reason pretraining on self-supervised linguistic tasks is effective is that it teaches models features that are helpful for language understanding. However, we want pretrained models to learn not only to represent linguistic features, but also to use those features preferentially during fine-tuning. With this goal in mind, we introduce a new English-language diagnostic set called MSGS (the Mixed Signals Generalization Set), which consists of 20 ambiguous binary classification tasks that we use to test whether a pretrained model prefers linguistic or surface generalizations during fine-tuning. We pretrain RoBERTa models from scratch on quantities of data ranging from 1M to 1B words and compare their performance on MSGS to the publicly available RoBERTa-base. We find that models can learn to represent linguistic features with little pretraining data, but require far more data to learn to prefer linguistic generalizations over surface ones. Eventually, with about 30B words of pretraining data, RoBERTa-base does demonstrate a linguistic bias with some regularity. We conclude that while self-supervised pretraining is an effective way to learn helpful inductive biases, there is likely room to improve the rate at which models learn which features matter.
Submitted 11 October, 2020;
originally announced October 2020.
-
Counterfactually-Augmented SNLI Training Data Does Not Yield Better Generalization Than Unaugmented Data
Authors:
William Huang,
Haokun Liu,
Samuel R. Bowman
Abstract:
A growing body of work shows that models exploit annotation artifacts to achieve state-of-the-art performance on standard crowdsourced benchmarks---datasets collected from crowdworkers to create an evaluation task---while still failing on out-of-domain examples for the same task. Recent work has explored the use of counterfactually-augmented data---data built by minimally editing a set of seed examples to yield counterfactual labels---to augment training data associated with these benchmarks and build more robust classifiers that generalize better. However, Khashabi et al. (2020) find that this type of augmentation yields little benefit on reading comprehension tasks when controlling for dataset size and cost of collection. We build upon this work by using English natural language inference data to test model generalization and robustness and find that models trained on a counterfactually-augmented SNLI dataset do not generalize better than unaugmented datasets of similar size and that counterfactual augmentation can hurt performance, yielding models that are less robust to challenge examples. Counterfactual augmentation of natural language understanding data through standard crowdsourcing techniques does not appear to be an effective way of collecting training data and further innovation is required to make this general line of work viable.
Submitted 9 October, 2020;
originally announced October 2020.
-
Precise Task Formalization Matters in Winograd Schema Evaluations
Authors:
Haokun Liu,
William Huang,
Dhara A. Mungra,
Samuel R. Bowman
Abstract:
Performance on the Winograd Schema Challenge (WSC), a respected English commonsense reasoning benchmark, recently rocketed from chance accuracy to 89% on the SuperGLUE leaderboard, with relatively little corroborating evidence of a correspondingly large improvement in reasoning ability. We hypothesize that much of this improvement comes from recent changes in task formalization---the combination of input specification, loss function, and reuse of pretrained parameters---by users of the dataset, rather than improvements in the pretrained model's reasoning ability. We perform an ablation on two Winograd Schema datasets that interpolates between the formalizations used before and after this surge, and find (i) framing the task as multiple choice improves performance by 2-6 points and (ii) several additional techniques, including the reuse of a pretrained language modeling head, can mitigate the model's extreme sensitivity to hyperparameters. We urge future benchmark creators to impose additional structure to minimize the impact of formalization decisions on reported results.
Submitted 8 October, 2020;
originally announced October 2020.
-
CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models
Authors:
Nikita Nangia,
Clara Vania,
Rasika Bhalerao,
Samuel R. Bowman
Abstract:
Pretrained language models, especially masked language models (MLMs), have seen success across many NLP tasks. However, there is ample evidence that they use the cultural biases that are undoubtedly present in the corpora they are trained on, implicitly creating harm with biased representations. To measure some forms of social bias in language models against protected demographic groups in the US, we introduce the Crowdsourced Stereotype Pairs benchmark (CrowS-Pairs). CrowS-Pairs has 1508 examples that cover stereotypes dealing with nine types of bias, like race, religion, and age. In CrowS-Pairs a model is presented with two sentences: one that is more stereotyping and another that is less stereotyping. The data focuses on stereotypes about historically disadvantaged groups and contrasts them with advantaged groups. We find that all three of the widely-used MLMs we evaluate substantially favor sentences that express stereotypes in every category in CrowS-Pairs. As work on building less biased models advances, this dataset can be used as a benchmark to evaluate progress.
Submitted 30 September, 2020;
originally announced October 2020.
-
Can neural networks acquire a structural bias from raw linguistic data?
Authors:
Alex Warstadt,
Samuel R. Bowman
Abstract:
We evaluate whether BERT, a widely used neural network for sentence processing, acquires an inductive bias towards forming structural generalizations through pretraining on raw data. We conduct four experiments testing its preference for structural vs. linear generalizations in different structure-dependent phenomena. We find that BERT makes a structural generalization in 3 out of 4 empirical domains---subject-auxiliary inversion, reflexive binding, and verb tense detection in embedded clauses---but makes a linear generalization when tested on NPI licensing. We argue that these results are the strongest evidence so far from artificial learners supporting the proposition that a structural bias can be acquired from raw data. If this conclusion is correct, it is tentative evidence that some linguistic universals can be acquired by learners without innate biases. However, the precise implications for human language acquisition are unclear, as humans learn language from significantly less data than BERT.
Submitted 23 September, 2020; v1 submitted 13 July, 2020;
originally announced July 2020.
-
Self-Training for Unsupervised Parsing with PRPN
Authors:
Anhad Mohananey,
Katharina Kann,
Samuel R. Bowman
Abstract:
Neural unsupervised parsing (UP) models learn to parse without access to syntactic annotations, while being optimized for another task like language modeling. In this work, we propose self-training for neural UP models: we leverage aggregated annotations predicted by copies of our model as supervision for future copies. To be able to use our model's predictions during training, we extend a recent neural UP architecture, the PRPN (Shen et al., 2018a) such that it can be trained in a semi-supervised fashion. We then add examples with parses predicted by our model to our unlabeled UP training data. Our self-trained model outperforms the PRPN by 8.1% F1 and the previous state of the art by 1.6% F1. In addition, we show that our architecture can also be helpful for semi-supervised parsing in ultra-low-resource settings.
Submitted 27 May, 2020;
originally announced May 2020.
-
English Intermediate-Task Training Improves Zero-Shot Cross-Lingual Transfer Too
Authors:
Jason Phang,
Iacer Calixto,
Phu Mon Htut,
Yada Pruksachatkun,
Haokun Liu,
Clara Vania,
Katharina Kann,
Samuel R. Bowman
Abstract:
Intermediate-task training---fine-tuning a pretrained model on an intermediate task before fine-tuning again on the target task---often improves model performance substantially on language understanding tasks in monolingual English settings. We investigate whether English intermediate-task training is still helpful on non-English target tasks. Using nine intermediate language-understanding tasks, we evaluate intermediate-task transfer in a zero-shot cross-lingual setting on the XTREME benchmark. We see large improvements from intermediate training on the BUCC and Tatoeba sentence retrieval tasks and moderate improvements on question-answering target tasks. MNLI, SQuAD and HellaSwag achieve the best overall results as intermediate tasks, while multi-task intermediate offers small additional improvements. Using our best intermediate-task models for each target task, we obtain a 5.4 point improvement over XLM-R Large on the XTREME benchmark, setting the state of the art as of June 2020. We also investigate continuing multilingual MLM during intermediate-task training and using machine-translated intermediate-task data, but neither consistently outperforms simply performing English intermediate-task training.
Submitted 30 September, 2020; v1 submitted 26 May, 2020;
originally announced May 2020.
-
Materials for hydrogen-based energy storage: Past, recent progress and future outlook
Authors:
Volodymyr A. Yartys,
Marcello Baricco,
Jose Bellosta von Colbe,
Didier Blanchard,
Robert C. Bowman Jr.,
Darren P. Broom,
Craig E. Buckley,
Fei Chang,
Ping Chen,
Young Whan Cho,
Jean-Claude Crivello,
Fermin Cuevas,
William I. F. David,
Petra E. de Jongh,
Roman V. Denys,
Martin Dornheim,
Michael Felderhoff,
Yaroslav Filinchuk,
George E. Froudakis,
David M. Grant,
Bjørn C. Hauback,
Ladislav Havela,
Teng He,
Michael Hirscher,
Terry D. Humphries, et al. (23 additional authors not shown)
Abstract:
Magnesium hydride owns the largest share of publications on solid materials for hydrogen storage. The Magnesium group of international experts contributing to IEA Task 32 Hydrogen Based Energy Storage recently published two review papers presenting the activities of the group focused on magnesium hydride based materials and on Mg based compounds for hydrogen and energy storage. This review article not only overviews the latest activities on both fundamental aspects of Mg-based hydrides and their applications, but also presents a historic overview on the topic and outlines projected future developments. Particular attention is paid to the theoretical and experimental studies of the Mg-H system at extreme pressures, kinetics and thermodynamics of the systems based on MgH2, nanostructuring, new Mg-based compounds and novel composites, and catalysis in the Mg based H storage systems. Finally, thermal energy storage and upscaled H storage systems accommodating MgH2 are presented.
Submitted 7 May, 2020;
originally announced May 2020.
-
Intermediate-Task Transfer Learning with Pretrained Models for Natural Language Understanding: When and Why Does It Work?
Authors:
Yada Pruksachatkun,
Jason Phang,
Haokun Liu,
Phu Mon Htut,
Xiaoyi Zhang,
Richard Yuanzhe Pang,
Clara Vania,
Katharina Kann,
Samuel R. Bowman
Abstract:
While pretrained models such as BERT have shown large gains across natural language understanding tasks, their performance can be improved by further training the model on a data-rich intermediate task, before fine-tuning it on a target task. However, it is still poorly understood when and why intermediate-task training is beneficial for a given target task. To investigate this, we perform a large-scale study on the pretrained RoBERTa model with 110 intermediate-target task combinations. We further evaluate all trained models with 25 probing tasks meant to reveal the specific skills that drive transfer. We observe that intermediate tasks requiring high-level inference and reasoning abilities tend to work best. We also observe that target task performance is strongly correlated with higher-level abilities such as coreference resolution. However, we fail to observe more granular correlations between probing and target task performance, highlighting the need for further work on broad-coverage probing benchmarks. We also observe evidence that the forgetting of knowledge learned during pretraining may limit our analysis, highlighting the need for further work on transfer learning methods in these settings.
Submitted 9 May, 2020; v1 submitted 1 May, 2020;
originally announced May 2020.
-
Learning to Learn Morphological Inflection for Resource-Poor Languages
Authors:
Katharina Kann,
Samuel R. Bowman,
Kyunghyun Cho
Abstract:
We propose to cast the task of morphological inflection - mapping a lemma to an indicated inflected form - for resource-poor languages as a meta-learning problem. Treating each language as a separate task, we use data from high-resource source languages to learn a set of model parameters that can serve as a strong initialization point for fine-tuning on a resource-poor target language. Experiments with two model architectures on 29 target languages from 3 families show that our suggested approach outperforms all baselines. In particular, it obtains a 31.7% higher absolute accuracy than a previously proposed cross-lingual transfer model and outperforms the previous state of the art by 1.7% absolute accuracy on average over languages.
Submitted 28 April, 2020;
originally announced April 2020.
-
New Protocols and Negative Results for Textual Entailment Data Collection
Authors:
Samuel R. Bowman,
Jennimaria Palomaki,
Livio Baldini Soares,
Emily Pitler
Abstract:
Natural language inference (NLI) data has proven useful in benchmarking and, especially, as pretraining data for tasks requiring language understanding. However, the crowdsourcing protocol that was used to collect this data has known issues and was not explicitly optimized for either of these purposes, so it is likely far from ideal. We propose four alternative protocols, each aimed at improving either the ease with which annotators can produce sound training examples or the quality and diversity of those examples. Using these alternatives and a fifth baseline protocol, we collect and compare five new 8.5k-example training sets. In evaluations focused on transfer learning applications, our results are solidly negative, with models trained on our baseline dataset yielding good transfer performance to downstream tasks, but none of our four new methods (nor the recent ANLI) showing any improvements over that baseline. In a small silver lining, we observe that all four new protocols, especially those where annotators edit pre-filled text boxes, reduce previously observed issues with annotation artifacts.
Submitted 29 September, 2020; v1 submitted 24 April, 2020;
originally announced April 2020.
-
jiant: A Software Toolkit for Research on General-Purpose Text Understanding Models
Authors:
Yada Pruksachatkun,
Phil Yeres,
Haokun Liu,
Jason Phang,
Phu Mon Htut,
Alex Wang,
Ian Tenney,
Samuel R. Bowman
Abstract:
We introduce jiant, an open source toolkit for conducting multitask and transfer learning experiments on English NLU tasks. jiant enables modular and configuration-driven experimentation with state-of-the-art models and implements a broad set of tasks for probing, transfer learning, and multitask training experiments. jiant implements over 50 NLU tasks, including all GLUE and SuperGLUE benchmark tasks. We demonstrate that jiant reproduces published performance on a variety of tasks and models, including BERT and RoBERTa. jiant is available at https://jiant.info.
Submitted 13 May, 2020; v1 submitted 4 March, 2020;
originally announced March 2020.
-
BLiMP: The Benchmark of Linguistic Minimal Pairs for English
Authors:
Alex Warstadt,
Alicia Parrish,
Haokun Liu,
Anhad Mohananey,
Wei Peng,
Sheng-Fu Wang,
Samuel R. Bowman
Abstract:
We introduce The Benchmark of Linguistic Minimal Pairs (shortened to BLiMP), a challenge set for evaluating what language models (LMs) know about major grammatical phenomena in English. BLiMP consists of 67 sub-datasets, each containing 1000 minimal pairs isolating specific contrasts in syntax, morphology, or semantics. The data is automatically generated according to expert-crafted grammars, and aggregate human agreement with the labels is 96.4%. We use it to evaluate n-gram, LSTM, and Transformer (GPT-2 and Transformer-XL) LMs. We find that state-of-the-art models identify morphological contrasts reliably, but they struggle with semantic restrictions on the distribution of quantifiers and negative polarity items and subtle syntactic phenomena such as extraction islands.
Submitted 14 February, 2023; v1 submitted 2 December, 2019;
originally announced December 2019.
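A minimal-pair benchmark such as the one in the BLiMP abstract above is commonly scored by checking, for each pair, whether the language model assigns a higher score to the acceptable sentence. The sketch below illustrates that idea only; the function name, the example pairs, and the lookup-table scorer are hypothetical stand-ins, and the scores are made-up numbers rather than output from any real model.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (acceptable, unacceptable) pairs where the
    acceptable sentence receives the strictly higher score."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


# Hypothetical pairs, each listed as (acceptable, unacceptable).
EXAMPLE_PAIRS = [
    ("The cats sleep.", "The cats sleeps."),
    ("Sue doesn't have any cats.", "Sue has any cats."),
]

# Made-up log-probabilities standing in for a real language model.
# The second pair is deliberately scored the wrong way round.
TOY_LOG_PROBS = {
    "The cats sleep.": -10.0,
    "The cats sleeps.": -14.0,
    "Sue doesn't have any cats.": -12.0,
    "Sue has any cats.": -11.5,
}

accuracy = minimal_pair_accuracy(EXAMPLE_PAIRS, TOY_LOG_PROBS.__getitem__)
```

In practice `score` would wrap a real model's sentence log-probability; the comparison logic stays the same.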
-
Flat-field and colour correction for the Raspberry Pi camera module
Authors:
Richard Bowman,
Boyko Vodenicharski,
Joel Collins,
Julian Stirling
Abstract:
The Raspberry Pi camera module is widely used in open source hardware projects as a low cost camera sensor. However, when the stock lens is removed and replaced with other custom optics the sensor will return a non-uniform background and colour response which hampers the use of this excellent and popular image sensor. This effect is found to be due to the sensor's optical design as well as due to built-in corrections in the GPU firmware, which is optimised for a short focal length lens. In this work we characterise and correct the vignetting and colour crosstalk found in the Raspberry Pi camera module v2, presenting two measures that greatly improve the quality of images using custom optics. First, we use a custom "lens shading table" to correct for vignetting of the image, which can be done in real time in the camera's existing processing pipeline (i.e. the camera's low-latency preview is corrected). The second correction is a colour unmixing matrix, which enables us to reverse the loss in saturation at the edge of the image, though this requires post-processing of the image. With both of these corrections in place, it is possible to obtain uniformly colour-corrected images, at the expense of slightly increased noise at the edges of the image.
Submitted 29 November, 2019;
originally announced November 2019.
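The vignetting correction mentioned in the abstract above can be pictured as a per-pixel gain table derived from an image of a uniformly lit target. The sketch below is a generic flat-field illustration under that assumption, not the paper's method or the camera firmware's actual lens shading format; the function names and the tiny 3x3 single-channel example are hypothetical, and pixel values are assumed to be strictly positive.

```python
from typing import List

Image = List[List[float]]


def gain_table(flat: Image) -> Image:
    """Per-pixel gains that would bring a flat-field image up to its peak value.

    `flat` is a single-channel image of a uniformly lit target; every
    value must be > 0 (an assumption of this sketch).
    """
    peak = max(max(row) for row in flat)
    return [[peak / value for value in row] for row in flat]


def apply_gain(image: Image, gains: Image) -> Image:
    """Multiply each pixel by its gain (image and gains must have the same shape)."""
    return [
        [pixel * gain for pixel, gain in zip(image_row, gain_row)]
        for image_row, gain_row in zip(image, gains)
    ]


# Hypothetical flat-field frame: bright centre, darker edges and corners.
FLAT = [
    [50.0, 100.0, 50.0],
    [100.0, 200.0, 100.0],
    [50.0, 100.0, 50.0],
]

GAINS = gain_table(FLAT)
CORRECTED = apply_gain(FLAT, GAINS)
```

Because the gains are largest where the flat-field image is darkest, any noise in those pixels is amplified by the same factor, which is consistent with the abstract's remark about increased noise at the image edges.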
-
Do Attention Heads in BERT Track Syntactic Dependencies?
Authors:
Phu Mon Htut,
Jason Phang,
Shikha Bordia,
Samuel R. Bowman
Abstract:
We investigate the extent to which individual attention heads in pretrained transformer language models, such as BERT and RoBERTa, implicitly capture syntactic dependency relations. We employ two methods---taking the maximum attention weight and computing the maximum spanning tree---to extract implicit dependency relations from the attention weights of each layer/head, and compare them to the ground-truth Universal Dependency (UD) trees. We show that, for some UD relation types, there exist heads that can recover the dependency type significantly better than baselines on parsed English text, suggesting that some self-attention heads act as a proxy for syntactic structure. We also analyze BERT fine-tuned on two datasets---the syntax-oriented CoLA and the semantics-oriented MNLI---to investigate whether fine-tuning affects the patterns of their self-attention, but we do not observe substantial differences in the overall dependency relations extracted using our methods. Our results suggest that these models have some specialist attention heads that track individual dependency types, but no generalist head that performs holistic parsing significantly better than a trivial baseline, and that analyzing attention weights directly may not reveal much of the syntactic knowledge that BERT-style models are known to learn.
Submitted 27 November, 2019;
originally announced November 2019.
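The maximum-attention-weight extraction named in the abstract above can be sketched as follows: for each token, take the other token it attends to most strongly as its predicted head, then compare against gold heads. This is an illustration under stated assumptions, not the authors' implementation: excluding self-attention, the tie-breaking rule, the function names, and the toy attention matrix are all choices made for this example.

```python
from typing import List, Optional


def heads_from_attention(attention: List[List[float]]) -> List[int]:
    """attention[i][j] is the weight token i places on token j.

    Returns, for each token i, the index j != i with the largest weight
    (self-attention is excluded; ties go to the lowest index).
    Requires at least two tokens.
    """
    predicted = []
    for i, row in enumerate(attention):
        candidates = [j for j in range(len(row)) if j != i]
        predicted.append(max(candidates, key=lambda j: row[j]))
    return predicted


def attachment_accuracy(predicted: List[int], gold: List[Optional[int]]) -> float:
    """Share of non-root tokens whose predicted head matches the gold head.

    A gold head of None marks the root, which is skipped.
    """
    scored = [(p, g) for p, g in zip(predicted, gold) if g is not None]
    if not scored:
        raise ValueError("no non-root tokens to score")
    return sum(1 for p, g in scored if p == g) / len(scored)


# Toy example for the tokens ["the", "cat", "sleeps"]; weights are made up.
TOY_ATTENTION = [
    [0.1, 0.8, 0.1],  # "the" attends mostly to "cat"
    [0.2, 0.3, 0.5],  # "cat" attends mostly to "sleeps"
    [0.2, 0.7, 0.1],  # "sleeps" attends mostly to "cat"
]
GOLD_HEADS = [1, 2, None]  # "sleeps" is the root

PREDICTED_HEADS = heads_from_attention(TOY_ATTENTION)
TOY_ACCURACY = attachment_accuracy(PREDICTED_HEADS, GOLD_HEADS)
```

The maximum-spanning-tree variant mentioned alongside it would instead choose all heads jointly so that the result forms a tree; that is not shown here.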
-
The OpenFlexure Block Stage: Sub-100 nm fibre alignment with a monolithic plastic flexure stage
Authors:
Qingxin Meng,
Kerrianne Harrington,
Julian Stirling,
Richard Bowman
Abstract:
As 3D printers become more widely available, researchers are able to rapidly produce components that may have previously taken weeks to have machined. The resulting plastic components, having high surface roughness, are often not suitable for high-precision optomechanics. However, by playing to the strengths of 3D printing---namely the ability to print complex internal geometries---it is possible to design monolithic mechanisms that do not rely on tight integration of high-precision parts. Here we present a motorised monolithic 3D-printed plastic flexure stage with sub-100 nm resolution, that can perform automated optical fibre alignment.
Submitted 22 November, 2019;
originally announced November 2019.
-
Inducing Constituency Trees through Neural Machine Translation
Authors:
Phu Mon Htut,
Kyunghyun Cho,
Samuel R. Bowman
Abstract:
Latent tree learning (LTL) methods learn to parse sentences using only indirect supervision from a downstream task. Recent advances in latent tree learning have made it possible to recover moderately high quality tree structures by training with language modeling or auto-encoding objectives. In this work, we explore the hypothesis that decoding in machine translation, as a conditional language modeling task, will produce better tree structures since it offers a similar training signal as language modeling, but with more semantic signal. We adapt two existing latent-tree language models--PRPN and ON-LSTM--for use in translation. We find that they indeed recover trees that are better in F1 score than those seen in language modeling on the WSJ test set, while maintaining strong translation quality. We observe that translation is a better objective than language modeling for inducing trees, marking the first success at latent tree learning using a machine translation objective. Additionally, our findings suggest that, although translation provides better signal for inducing trees than language modeling, translation models can perform well without exploiting the latent tree structure.
Submitted 22 September, 2019;
originally announced September 2019.
-
Investigating BERT's Knowledge of Language: Five Analysis Methods with NPIs
Authors:
Alex Warstadt,
Yu Cao,
Ioana Grosu,
Wei Peng,
Hagen Blix,
Yining Nie,
Anna Alsop,
Shikha Bordia,
Haokun Liu,
Alicia Parrish,
Sheng-Fu Wang,
Jason Phang,
Anhad Mohananey,
Phu Mon Htut,
Paloma Jeretič,
Samuel R. Bowman
Abstract:
Though state-of-the-art sentence representation models can perform tasks requiring significant knowledge of grammar, it is an open question how best to evaluate their grammatical knowledge. We explore five experimental methods inspired by prior work evaluating pretrained sentence representation models. We use a single linguistic phenomenon, negative polarity item (NPI) licensing in English, as a case study for our experiments. NPIs like "any" are grammatical only if they appear in a licensing environment like negation ("Sue doesn't have any cats" vs. "Sue has any cats"). This phenomenon is challenging because of the variety of NPI licensing environments that exist. We introduce an artificially generated dataset that manipulates key features of NPI licensing for the experiments. We find that BERT has significant knowledge of these features, but its success varies widely across different experimental methods. We conclude that a variety of methods is necessary to reveal all relevant aspects of a model's grammatical knowledge in a given domain.
Submitted 19 September, 2019; v1 submitted 5 September, 2019;
originally announced September 2019.
-
Towards Realistic Practices In Low-Resource Natural Language Processing: The Development Set
Authors:
Katharina Kann,
Kyunghyun Cho,
Samuel R. Bowman
Abstract:
Development sets are impractical to obtain for real low-resource languages, since using all available data for training is often more effective. However, development sets are widely used in research papers that purport to deal with low-resource natural language processing (NLP). Here, we aim to answer the following questions: Does using a development set for early stopping in the low-resource setting influence results as compared to a more realistic alternative, where the number of training epochs is tuned on development languages? And does it lead to overestimation or underestimation of performance? We repeat multiple experiments from recent work on neural models for low-resource NLP and compare results for models obtained by training with and without development sets. On average over languages, absolute accuracy differs by up to 1.4%. However, for some languages and tasks, differences are as big as 18.0% accuracy. Our results highlight the importance of realistic experimental setups in the publication of low-resource NLP research results.
Submitted 14 September, 2019; v1 submitted 3 September, 2019;
originally announced September 2019.
-
Can Unconditional Language Models Recover Arbitrary Sentences?
Authors:
Nishant Subramani,
Samuel R. Bowman,
Kyunghyun Cho
Abstract:
Neural network-based generative language models like ELMo and BERT can work effectively as general purpose sentence encoders in text classification without further fine-tuning. Is it possible to adapt them in a similar way for use as general-purpose decoders? For this to be possible, it would need to be the case that for any target sentence of interest, there is some continuous representation that can be passed to the language model to cause it to reproduce that sentence. We set aside the difficult problem of designing an encoder that can produce such representations and, instead, ask directly whether such representations exist at all. To do this, we introduce a pair of effective, complementary methods for feeding representations into pretrained unconditional language models and a corresponding set of methods to map sentences into and out of this representation space, the reparametrized sentence space. We then investigate the conditions under which a language model can be made to generate a sentence through the identification of a point in such a space and find that it is possible to recover arbitrary sentences nearly perfectly with language models and representations of moderate size without modifying any model parameters.
Submitted 9 January, 2020; v1 submitted 10 July, 2019;
originally announced July 2019.
-
Human vs. Muppet: A Conservative Estimate of Human Performance on the GLUE Benchmark
Authors:
Nikita Nangia,
Samuel R. Bowman
Abstract:
The GLUE benchmark (Wang et al., 2019b) is a suite of language understanding tasks which has seen dramatic progress in the past year, with average performance moving from 70.0 at launch to 83.9, state of the art at the time of writing (May 24, 2019). Here, we measure human performance on the benchmark, in order to learn whether significant headroom remains for further progress. We provide a conservative estimate of human performance on the benchmark through crowdsourcing: Our annotators are non-experts who must learn each task from a brief set of instructions and 20 examples. In spite of limited training, these annotators robustly outperform the state of the art on six of the nine GLUE tasks and achieve an average score of 87.1. Given the fast pace of progress however, the headroom we observe is quite limited. To reproduce the data-poor setting that our annotators must learn in, we also train the BERT model (Devlin et al., 2019) in limited-data regimes, and conclude that low-resource sentence classification remains a challenge for modern neural network approaches to text understanding.
Submitted 1 June, 2019; v1 submitted 24 May, 2019;
originally announced May 2019.
-
What do you learn from context? Probing for sentence structure in contextualized word representations
Authors:
Ian Tenney,
Patrick Xia,
Berlin Chen,
Alex Wang,
Adam Poliak,
R Thomas McCoy,
Najoung Kim,
Benjamin Van Durme,
Samuel R. Bowman,
Dipanjan Das,
Ellie Pavlick
Abstract:
Contextualized representation models such as ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2018) have recently achieved state-of-the-art results on a diverse array of downstream NLP tasks. Building on recent token-level probing work, we introduce a novel edge probing task design and construct a broad suite of sub-sentence tasks derived from the traditional structured NLP pipeline. We probe word-level contextual representations from four recent models and investigate how they encode sentence structure across a range of syntactic, semantic, local, and long-range phenomena. We find that existing models trained on language modeling and translation produce strong representations for syntactic phenomena, but only offer comparably small improvements on semantic tasks over a non-contextual baseline.
Submitted 15 May, 2019;
originally announced May 2019.
-
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
Authors:
Alex Wang,
Yada Pruksachatkun,
Nikita Nangia,
Amanpreet Singh,
Julian Michael,
Felix Hill,
Omer Levy,
Samuel R. Bowman
Abstract:
In the last year, new models and methods for pretraining and transfer learning have driven striking performance improvements across a range of language understanding tasks. The GLUE benchmark, introduced a little over one year ago, offers a single-number metric that summarizes progress on a diverse set of such tasks, but performance on the benchmark has recently surpassed the level of non-expert humans, suggesting limited headroom for further research. In this paper we present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard. SuperGLUE is available at super.gluebenchmark.com.
Submitted 12 February, 2020; v1 submitted 1 May, 2019;
originally announced May 2019.
-
Probing What Different NLP Tasks Teach Machines about Function Word Comprehension
Authors:
Najoung Kim,
Roma Patel,
Adam Poliak,
Alex Wang,
Patrick Xia,
R. Thomas McCoy,
Ian Tenney,
Alexis Ross,
Tal Linzen,
Benjamin Van Durme,
Samuel R. Bowman,
Ellie Pavlick
Abstract:
We introduce a set of nine challenge tasks that test for the understanding of function words. These tasks are created by structurally mutating sentences from existing datasets to target the comprehension of specific types of function words (e.g., prepositions, wh-words). Using these probing tasks, we explore the effects of various pretraining objectives for sentence encoders (e.g., language modeling, CCG supertagging and natural language inference (NLI)) on the learned representations. Our results show that pretraining on language modeling performs the best on average across our probing tasks, supporting its widespread use for pretraining state-of-the-art NLP models, and CCG supertagging and NLI pretraining perform comparably. Overall, no pretraining objective dominates across the board, and our function word probing tasks highlight several intuitive differences between pretraining objectives, e.g., that NLI helps the comprehension of negation.
Submitted 7 August, 2019; v1 submitted 25 April, 2019;
originally announced April 2019.
-
Identifying and Reducing Gender Bias in Word-Level Language Models
Authors:
Shikha Bordia,
Samuel R. Bowman
Abstract:
Many text corpora exhibit socially problematic biases, which can be propagated or amplified in the models trained on such data. For example, doctor cooccurs more frequently with male pronouns than female pronouns. In this study we (i) propose a metric to measure gender bias; (ii) measure bias in a text corpus and the text generated from a recurrent neural network language model trained on the text corpus; (iii) propose a regularization loss term for the language model that minimizes the projection of encoder-trained embeddings onto an embedding subspace that encodes gender; (iv) finally, evaluate efficacy of our proposed method on reducing gender bias. We find this regularization method to be effective in reducing gender bias up to an optimal weight assigned to the loss term, beyond which the model becomes unstable as the perplexity increases. We replicate this study on three training corpora---Penn Treebank, WikiText-2, and CNN/Daily Mail---resulting in similar conclusions.
Submitted 5 April, 2019;
originally announced April 2019.
-
On Measuring Social Biases in Sentence Encoders
Authors:
Chandler May,
Alex Wang,
Shikha Bordia,
Samuel R. Bowman,
Rachel Rudinger
Abstract:
The Word Embedding Association Test shows that GloVe and word2vec word embeddings exhibit human-like implicit biases based on gender, race, and other social constructs (Caliskan et al., 2017). Meanwhile, research on learning reusable text representations has begun to explore sentence-level texts, with some sentence encoders seeing enthusiastic adoption. Accordingly, we extend the Word Embedding Association Test to measure bias in sentence encoders. We then test several sentence encoders, including state-of-the-art methods such as ELMo and BERT, for the social biases studied in prior work and two important biases that are difficult or impossible to test at the word level. We observe mixed results including suspicious patterns of sensitivity that suggest the test's assumptions may not hold in general. We conclude by proposing directions for future work on measuring bias in sentence encoders.
Submitted 25 March, 2019;
originally announced March 2019.
-
Linguistic Analysis of Pretrained Sentence Encoders with Acceptability Judgments
Authors:
Alex Warstadt,
Samuel R. Bowman
Abstract:
Recent work on evaluating grammatical knowledge in pretrained sentence encoders gives a fine-grained view of a small number of phenomena. We introduce a new analysis dataset that also has broad coverage of linguistic phenomena. We annotate the development set of the Corpus of Linguistic Acceptability (CoLA; Warstadt et al., 2018) for the presence of 13 classes of syntactic phenomena including various forms of argument alternations, movement, and modification. We use this analysis set to investigate the grammatical knowledge of three pretrained encoders: BERT (Devlin et al., 2018), GPT (Radford et al., 2018), and the BiLSTM baseline from Warstadt et al. We find that these models have a strong command of complex or non-canonical argument structures like ditransitives (Sue gave Dan a book) and passives (The book was read). Sentences with long distance dependencies like questions (What do you think I ate?) challenge all models, but for these, BERT and GPT have a distinct advantage over the baseline. We conclude that recent sentence encoders, despite showing near-human performance on acceptability classification overall, still fail to make fine-grained grammaticality distinctions for many complex syntactic structures.
Submitted 21 May, 2020; v1 submitted 10 January, 2019;
originally announced January 2019.
-
Can You Tell Me How to Get Past Sesame Street? Sentence-Level Pretraining Beyond Language Modeling
Authors:
Alex Wang,
Jan Hula,
Patrick Xia,
Raghavendra Pappagari,
R. Thomas McCoy,
Roma Patel,
Najoung Kim,
Ian Tenney,
Yinghui Huang,
Katherin Yu,
Shuning Jin,
Berlin Chen,
Benjamin Van Durme,
Edouard Grave,
Ellie Pavlick,
Samuel R. Bowman
Abstract:
Natural language understanding has recently seen a surge of progress with the use of sentence encoders like ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2019) which are pretrained on variants of language modeling. We conduct the first large-scale systematic study of candidate pretraining tasks, comparing 19 different tasks both as alternatives and complements to language modeling. Our primary results support the use of language modeling, especially when combined with pretraining on additional labeled-data tasks. However, our results are mixed across pretraining tasks and show some concerning trends: In ELMo's pretrain-then-freeze paradigm, random baselines are worryingly strong and results vary strikingly across target tasks. In addition, fine-tuning BERT on an intermediate task often negatively impacts downstream transfer. In a more positive trend, we see modest gains from multitask training, suggesting the development of more sophisticated multitask and transfer learning techniques as an avenue for further research.
Submitted 22 July, 2019; v1 submitted 27 December, 2018;
originally announced December 2018.
-
Long-range depth imaging using a single-photon detector array and non-local data fusion
Authors:
Susan Chan,
Abderrahim Halimi,
Feng Zhu,
Istvan Gyongy,
Robert K. Henderson,
Richard Bowman,
Steve McLaughlin,
Gerald S. Buller,
Jonathan Leach
Abstract:
The ability to measure and record high-resolution depth images at long stand-off distances is important for a wide range of applications, including connected and automotive vehicles, defense and security, and agriculture and mining. In LIDAR (light detection and ranging) applications, single-photon sensitive detection is an emerging approach, offering high sensitivity to light and picosecond temporal resolution, and consequently excellent surface-to-surface resolution. The use of large format CMOS single-photon detector arrays provides high spatial resolution and allows the timing information to be acquired simultaneously across many pixels. In this work, we combine state-of-the-art single-photon detector array technology with non-local data fusion to generate high resolution three-dimensional depth information of long-range targets. The system is based on a visible pulsed illumination system at 670 nm and a 240 × 320 pixel array sensor, achieving sub-centimeter precision in all three spatial dimensions at a distance of 150 meters. The non-local data fusion combines information from an optical image with sparse sampling of the single-photon array data, providing accurate depth information at low signature regions of the target.
Submitted 11 December, 2018;
originally announced December 2018.
-
Verb Argument Structure Alternations in Word and Sentence Embeddings
Authors:
Katharina Kann,
Alex Warstadt,
Adina Williams,
Samuel R. Bowman
Abstract:
Verbs occur in different syntactic environments, or frames. We investigate whether artificial neural networks encode grammatical distinctions necessary for inferring the idiosyncratic frame-selectional properties of verbs. We introduce five datasets, collectively called FAVA, containing in aggregate nearly 10k sentences labeled for grammatical acceptability, illustrating different verbal argument structure alternations. We then test whether models can distinguish acceptable English verb-frame combinations from unacceptable ones using a sentence embedding alone. For converging evidence, we further construct LaVA, a corresponding word-level dataset, and investigate whether the same syntactic features can be extracted from word embeddings. Our models perform reliable classifications for some verbal alternations but not others, suggesting that while these representations do encode fine-grained lexical information, it is incomplete or can be hard to extract. Further, differences between the word- and sentence-level models show that some information present in word embeddings is not passed on to the down-stream sentence embeddings.
Submitted 26 November, 2018;
originally announced November 2018.
-
Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks
Authors:
Jason Phang,
Thibault Févry,
Samuel R. Bowman
Abstract:
Pretraining sentence encoders with language modeling and related unsupervised tasks has recently been shown to be very effective for language understanding tasks. By supplementing language model-style pretraining with further training on data-rich supervised tasks, such as natural language inference, we obtain additional performance improvements on the GLUE benchmark. Applying supplementary training on BERT (Devlin et al., 2018), we attain a GLUE score of 81.8---the state of the art (as of 02/24/2019) and a 1.4 point improvement over BERT. We also observe reduced variance across random restarts in this setting. Our approach yields similar improvements when applied to ELMo (Peters et al., 2018a) and Radford et al. (2018)'s model. In addition, the benefits of supplementary training are particularly pronounced in data-constrained regimes, as we show in experiments with artificially limited training data.
Submitted 27 February, 2019; v1 submitted 2 November, 2018;
originally announced November 2018.
-
Language Modeling Teaches You More Syntax than Translation Does: Lessons Learned Through Auxiliary Task Analysis
Authors:
Kelly W. Zhang,
Samuel R. Bowman
Abstract:
Recent work using auxiliary prediction task classifiers to investigate the properties of LSTM representations has begun to shed light on why pretrained representations, like ELMo (Peters et al., 2018) and CoVe (McCann et al., 2017), are so beneficial for neural language understanding models. We still, though, do not yet have a clear understanding of how the choice of pretraining objective affects the type of linguistic information that models learn. With this in mind, we compare four objectives---language modeling, translation, skip-thought, and autoencoding---on their ability to induce syntactic and part-of-speech information. We make a fair comparison between the tasks by holding constant the quantity and genre of the training data, as well as the LSTM architecture. We find that representations from language models consistently perform best on our syntactic auxiliary prediction tasks, even when trained on relatively small amounts of data. These results suggest that language modeling may be the best data-rich pretraining task for transfer learning applications requiring syntactic information. We also find that the representations from randomly-initialized, frozen LSTMs perform strikingly well on our syntactic auxiliary tasks, but this effect disappears when the amount of training data for the auxiliary tasks is reduced.
Submitted 7 January, 2019; v1 submitted 26 September, 2018;
originally announced September 2018.
-
XNLI: Evaluating Cross-lingual Sentence Representations
Authors:
Alexis Conneau,
Guillaume Lample,
Ruty Rinott,
Adina Williams,
Samuel R. Bowman,
Holger Schwenk,
Veselin Stoyanov
Abstract:
State-of-the-art natural language processing systems rely on supervision in the form of annotated data to learn competent models. These models are generally trained on data in a single language (usually English), and cannot be directly used beyond that language. Since collecting data in every language is not realistic, there has been a growing interest in cross-lingual language understanding (XLU) and low-resource cross-language transfer. In this work, we construct an evaluation set for XLU by extending the development and test sets of the Multi-Genre Natural Language Inference Corpus (MultiNLI) to 15 languages, including low-resource languages such as Swahili and Urdu. We hope that our dataset, dubbed XNLI, will catalyze research in cross-lingual sentence understanding by providing an informative standard evaluation task. In addition, we provide several baselines for multilingual sentence understanding, including two based on machine translation systems, and two that use parallel data to train aligned multilingual bag-of-words and LSTM encoders. We find that XNLI represents a practical and challenging evaluation suite, and that directly translating the test data yields the best performance among available baselines.
Submitted 13 September, 2018;
originally announced September 2018.
-
Grammar Induction with Neural Language Models: An Unusual Replication
Authors:
Phu Mon Htut,
Kyunghyun Cho,
Samuel R. Bowman
Abstract:
A substantial thread of recent work on latent tree learning has attempted to develop neural network models with parse-valued latent variables and train them on non-parsing tasks, in the hope of having them discover interpretable tree structure. In a recent paper, Shen et al. (2018) introduce such a model and report near-state-of-the-art results on the target task of language modeling, and the first strong latent tree learning result on constituency parsing. In an attempt to reproduce these results, we discover issues that make the original results hard to trust, including tuning and even training on what is effectively the test set. Here, we attempt to reproduce these results in a fair experiment and to extend them to two new datasets. We find that the results of this work are robust: All variants of the model under study outperform all latent tree learning baselines, and perform competitively with symbolic grammar induction systems. We find that this model represents the first empirical success for latent tree learning, and that neural network language modeling warrants further study as a setting for grammar induction.
Submitted 29 August, 2018;
originally announced August 2018.
-
Neural Network Acceptability Judgments
Authors:
Alex Warstadt,
Amanpreet Singh,
Samuel R. Bowman
Abstract:
This paper investigates the ability of artificial neural networks to judge the grammatical acceptability of a sentence, with the goal of testing their linguistic competence. We introduce the Corpus of Linguistic Acceptability (CoLA), a set of 10,657 English sentences labeled as grammatical or ungrammatical from published linguistics literature. As baselines, we train several recurrent neural network models on acceptability classification, and find that our models outperform unsupervised models by Lau et al. (2016) on CoLA. Error analysis on specific grammatical phenomena reveals that both Lau et al.'s models and ours learn systematic generalizations like subject-verb-object order. However, all models we test perform far below human level on a wide range of grammatical constructions.
Submitted 1 October, 2019; v1 submitted 31 May, 2018;
originally announced May 2018.
-
Choice of adaptive sampling strategy impacts state discovery, transition probabilities, and the apparent mechanism of conformational changes
Authors:
Maxwell I. Zimmerman,
Justin R. Porter,
Xianqiang Sun,
Roseane R. Silva,
Gregory R. Bowman
Abstract:
Interest in equilibrium-based sampling methods has grown with recent advances in computational hardware and Markov state modeling (MSM) methods, yet outstanding questions remain that hinder widespread adoption. Namely, how do sampling strategies explore conformational space and how might this influence predictions? Here, we seek to answer these questions for four commonly used sampling methods: 1) a long simulation, 2) many short simulations, 3) adaptive sampling, and 4) FAST. We first develop a theoretical framework for analytically calculating the probability of discovering states and uncover the drastic effects of varying the number and length of simulations. We then use kinetic Monte Carlo simulations on a variety of physically inspired landscapes to characterize state discovery and transition pathways. Consistently, we find that FAST simulations discover target states with the highest probability and traverse realistic pathways. Furthermore, we uncover the pathology that short parallel simulations sometimes predict an incorrect transition pathway by crossing large energy barriers that long simulations would typically circumnavigate, which we refer to as pathway tunneling. To protect against tunneling, we introduce FAST-string, which samples along the highest-flux transition paths to refine an MSM's transition probabilities and discriminate between competing pathways. Additionally, we compare MSM estimators in describing thermodynamics and kinetics. For adaptive sampling, we recommend normalizing the transition counts out of each state after adding pseudo-counts to avoid creating sources or sinks. Lastly, we evaluate our insights from simple landscapes with all-atom molecular dynamics simulations of the folding of the λ-repressor protein. We find that FAST-contacts predicts the same folding pathway as long simulations but with orders of magnitude less simulation time.
Submitted 11 May, 2018;
originally announced May 2018.