-
A Parameterized Nonlinear Magnetic Equivalent Circuit for Design and Fast Analysis of Radial Flux Magnetic Gears
Authors:
Danial Kazemikia,
Matthew Gardner
Abstract:
Magnetic gears offer advantages over mechanical gears, including contactless power transfer, but require robust analysis tools for optimization and commercialization. This study proposes a rapid and accurate 2D nonlinear magnetic equivalent circuit (MEC) model for radial flux magnetic gears (RFMG). The model, featuring a parameterized gear geometry and adjustable flux tube distribution, accommodates nonlinear effects like magnetic saturation while maintaining quick simulation times. Comparison with a nonlinear finite element analysis (FEA) model demonstrates the MEC's accuracy in torque and flux density predictions across diverse designs. Additionally, a parametric optimization study of 140,000 designs confirms the MEC's high accuracy, achieving close agreement with FEA torque predictions, with simulations running up to 100 times faster. Finally, the MEC shows good agreement with 2D FEA for a prototype RFMG.
Submitted 12 June, 2024;
originally announced June 2024.
-
Towards establishing best practice in the analysis of hydrogen and deuterium by atom probe tomography
Authors:
Baptiste Gault,
Aparna Saksena,
Xavier Sauvage,
Paul Bagot,
Leonardo S. Aota,
Jonas Arlt,
Lisa T. Belkacemi,
Torben Boll,
Yi-Sheng Chen,
Luke Daly,
Milos B. Djukic,
James O. Douglas,
Maria J. Duarte,
Peter J. Felfer,
Richard G. Forbes,
**g Fu,
Hazel M. Gardner,
Ryota Gemma,
Stephan S. A. Gerstl,
Yilun Gong,
Guillaume Hachet,
Severin Jakob,
Benjamin M. Jenkins,
Megan E. Jones,
Heena Khanchandani
, et al. (20 additional authors not shown)
Abstract:
As hydrogen is touted as a key player in the decarbonization of modern society, it is critical to enable quantitative H analysis at high spatial resolution, if possible at the atomic scale. Indeed, H has a known deleterious impact on the mechanical properties (strength, ductility, toughness) of most materials that can hinder their use as part of the infrastructure of a hydrogen-based economy. Enabling H mapping, including local hydrogen concentration analyses at specific microstructural features, is essential for understanding the multiple ways that H affects the properties of materials, including, for instance, embrittlement mechanisms and their synergies. Spatial mapping and quantification of hydrogen isotopes are also essential to accurately predict the tritium inventory of future fusion power plants and so ensure their safe and efficient operation. Atom probe tomography (APT) has the intrinsic capabilities for detecting hydrogen (H) and deuterium (D), and in principle the capacity for performing quantitative mapping of H within a material's microstructure. Yet the accuracy and precision of H analysis by APT remain affected by the influence of residual hydrogen from the ultra-high vacuum chamber that can obscure the signal of H from within the material, along with a complex field evaporation behavior. The present article reports the essence of discussions at a focused workshop held at the Max-Planck Institute for Sustainable Materials in April 2024. The workshop was organized to pave the way to establishing best practices in reporting APT data for the analysis of H. We first summarize the key aspects of the intricacies of H analysis by APT and propose a path for better reporting of the relevant data to support interpretation of APT-based H analysis in materials.
Submitted 21 May, 2024;
originally announced May 2024.
-
Advancing Real-time Pandemic Forecasting Using Large Language Models: A COVID-19 Case Study
Authors:
Hongru Du,
Jianan Zhao,
Yang Zhao,
Shaochong Xu,
Xihong Lin,
Yiran Chen,
Lauren M. Gardner,
Hao Frank Yang
Abstract:
Forecasting the short-term spread of an ongoing disease outbreak is a formidable challenge due to the complexity of contributing factors, some of which can be characterized through interlinked, multi-modality variables such as epidemiological time series data, viral biology, population demographics, and the intersection of public policy and human behavior. Existing forecasting model frameworks struggle with the multifaceted nature of relevant data and robust results translation, which hinders their performance and the provision of actionable insights for public health decision-makers. Our work introduces PandemicLLM, a novel framework with multi-modal Large Language Models (LLMs) that reformulates real-time forecasting of disease spread as a text reasoning problem, with the ability to incorporate real-time, complex, non-numerical information that was previously unattainable in traditional forecasting models. This approach, through a unique AI-human cooperative prompt design and time series representation learning, encodes multi-modal data for LLMs. The model is applied to the COVID-19 pandemic, and trained to utilize textual public health policies, genomic surveillance, spatial, and epidemiological time series data, and is subsequently tested across all 50 states of the U.S. Empirically, PandemicLLM is shown to be a high-performing pandemic forecasting framework that effectively captures the impact of emerging variants and can provide timely and accurate predictions. The proposed PandemicLLM opens avenues for incorporating various pandemic-related data in heterogeneous formats and exhibits performance benefits over existing models. This study illuminates the potential of adapting LLMs and representation learning to enhance pandemic forecasting, illustrating how AI innovations can strengthen pandemic responses and crisis management in the future.
Submitted 10 April, 2024;
originally announced April 2024.
-
High angular momentum coupling for enhanced Rydberg-atom sensing in the VHF band
Authors:
Nikunjkumar Prajapati,
Jakob W. Kunzler,
Alexandra B. Artusio-Glimpse,
Andrew Rotunno,
Samuel Berweger,
Matthew T. Simons,
Christopher L. Holloway,
Chad M. Gardner,
Michael S. Mcbeth,
Robert A. Younts
Abstract:
Recent advances in Rydberg atom electrometry detail promising applications in radio frequency (RF) communications. Presently, most applications use carrier frequencies greater than 1 GHz where resonant Autler-Townes splitting provides the highest sensitivity. This letter documents a series of experiments with Rydberg atomic sensors to collect and process waveforms from the automated identification system (AIS) used in maritime navigation in the Very High Frequency (VHF) band. Detection in this band is difficult with conventional resonant Autler-Townes based Rydberg sensing and requires a new approach. We show the results from a new method called High Angular Momentum Matching Excited Raman (HAMMER), which enhances low frequency detection and exhibits superior sensitivity compared to the traditional AC Stark effect. From measurements of electromagnetically induced transparency (EIT) in rubidium and cesium vapor cells, we show the relationship between incident electric field strength and observed signal-to-noise ratio and find that the sensitivity of the HAMMER scheme in rubidium achieved an equivalent single VHF tone sensitivity of $100~\mathrm{\mu V/m/\sqrt{Hz}}$. With these results, we estimate the usable range of the atomic vapor cell antenna for AIS waveforms given current technology and detection techniques.
Submitted 3 October, 2023;
originally announced October 2023.
-
Coverage-based Example Selection for In-Context Learning
Authors:
Shivanshu Gupta,
Matt Gardner,
Sameer Singh
Abstract:
In-context learning (ICL), the ability of large language models to perform novel tasks by conditioning on a prompt with a few task examples, requires these examples to be informative about the test instance. The standard approach of independently ranking and selecting the most similar examples selects redundant examples while omitting important information. In this work, we show that BERTScore-Recall (BSR) selects better examples that demonstrate more of the salient aspects, e.g. reasoning patterns, of the test input. We further extend BSR and many standard metrics to easily optimizable set-level metrics, giving still better coverage of those salient aspects. On 15 datasets spanning 6 tasks and with 7 diverse LLMs, we show that (1) BSR is the superior metric for in-context example selection across the board, and (2) for compositional tasks, set selection using Set-BSR outperforms independent ranking by up to 17 points on average and, despite being training-free, surpasses methods that leverage task or LLM-specific training.
Submitted 6 November, 2023; v1 submitted 24 May, 2023;
originally announced May 2023.
-
UQpy v4.1: Uncertainty Quantification with Python
Authors:
Dimitrios Tsapetis,
Michael D. Shields,
Dimitris G. Giovanis,
Audrey Olivier,
Lukas Novak,
Promit Chakroborty,
Himanshu Sharma,
Mohit Chauhan,
Katiana Kontolati,
Lohit Vandanapu,
Dimitrios Loukrezis,
Michael Gardner
Abstract:
This paper presents the latest improvements introduced in Version 4 of the UQpy (Uncertainty Quantification with Python) library. In the latest version, the code was restructured to conform with the latest Python coding conventions and refactored to simplify previously tightly coupled features and to improve its extensibility and modularity. To improve the robustness of UQpy, software engineering best practices were adopted. A new software development workflow significantly improved collaboration between team members, and continuous integration and automated testing ensured the robustness and reliability of software performance. Continuous deployment of UQpy allowed its automated packaging and distribution in a system-agnostic format via multiple channels, while a Docker image enables the use of the toolbox regardless of operating system limitations.
Submitted 16 May, 2023;
originally announced May 2023.
-
A scalable approach to undergraduate research in physics
Authors:
Amanda L. Baxter,
Rafael F. Lang,
Craig Zywicki,
Stephanie M. Gardner,
Abigail Kopec,
Andreas Jung
Abstract:
Course-based undergraduate research experiences (CUREs) increase students' access to research. This lesson plan describes an interdisciplinary CURE developed to be able to involve over 60 students per semester in original research using data from large particle physics experiments and telescopes, although the methods described can easily be adopted by other areas of data science. Students are divided into research teams of four, which greatly leverages the instruction time needed for mentoring, while increasing research productivity by creating accountability amongst the students. This CURE provides a strong framework, which minimizes barriers that students may perceive. This helps increase the number of students that benefit from a research opportunity while providing guidance and certainty. Through this CURE, students can engage in original research with the potential for publication-quality results, develop communication skills in various modes, and gain confidence in their performance as a scientist.
Submitted 9 March, 2023;
originally announced March 2023.
-
Successive Prompting for Decomposing Complex Questions
Authors:
Dheeru Dua,
Shivanshu Gupta,
Sameer Singh,
Matt Gardner
Abstract:
Answering complex questions that require making latent decisions is a challenging task, especially when limited supervision is available. Recent works leverage the capabilities of large language models (LMs) to perform complex question answering in a few-shot setting by demonstrating how to output intermediate rationalizations while solving the complex question in a single pass. We introduce "Successive Prompting", where we iteratively break down a complex task into a simple task, solve it, and then repeat the process until we get the final solution. Successive prompting decouples the supervision for decomposing complex questions from the supervision for answering simple questions, allowing us to (1) have multiple opportunities to query in-context examples at each reasoning step, (2) learn question decomposition separately from question answering, including using synthetic data, and (3) use bespoke (fine-tuned) components for reasoning steps where a large LM does not perform well. The intermediate supervision is typically manually written, which can be expensive to collect. We introduce a way to generate a synthetic dataset which can be used to bootstrap a model's ability to decompose and answer intermediate questions. Our best model (with successive prompting) achieves an improvement of ~5% absolute F1 on a few-shot version of the DROP dataset when compared with a state-of-the-art model with the same supervision.
Submitted 8 December, 2022;
originally announced December 2022.
-
CONDAQA: A Contrastive Reading Comprehension Dataset for Reasoning about Negation
Authors:
Abhilasha Ravichander,
Matt Gardner,
Ana Marasović
Abstract:
The full power of human language-based communication cannot be realized without negation. All human languages have some form of negation. Despite this, negation remains a challenging phenomenon for current natural language understanding systems. To facilitate the future development of models that can process negation effectively, we present CONDAQA, the first English reading comprehension dataset which requires reasoning about the implications of negated statements in paragraphs. We collect paragraphs with diverse negation cues, then have crowdworkers ask questions about the implications of the negated statement in the passage. We also have workers make three kinds of edits to the passage -- paraphrasing the negated statement, changing the scope of the negation, and reversing the negation -- resulting in clusters of question-answer pairs that are difficult for models to answer with spurious shortcuts. CONDAQA features 14,182 question-answer pairs with over 200 unique negation cues and is challenging for current state-of-the-art models. The best performing model on CONDAQA (UnifiedQA-v2-3b) achieves only 42% on our consistency metric, well below human performance which is 81%. We release our dataset, along with fully-finetuned, few-shot, and zero-shot evaluations, to facilitate the development of future NLP methods that work on negated language.
Submitted 1 November, 2022;
originally announced November 2022.
-
When to Use Multi-Task Learning vs Intermediate Fine-Tuning for Pre-Trained Encoder Transfer Learning
Authors:
Orion Weller,
Kevin Seppi,
Matt Gardner
Abstract:
Transfer learning (TL) in natural language processing (NLP) has seen a surge of interest in recent years, as pre-trained models have shown an impressive ability to transfer to novel tasks. Three main strategies have emerged for making use of multiple supervised datasets during fine-tuning: training on an intermediate task before training on the target task (STILTs), using multi-task learning (MTL) to train jointly on a supplementary task and the target task (pairwise MTL), or simply using MTL to train jointly on all available datasets (MTL-ALL). In this work, we compare all three TL methods in a comprehensive analysis on the GLUE dataset suite. We find that there is a simple heuristic for when to use one of these techniques over the other: pairwise MTL is better than STILTs when the target task has fewer instances than the supporting task and vice versa. We show that this holds true in more than 92% of applicable cases on the GLUE dataset and validate this hypothesis with experiments varying dataset size. The simplicity and effectiveness of this heuristic are surprising and warrant additional exploration by the TL community. Furthermore, we find that MTL-ALL is worse than the pairwise methods in almost every case. We hope this study will aid others as they choose between TL methods for NLP tasks.
Submitted 17 May, 2022;
originally announced May 2022.
-
ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension
Authors:
Sanjay Subramanian,
William Merrill,
Trevor Darrell,
Matt Gardner,
Sameer Singh,
Anna Rohrbach
Abstract:
Training a referring expression comprehension (ReC) model for a new visual domain requires collecting referring expressions, and potentially corresponding bounding boxes, for images in the domain. While large-scale pre-trained models are useful for image classification across domains, it remains unclear if they can be applied in a zero-shot manner to more complex tasks like ReC. We present ReCLIP, a simple but strong zero-shot baseline that repurposes CLIP, a state-of-the-art large-scale model, for ReC. Motivated by the close connection between ReC and CLIP's contrastive pre-training objective, the first component of ReCLIP is a region-scoring method that isolates object proposals via cropping and blurring, and passes them to CLIP. However, through controlled experiments on a synthetic dataset, we find that CLIP is largely incapable of performing spatial reasoning off-the-shelf. Thus, the second component of ReCLIP is a spatial relation resolver that handles several types of spatial relations. We reduce the gap between zero-shot baselines from prior work and supervised models by as much as 29% on RefCOCOg, and on RefGTA (video game imagery), ReCLIP's relative improvement over supervised ReC models trained on real images is 8%.
Submitted 2 May, 2022; v1 submitted 12 April, 2022;
originally announced April 2022.
-
Generating Data to Mitigate Spurious Correlations in Natural Language Inference Datasets
Authors:
Yuxiang Wu,
Matt Gardner,
Pontus Stenetorp,
Pradeep Dasigi
Abstract:
Natural language processing models often exploit spurious correlations between task-independent features and labels in datasets to perform well only within the distributions they are trained on, while not generalising to different task distributions. We propose to tackle this problem by generating a debiased version of a dataset, which can then be used to train a debiased, off-the-shelf model, by simply replacing its training data. Our approach consists of 1) a method for training data generators to generate high-quality, label-consistent data samples; and 2) a filtering mechanism for removing data points that contribute to spurious correlations, measured in terms of z-statistics. We generate debiased versions of the SNLI and MNLI datasets, and we evaluate on a large suite of debiased, out-of-distribution, and adversarial test sets. Results show that models trained on our debiased datasets generalise better than those trained on the original datasets in all settings. On the majority of the datasets, our method outperforms or performs comparably to previous state-of-the-art debiasing strategies, and when combined with an orthogonal technique, product-of-experts, it improves further and outperforms previous best results of SNLI-hard and MNLI-hard.
Submitted 24 March, 2022;
originally announced March 2022.
-
Structurally Diverse Sampling for Sample-Efficient Training and Comprehensive Evaluation
Authors:
Shivanshu Gupta,
Sameer Singh,
Matt Gardner
Abstract:
A growing body of research has demonstrated the inability of NLP models to generalize compositionally and has tried to alleviate it through specialized architectures, training schemes, and data augmentation, among other approaches. In this work, we study a different approach: training on instances with diverse structures. We propose a model-agnostic algorithm for subsampling such sets of instances from a labeled instance pool with structured outputs. Evaluating on both compositional template splits and traditional IID splits of 5 semantic parsing datasets of varying complexity, we show that structurally diverse training using our algorithm leads to comparable or better generalization than prior algorithms in 9 out of 10 dataset-split type pairs. In general, we find structural diversity to consistently improve sample efficiency compared to random train sets. Moreover, we show that structurally diverse sampling yields comprehensive test sets that are a lot more challenging than IID test sets. Finally, we provide two explanations for improved generalization from diverse train sets: 1) improved coverage of output substructures, and 2) a reduction in spurious correlations between these substructures.
Submitted 1 November, 2022; v1 submitted 16 March, 2022;
originally announced March 2022.
-
Identifying Oscillations Injected by Inverter-Based Solar Energy Sources
Authors:
Chen Wang,
Luigi Vanfretti,
Chetan Mishra,
Kevin D. Jones,
R. Matthew Gardner
Abstract:
Inverter-based solar energy sources are becoming widely integrated into modern power systems. However, their impacts on the system in the frequency domain are rarely investigated at a higher frequency range than conventional electromechanical oscillations. This paper presents evidence of the emergence of an oscillation mode injected by inverter-based solar energy sources in Dominion Energy's service territory. This new mode was recognized from the analysis of real-world ambient synchrophasor and point-of-wave data. The analysis was performed by developing customized synchrophasor analysis tools deployed on the PredictiveGrid™ platform implemented at Dominion Energy. Herein, we describe and illustrate the preliminary analysis results acquired from spectrogram observations, power spectral density plots, and mode shape estimation. The emergence and propagation of this new mode in Dominion Energy's footprint is illustrated using a heatmap based on a proposed frequency component energy metric, which helps to assess this oscillation's spread and impact.
Submitted 23 February, 2022;
originally announced February 2022.
-
Impact of Pretraining Term Frequencies on Few-Shot Reasoning
Authors:
Yasaman Razeghi,
Robert L. Logan IV,
Matt Gardner,
Sameer Singh
Abstract:
Pretrained Language Models (LMs) have demonstrated ability to perform numerical reasoning by extrapolating from a few examples in few-shot settings. However, the extent to which this extrapolation relies on robust reasoning is unclear. In this paper, we investigate how well these models reason with terms that are less frequent in the pretraining data. In particular, we examine the correlations between the model performance on test instances and the frequency of terms from those instances in the pretraining data. We measure the strength of this correlation for a number of GPT-based language models (pretrained on the Pile dataset) on various numerical deduction tasks (e.g., arithmetic and unit conversion). Our results consistently demonstrate that models are more accurate on instances whose terms are more prevalent, in some cases above 70% (absolute) more accurate on the top 10% frequent terms in comparison to the bottom 10%. Overall, although LMs exhibit strong performance at few-shot numerical reasoning tasks, our results raise the question of how much models actually generalize beyond pretraining data, and we encourage researchers to take the pretraining data into account when interpreting evaluation results.
Submitted 23 May, 2022; v1 submitted 15 February, 2022;
originally announced February 2022.
-
Evidentiality-guided Generation for Knowledge-Intensive NLP Tasks
Authors:
Akari Asai,
Matt Gardner,
Hannaneh Hajishirzi
Abstract:
Retrieval-augmented generation models have shown state-of-the-art performance across many knowledge-intensive NLP tasks such as open question answering and fact verification. These models are trained to generate the final output given the retrieved passages, which can be irrelevant to the original query, leading to learning spurious cues or answer memorization. This work introduces a method to incorporate the evidentiality of passages -- whether a passage contains correct evidence to support the output -- into training the generator. We introduce a multi-task learning framework to jointly generate the final output and predict the evidentiality of each passage, leveraging a new task-agnostic method to obtain silver evidentiality labels for supervision. Our experiments on five datasets across three knowledge-intensive tasks show that our new evidentiality-guided generator significantly outperforms its direct counterpart with the same-size model and advances the state of the art on FaVIQ-Ambig. We attribute these improvements to both the auxiliary multi-task learning and silver evidentiality mining techniques.
Submitted 14 May, 2022; v1 submitted 16 December, 2021;
originally announced December 2021.
-
COVR: A test-bed for Visually Grounded Compositional Generalization with real images
Authors:
Ben Bogin,
Shivanshu Gupta,
Matt Gardner,
Jonathan Berant
Abstract:
While interest in models that generalize at test time to new compositions has risen in recent years, benchmarks in the visually-grounded domain have thus far been restricted to synthetic images. In this work, we propose COVR, a new test-bed for visually-grounded compositional generalization with real images. To create COVR, we use real images annotated with scene graphs, and propose an almost fully automatic procedure for generating question-answer pairs along with a set of context images. COVR focuses on questions that require complex reasoning, including higher-order operations such as quantification and aggregation. Due to the automatic generation process, COVR facilitates the creation of compositional splits, where models at test time need to generalize to new concepts and compositions in a zero- or few-shot setting. We construct compositional splits using COVR and demonstrate a myriad of cases where state-of-the-art pre-trained language-and-vision models struggle to compositionally generalize.
Submitted 22 September, 2021;
originally announced September 2021.
-
Permanent Magnet Linear Generator Design for Surface Riding Wave Energy Converters
Authors:
Farid Naghavi,
Shrikesh Sheshaprasad,
Matthew Gardner,
Aghamarshana Meduri,
HeonYong Kang,
Hamid Toliyat
Abstract:
This paper describes the detailed analysis for the design of a linear generator developed for a Surface Riding Wave Energy Converter (SR-WEC), which was designed to improve energy capture over a wider range of sea states. The study starts with an analysis of the power take-off (PTO) control strategy to harness the maximum output power from given sea states. Passive, reactive, and discrete PTO control are explored. For the random wave excitation and limited sliding distance of the generator, the discrete strategy provides the highest average power output. The paper discusses the sizing requirement for the linear generator. Based on the force and power rating of the system and the application requirements, a slotless permanent magnet tubular generator is designed for the wave energy converter.
Submitted 18 August, 2021;
originally announced August 2021.
-
QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension
Authors:
Anna Rogers,
Matt Gardner,
Isabelle Augenstein
Abstract:
Alongside huge volumes of research on deep learning models in NLP in the recent years, there has been also much work on benchmark datasets needed to track modeling progress. Question answering and reading comprehension have been particularly prolific in this regard, with over 80 new datasets appearing in the past two years. This study is the largest survey of the field to date. We provide an overview of the various formats and domains of the current resources, highlighting the current lacunae for future work. We further discuss the current classifications of "skills" that question answering/reading comprehension systems are supposed to acquire, and propose a new taxonomy. The supplementary materials survey the current multilingual resources and monolingual resources for languages other than English, and we discuss the implications of over-focusing on English. The study is aimed at both practitioners looking for pointers to the wealth of existing data, and at researchers working on new resources.
Submitted 19 September, 2022; v1 submitted 27 July, 2021;
originally announced July 2021.
-
Tailor: Generating and Perturbing Text with Semantic Controls
Authors:
Alexis Ross,
Tongshuang Wu,
Hao Peng,
Matthew E. Peters,
Matt Gardner
Abstract:
Controlled text perturbation is useful for evaluating and improving model generalizability. However, current techniques rely on training a model for every target perturbation, which is expensive and hard to generalize. We present Tailor, a semantically-controlled text generation system. Tailor builds on a pretrained seq2seq model and produces textual outputs conditioned on control codes derived from semantic representations. We craft a set of operations to modify the control codes, which in turn steer generation towards targeted attributes. These operations can be further composed into higher-level ones, allowing for flexible perturbation strategies. We demonstrate the effectiveness of these perturbations in multiple applications. First, we use Tailor to automatically create high-quality contrast sets for four distinct natural language processing (NLP) tasks. These contrast sets contain fewer spurious artifacts and are complementary to manually annotated ones in their lexical diversity. Second, we show that Tailor perturbations can improve model generalization through data augmentation. Perturbing just 2% of training data leads to a 5.8-point gain on an NLI challenge set measuring reliance on syntactic heuristics.
Submitted 17 March, 2022; v1 submitted 15 July, 2021;
originally announced July 2021.
-
Enforcing Consistency in Weakly Supervised Semantic Parsing
Authors:
Nitish Gupta,
Sameer Singh,
Matt Gardner
Abstract:
The predominant challenge in weakly supervised semantic parsing is that of spurious programs that evaluate to correct answers for the wrong reasons. Prior work uses elaborate search strategies to mitigate the prevalence of spurious programs; however, they typically consider only one input at a time. In this work we explore the use of consistency between the output programs for related inputs to reduce the impact of spurious programs. We bias the program search (and thus the model's training signal) towards programs that map the same phrase in related inputs to the same sub-parts in their respective programs. Additionally, we study the importance of designing logical formalisms that facilitate this kind of consistency-based training. We find that a more consistent formalism leads to improved model performance even without consistency-based training. When combined together, these two insights lead to a 10% absolute improvement over the best prior result on the Natural Language Visual Reasoning dataset.
Submitted 12 July, 2021;
originally announced July 2021.
-
A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers
Authors:
Pradeep Dasigi,
Kyle Lo,
Iz Beltagy,
Arman Cohan,
Noah A. Smith,
Matt Gardner
Abstract:
Readers of academic research papers often read with the goal of answering specific questions. Question Answering systems that can answer those questions can make consumption of the content much more efficient. However, building such tools requires data that reflect the difficulty of the task arising from complex reasoning about claims made in multiple parts of a paper. In contrast, existing information-seeking question answering datasets usually contain questions about generic factoid-type information. We therefore present QASPER, a dataset of 5,049 questions over 1,585 Natural Language Processing papers. Each question is written by an NLP practitioner who read only the title and abstract of the corresponding paper, and the question seeks information present in the full text. The questions are then answered by a separate set of NLP practitioners who also provide supporting evidence to answers. We find that existing models that do well on other QA tasks do not perform well on answering these questions, underperforming humans by at least 27 F1 points when answering them from entire papers, motivating further research in document-grounded, information-seeking QA, which our dataset is designed to facilitate.
Submitted 6 May, 2021;
originally announced May 2021.
-
On Generating and Labeling Network Traffic with Realistic, Self-Propagating Malware
Authors:
Molly Buchanan,
Jeffrey W. Collyer,
Jack W. Davidson,
Saikat Dey,
Mark Gardner,
Jason D. Hiser,
Jeffry Lang,
Alastair Nottingham,
Alina Oprea
Abstract:
Research and development of techniques which detect or remediate malicious network activity require access to diverse, realistic, contemporary data sets containing labeled malicious connections. In the absence of such data, said techniques cannot be meaningfully trained, tested, and evaluated. Synthetically produced data containing fabricated or merged network traffic is of limited value as it is easily distinguishable from real traffic by even simple machine learning (ML) algorithms. Real network data is preferable, but while ubiquitous is broadly both sensitive and lacking in ground truth labels, limiting its utility for ML research.
This paper presents a multi-faceted approach to generating a data set of labeled malicious connections embedded within anonymized network traffic collected from large production networks. Real-world malware is defanged and introduced to simulated, secured nodes within those networks to generate realistic traffic while maintaining sufficient isolation to protect real data and infrastructure. Network sensor data, including this embedded malware traffic, is collected at a network edge and anonymized for research use.
Network traffic was collected and produced in accordance with the aforementioned methods at two major educational institutions. The result is a highly realistic, long term, multi-institution data set with embedded data labels spanning over 1.5 trillion connections and over a petabyte of sensor log data. The usability of this data set is demonstrated by its utility to our artificial intelligence and machine learning (AI/ML) research program.
Submitted 27 May, 2022; v1 submitted 20 April, 2021;
originally announced April 2021.
-
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
Authors:
Jesse Dodge,
Maarten Sap,
Ana Marasović,
William Agnew,
Gabriel Ilharco,
Dirk Groeneveld,
Margaret Mitchell,
Matt Gardner
Abstract:
Large language models have led to remarkable progress on many NLP tasks, and researchers are turning to ever-larger text corpora to train them. Some of the largest corpora available are made by scraping significant portions of the internet, and are frequently introduced with only minimal documentation. In this work we provide some of the first documentation for the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020), a dataset created by applying a set of filters to a single snapshot of Common Crawl. We begin by investigating where the data came from, and find a significant amount of text from unexpected sources like patents and US military websites. Then we explore the content of the text itself, and find machine-generated text (e.g., from machine translation systems) and evaluation examples from other benchmark NLP datasets. To understand the impact of the filters applied to create this dataset, we evaluate the text that was removed, and show that blocklist filtering disproportionately removes text from and about minority individuals. Finally, we conclude with some recommendations for how to create and document web-scale datasets from a scrape of the internet.
Submitted 30 September, 2021; v1 submitted 18 April, 2021;
originally announced April 2021.
-
Generative Context Pair Selection for Multi-hop Question Answering
Authors:
Dheeru Dua,
Cicero Nogueira dos Santos,
Patrick Ng,
Ben Athiwaratkun,
Bing Xiang,
Matt Gardner,
Sameer Singh
Abstract:
Compositional reasoning tasks like multi-hop question answering require making latent decisions to get the final answer, given a question. However, crowdsourced datasets often capture only a slice of the underlying task distribution, which can induce unanticipated biases in models performing compositional reasoning. Furthermore, discriminatively trained models exploit such biases to get a better held-out performance, without learning the right way to reason, as they do not necessitate paying attention to the question representation (conditioning variable) in its entirety, to estimate the answer likelihood. In this work, we propose a generative context selection model for multi-hop question answering that reasons about how the given question could have been generated given a context pair. While being comparable to the state-of-the-art answering performance, our proposed generative passage selection model has a better performance (4.9% higher than baseline) on an adversarial held-out set that tests the robustness of the model's multi-hop reasoning capabilities.
Submitted 18 April, 2021;
originally announced April 2021.
-
Learning with Instance Bundles for Reading Comprehension
Authors:
Dheeru Dua,
Pradeep Dasigi,
Sameer Singh,
Matt Gardner
Abstract:
When training most modern reading comprehension models, all the questions associated with a context are treated as being independent from each other. However, closely related questions and their corresponding answers are not independent, and leveraging these relationships could provide a strong supervision signal to a model. Drawing on ideas from contrastive estimation, we introduce several new supervision techniques that compare question-answer scores across multiple related instances. Specifically, we normalize these scores across various neighborhoods of closely contrasting questions and/or answers, adding another cross entropy loss term that is used in addition to traditional maximum likelihood estimation. Our techniques require bundles of related question-answer pairs, which we can either mine from within existing data or create using various automated heuristics. We empirically demonstrate the effectiveness of training with instance bundles on two datasets -- HotpotQA and ROPES -- showing up to 11% absolute gains in accuracy.
Submitted 18 April, 2021;
originally announced April 2021.
-
Competency Problems: On Finding and Removing Artifacts in Language Data
Authors:
Matt Gardner,
William Merrill,
Jesse Dodge,
Matthew E. Peters,
Alexis Ross,
Sameer Singh,
Noah A. Smith
Abstract:
Much recent work in NLP has documented dataset artifacts, bias, and spurious correlations between input features and output labels. However, how to tell which features have "spurious" instead of legitimate correlations is typically left unspecified. In this work we argue that for complex language understanding tasks, all simple feature correlations are spurious, and we formalize this notion into a class of problems which we call competency problems. For example, the word "amazing" on its own should not give information about a sentiment label independent of the context in which it appears, which could include negation, metaphor, sarcasm, etc. We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account, showing that realistic datasets will increasingly deviate from competency problems as dataset size increases. This analysis gives us a simple statistical test for dataset artifacts, which we use to show more subtle biases than were described in prior work, including demonstrating that models are inappropriately affected by these less extreme biases. Our theoretical treatment of this problem also allows us to analyze proposed solutions, such as making local edits to dataset instances, and to give recommendations for future data collection and model design efforts that target competency problems.
Submitted 28 December, 2021; v1 submitted 17 April, 2021;
originally announced April 2021.
-
Test beam characterization of sensor prototypes for the CMS Barrel MIP Timing Detector
Authors:
R. Abbott,
A. Abreu,
F. Addesa,
M. Alhusseini,
T. Anderson,
Y. Andreev,
A. Apresyan,
R. Arcidiacono,
M. Arenton,
E. Auffray,
D. Bastos,
L. A. T. Bauerdick,
R. Bellan,
M. Bellato,
A. Benaglia,
M. Benettoni,
R. Bertoni,
M. Besancon,
S. Bharthuar,
A. Bornheim,
E. Brücken,
J. N. Butler,
C. Campagnari,
M. Campana,
R. Carlin
, et al. (174 additional authors not shown)
Abstract:
The MIP Timing Detector will provide additional timing capabilities for detection of minimum ionizing particles (MIPs) at CMS during the High Luminosity LHC era, improving event reconstruction and pileup rejection. The central portion of the detector, the Barrel Timing Layer (BTL), will be instrumented with LYSO:Ce crystals and Silicon Photomultipliers (SiPMs) providing a time resolution of about 30 ps at the beginning of operation, and degrading to 50-60 ps at the end of the detector lifetime as a result of radiation damage. In this work, we present the results obtained using a 120 GeV proton beam at the Fermilab Test Beam Facility to measure the time resolution of unirradiated sensors. A proof-of-concept of the sensor layout proposed for the barrel region of the MTD, consisting of elongated crystal bars with dimensions of about 3 x 3 x 57 mm$^3$ and with double-ended SiPM readout, is demonstrated. This design provides a robust time measurement independent of the impact point of the MIP along the crystal bar. We tested LYSO:Ce bars of different thickness (2, 3, 4 mm) with a geometry close to the reference design and coupled to SiPMs manufactured by Hamamatsu and Fondazione Bruno Kessler. The various aspects influencing the timing performance such as the crystal thickness, properties of the SiPMs (e.g. photon detection efficiency), and impact angle of the MIP are studied. A time resolution of about 28 ps is measured for MIPs crossing a 3 mm thick crystal bar, corresponding to an MPV energy deposition of 2.6 MeV, and of 22 ps for the 4.2 MeV MPV energy deposition expected in the BTL, matching the detector performance target for unirradiated devices.
Submitted 16 July, 2021; v1 submitted 15 April, 2021;
originally announced April 2021.
-
Paired Examples as Indirect Supervision in Latent Decision Models
Authors:
Nitish Gupta,
Sameer Singh,
Matt Gardner,
Dan Roth
Abstract:
Compositional, structured models are appealing because they explicitly decompose problems and provide interpretable intermediate outputs that give confidence that the model is not simply latching onto data artifacts. Learning these models is challenging, however, because end-task supervision only provides a weak indirect signal on what values the latent decisions should take. This often results in the model failing to learn to perform the intermediate tasks correctly. In this work, we introduce a way to leverage paired examples that provide stronger cues for learning latent decisions. When two related training examples share internal substructure, we add an additional training objective to encourage consistency between their latent decisions. Such an objective does not require external supervision for the values of the latent output, or even the end task, yet provides an additional training signal to that provided by individual training examples themselves. We apply our method to improve compositional question answering using neural module networks on the DROP dataset. We explore three ways to acquire paired questions in DROP: (a) discovering naturally occurring paired examples within the dataset, (b) constructing paired examples using templates, and (c) generating paired examples using a question generation model. We empirically demonstrate that our proposed approach improves both in- and out-of-distribution generalization and leads to correct latent decision predictions.
Submitted 4 April, 2021;
originally announced April 2021.
-
Mitigating False-Negative Contexts in Multi-document Question Answering with Retrieval Marginalization
Authors:
Ansong Ni,
Matt Gardner,
Pradeep Dasigi
Abstract:
Question Answering (QA) tasks requiring information from multiple documents often rely on a retrieval model to identify relevant information for reasoning. The retrieval model is typically trained to maximize the likelihood of the labeled supporting evidence. However, when retrieving from large text corpora such as Wikipedia, the correct answer can often be obtained from multiple evidence candidates. Moreover, not all such candidates are labeled as positive during annotation, rendering the training signal weak and noisy. This problem is exacerbated when the questions are unanswerable or when the answers are Boolean, since the model cannot rely on lexical overlap to make a connection between the answer and supporting evidence. We develop a new parameterization of set-valued retrieval that handles unanswerable queries, and we show that marginalizing over this set during training allows a model to mitigate false negatives in supporting evidence annotations. We test our method on two multi-document QA datasets, IIRC and HotpotQA. On IIRC, we show that joint modeling with marginalization improves model performance by 5.5 F1 points and achieves a new state-of-the-art performance of 50.5 F1. We also show that retrieval marginalization results in 4.1 QA F1 improvement over a non-marginalized baseline on HotpotQA in the fullwiki setting.
Submitted 8 September, 2021; v1 submitted 22 March, 2021;
originally announced March 2021.
-
Learning from Task Descriptions
Authors:
Orion Weller,
Nicholas Lourie,
Matt Gardner,
Matthew E. Peters
Abstract:
Typically, machine learning systems solve new tasks by training on thousands of examples. In contrast, humans can solve new tasks by reading some instructions, with perhaps an example or two. To take a step toward closing this gap, we introduce a framework for developing NLP systems that solve new tasks after reading their descriptions, synthesizing prior work in this area. We instantiate this framework with a new English language dataset, ZEST, structured for task-oriented evaluation on unseen tasks. Formulating task descriptions as questions, we ensure each is general enough to apply to many possible inputs, thus comprehensively evaluating a model's ability to solve each task. Moreover, the dataset's structure tests specific types of systematic generalization. We find that the state-of-the-art T5 model achieves a score of 12% on ZEST, leaving a significant challenge for NLP researchers.
Submitted 16 November, 2020;
originally announced November 2020.
-
IIRC: A Dataset of Incomplete Information Reading Comprehension Questions
Authors:
James Ferguson,
Matt Gardner,
Hannaneh Hajishirzi,
Tushar Khot,
Pradeep Dasigi
Abstract:
Humans often have to read multiple documents to address their information needs. However, most existing reading comprehension (RC) tasks only focus on questions for which the contexts provide all the information required to answer them, thus not evaluating a system's performance at identifying a potential lack of sufficient information and locating sources for that information. To fill this gap, we present a dataset, IIRC, with more than 13K questions over paragraphs from English Wikipedia that provide only partial information to answer them, with the missing information occurring in one or more linked documents. The questions were written by crowd workers who did not have access to any of the linked documents, leading to questions that have little lexical overlap with the contexts where the answers appear. This process also gave many questions without answers, and those that require discrete reasoning, increasing the difficulty of the task. We follow recent modeling work on various reading comprehension datasets to construct a baseline model for this dataset, finding that it achieves 31.1% F1 on this task, while estimated human performance is 88.4%. The dataset, code for the baseline system, and a leaderboard can be found at https://allennlp.org/iirc.
Submitted 13 November, 2020;
originally announced November 2020.
-
Easy, Reproducible and Quality-Controlled Data Collection with Crowdaq
Authors:
Qiang Ning,
Hao Wu,
Pradeep Dasigi,
Dheeru Dua,
Matt Gardner,
Robert L. Logan IV,
Ana Marasovic,
Zhen Nie
Abstract:
High-quality and large-scale data are key to success for AI systems. However, large-scale data annotation efforts are often confronted with a set of common challenges: (1) designing a user-friendly annotation interface; (2) training enough annotators efficiently; and (3) reproducibility. To address these problems, we introduce Crowdaq, an open-source platform that standardizes the data collection pipeline with customizable user-interface components, automated annotator qualification, and saved pipelines in a re-usable format. We show that Crowdaq simplifies data annotation significantly on a diverse set of data collection use cases and we hope it will be a convenient tool for the community.
Submitted 5 October, 2020;
originally announced October 2020.
-
MedICaT: A Dataset of Medical Images, Captions, and Textual References
Authors:
Sanjay Subramanian,
Lucy Lu Wang,
Sachin Mehta,
Ben Bogin,
Madeleine van Zuylen,
Sravanthi Parasa,
Sameer Singh,
Matt Gardner,
Hannaneh Hajishirzi
Abstract:
Understanding the relationship between figures and text is key to scientific document understanding. Medical figures in particular are quite complex, often consisting of several subfigures (75% of figures in our dataset), with detailed text describing their content. Previous work studying figures in scientific papers focused on classifying figure content rather than understanding how images relate to the text. To address challenges in figure retrieval and figure-to-text alignment, we introduce MedICaT, a dataset of medical images in context. MedICaT consists of 217K images from 131K open access biomedical papers, and includes captions, inline references for 74% of figures, and manually annotated subfigures and subcaptions for a subset of figures. Using MedICaT, we introduce the task of subfigure to subcaption alignment in compound figures and demonstrate the utility of inline references in image-text matching. Our data and code can be accessed at https://github.com/allenai/medicat.
Submitted 12 October, 2020;
originally announced October 2020.
-
Improving Compositional Generalization in Semantic Parsing
Authors:
Inbar Oren,
Jonathan Herzig,
Nitish Gupta,
Matt Gardner,
Jonathan Berant
Abstract:
Generalization of models to out-of-distribution (OOD) data has captured tremendous attention recently. Specifically, compositional generalization, i.e., whether a model generalizes to new structures built of components observed during training, has sparked substantial interest. In this work, we investigate compositional generalization in semantic parsing, a natural test-bed for compositional generalization, as output programs are constructed from sub-components. We analyze a wide variety of models and propose multiple extensions to the attention module of the semantic parser, aiming to improve compositional generalization. We find that the following factors improve compositional generalization: (a) using contextual representations, such as ELMo and BERT, (b) informing the decoder what input tokens have previously been attended to, (c) training the decoder attention to agree with pre-computed token alignments, and (d) downsampling examples corresponding to frequent program templates. While we substantially reduce the gap between in-distribution and OOD generalization, performance on OOD compositions is still substantially lower.
Submitted 12 October, 2020;
originally announced October 2020.
-
MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics
Authors:
Anthony Chen,
Gabriel Stanovsky,
Sameer Singh,
Matt Gardner
Abstract:
Posing reading comprehension as a generation problem provides a great deal of flexibility, allowing for open-ended questions with few restrictions on possible answers. However, progress is impeded by existing generation metrics, which rely on token overlap and are agnostic to the nuances of reading comprehension. To address this, we introduce a benchmark for training and evaluating generative reading comprehension metrics: MOdeling Correctness with Human Annotations. MOCHA contains 40K human judgement scores on model outputs from 6 diverse question answering datasets and an additional set of minimal pairs for evaluation. Using MOCHA, we train a Learned Evaluation metric for Reading Comprehension, LERC, to mimic human judgement scores. LERC outperforms baseline metrics by 10 to 36 absolute Pearson points on held-out annotations. When we evaluate robustness on minimal pairs, LERC achieves 80% accuracy, outperforming baselines by 14 to 26 absolute percentage points while leaving significant room for improvement. MOCHA presents a challenging problem for developing accurate and robust generative reading comprehension metrics.
Submitted 15 October, 2020; v1 submitted 7 October, 2020;
originally announced October 2020.
-
Understanding Mention Detector-Linker Interaction in Neural Coreference Resolution
Authors:
Zhaofeng Wu,
Matt Gardner
Abstract:
Despite significant recent progress in coreference resolution, the quality of current state-of-the-art systems still considerably trails behind human-level performance. Using the CoNLL-2012 and PreCo datasets, we dissect the best instantiation of the mainstream end-to-end coreference resolution model that underlies most current best-performing coreference systems, and empirically analyze the behavior of its two components: mention detector and mention linker. While the detector traditionally focuses heavily on recall as a design decision, we demonstrate the importance of precision, calling for their balance. However, we point out the difficulty in building a precise detector due to its inability to make important anaphoricity decisions. We also highlight the enormous room for improving the linker and show that the rest of its errors mainly involve pronoun resolution. We propose promising next steps and hope our findings will help future research in coreference resolution.
Submitted 8 September, 2021; v1 submitted 20 September, 2020;
originally announced September 2020.
-
Quantifying the effect of oxygen on micro-mechanical properties of a near-alpha titanium alloy
Authors:
H. M. Gardner,
P. Gopon,
C. M. Magazzeni,
A. Radecka,
K. Fox,
D. Rugg,
J. Wade,
D. E. J. Armstrong,
M. P. Moody,
P. A. J. Bagot
Abstract:
Atom probe tomography (APT), electron probe microanalysis (EPMA) and nanoindentation were used to characterise the oxygen-rich layer on an in-service jet engine compressor disc, manufactured from the titanium alloy TIMETAL 834. Oxygen ingress was quantified and related to changes in mechanical properties through nanoindentation studies. The relationship between oxygen concentration, microstructure, crystal orientation and hardness has been explored through correlative hardness mapping, EPMA and electron backscatter diffraction (EBSD). The role of microstructure on oxygen ingress has been studied and oxygen ingress along a potential alpha/beta interface was directly observed on the nanoscale using APT.
Submitted 10 September, 2020;
originally announced September 2020.
-
Nanoindentation in multi-modal map combinations: A Correlative Approach to Local Mechanical Property Assessment
Authors:
C. M. Magazzeni,
H. M. Gardner,
I. Howe,
P. Gopon,
J. C. Waite,
D. Rugg,
D. E. J. Armstrong,
A. J. Wilkinson
Abstract:
A method is presented for the registration and correlation of intrinsic property maps of materials, including data from nanoindentation hardness, Electron Back-Scattered Diffraction (EBSD), and Electron Micro-Probe Analysis (EPMA). This highly spatially resolved method allows for the study of micron-scale microstructural features, and has the capability to rapidly extract correlations between multiple features of interest from datasets containing thousands of datapoints. Two case studies are presented in commercially pure (CP) titanium: in the first instance, the effect of crystal anisotropy on measured hardness and, in the second instance, the effect of an oxygen diffusion layer on hardness. The independently collected property maps are registered using affine geometric transformations and are interpolated to allow for direct correlation. The results show strong agreement with trends observed in the literature, as well as providing a large dataset to facilitate future statistical analysis of microstructure-dependent mechanisms.
Submitted 4 January, 2021; v1 submitted 27 August, 2020;
originally announced August 2020.
-
Latent Compositional Representations Improve Systematic Generalization in Grounded Question Answering
Authors:
Ben Bogin,
Sanjay Subramanian,
Matt Gardner,
Jonathan Berant
Abstract:
Answering questions that involve multi-step reasoning requires decomposing them and using the answers of intermediate steps to reach the final answer. However, state-of-the-art models in grounded question answering often do not explicitly perform decomposition, leading to difficulties in generalization to out-of-distribution examples. In this work, we propose a model that computes a representation and denotation for all question spans in a bottom-up, compositional manner using a CKY-style parser. Our model induces latent trees, driven by end-to-end (the answer) supervision only. We show that this inductive bias towards tree structures dramatically improves systematic generalization to out-of-distribution examples, compared to strong baselines on an arithmetic expressions benchmark as well as on CLOSURE, a dataset that focuses on systematic generalization for grounded question answering. On this challenging dataset, our model reaches an accuracy of 96.1%, significantly higher than prior models that almost perfectly solve the task on a random, in-distribution split.
Submitted 10 November, 2020; v1 submitted 1 July, 2020;
originally announced July 2020.
-
Continuous data assimilation applied to a velocity-vorticity formulation of the 2D Navier-Stokes equations
Authors:
Matthew Gardner,
Adam Larios,
Leo G. Rebholz,
Duygu Vargun,
Camille Zerfas
Abstract:
We study a continuous data assimilation (CDA) algorithm for a velocity-vorticity formulation of the 2D Navier-Stokes equations in two cases: nudging applied to the velocity and vorticity, and nudging applied to the velocity only. We prove that under a typical finite element spatial discretization and backward Euler temporal discretization, application of CDA preserves the unconditional long-time stability property of the velocity-vorticity method and provides optimal long-time accuracy. These properties hold if nudging is applied only to the velocity, and if nudging is also applied to the vorticity then the optimal long-time accuracy is achieved more rapidly in time. Numerical tests illustrate the theory, and show its effectiveness on an application problem of channel flow past a flat plate.
Submitted 12 June, 2020;
originally announced June 2020.
-
Obtaining Faithful Interpretations from Compositional Neural Networks
Authors:
Sanjay Subramanian,
Ben Bogin,
Nitish Gupta,
Tomer Wolfson,
Sameer Singh,
Jonathan Berant,
Matt Gardner
Abstract:
Neural module networks (NMNs) are a popular approach for modeling compositionality: they achieve high accuracy when applied to problems in language and vision, while reflecting the compositional structure of the problem in the network architecture. However, prior work implicitly assumed that the structure of the network modules, describing the abstract reasoning process, provides a faithful explanation of the model's reasoning; that is, that all modules perform their intended behaviour. In this work, we propose and conduct a systematic evaluation of the intermediate outputs of NMNs on NLVR2 and DROP, two datasets which require composing multiple reasoning steps. We find that the intermediate outputs differ from the expected output, illustrating that the network structure does not provide a faithful explanation of model behaviour. To remedy that, we train the model with auxiliary supervision and propose particular choices for module architecture that yield much better faithfulness, at a minimal cost to accuracy.
Submitted 8 September, 2020; v1 submitted 2 May, 2020;
originally announced May 2020.
-
TORQUE: A Reading Comprehension Dataset of Temporal Ordering Questions
Authors:
Qiang Ning,
Hao Wu,
Rujun Han,
Nanyun Peng,
Matt Gardner,
Dan Roth
Abstract:
A critical part of reading is being able to understand the temporal relationships between events described in a passage of text, even when those relationships are not explicitly stated. However, current machine reading comprehension benchmarks have practically no questions that test temporal phenomena, so systems trained on these benchmarks have no capacity to answer questions such as "what happened before/after [some event]?" We introduce TORQUE, a new English reading comprehension benchmark built on 3.2k news snippets with 21k human-generated questions querying temporal relationships. Results show that RoBERTa-large achieves an exact-match score of 51% on the test set of TORQUE, about 30% behind human performance.
Submitted 5 October, 2020; v1 submitted 1 May, 2020;
originally announced May 2020.
-
Multi-Step Inference for Reasoning Over Paragraphs
Authors:
Jiangming Liu,
Matt Gardner,
Shay B. Cohen,
Mirella Lapata
Abstract:
Complex reasoning over text requires understanding and chaining together free-form predicates and logical connectives. Prior work has largely tried to do this either symbolically or with black-box transformers. We present a middle ground between these two extremes: a compositional model reminiscent of neural module networks that can perform chained logical reasoning. This model first finds relevant sentences in the context and then chains them together using neural modules. Our model gives significant performance improvements (up to 29% relative error reduction when combined with a reranker) on ROPES, a recently introduced complex reasoning dataset.
Submitted 7 June, 2021; v1 submitted 6 April, 2020;
originally announced April 2020.
-
Evaluating Models' Local Decision Boundaries via Contrast Sets
Authors:
Matt Gardner,
Yoav Artzi,
Victoria Basmova,
Jonathan Berant,
Ben Bogin,
Sihao Chen,
Pradeep Dasigi,
Dheeru Dua,
Yanai Elazar,
Ananth Gottumukkala,
Nitish Gupta,
Hanna Hajishirzi,
Gabriel Ilharco,
Daniel Khashabi,
Kevin Lin,
Jiangming Liu,
Nelson F. Liu,
Phoebe Mulcaire,
Qiang Ning,
Sameer Singh,
Noah A. Smith,
Sanjay Subramanian,
Reut Tsarfaty,
Eric Wallace,
Ally Zhang,
et al. (1 additional author not shown)
Abstract:
Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture a dataset's intended capabilities. We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model's decision boundary, which can be used to more accurately evaluate a model's true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets, by up to 25% in some cases. We release our contrast sets as new evaluation benchmarks and encourage future dataset construction efforts to follow similar annotation processes.
Submitted 1 October, 2020; v1 submitted 6 April, 2020;
originally announced April 2020.
-
Break It Down: A Question Understanding Benchmark
Authors:
Tomer Wolfson,
Mor Geva,
Ankit Gupta,
Matt Gardner,
Yoav Goldberg,
Daniel Deutch,
Jonathan Berant
Abstract:
Understanding natural language questions entails the ability to break down a question into the requisite steps for computing its answer. In this work, we introduce a Question Decomposition Meaning Representation (QDMR) for questions. QDMR constitutes the ordered list of steps, expressed through natural language, that are necessary for answering a question. We develop a crowdsourcing pipeline, showing that quality QDMRs can be annotated at scale, and release the Break dataset, containing over 83K pairs of questions and their QDMRs. We demonstrate the utility of QDMR by showing that (a) it can be used to improve open-domain question answering on the HotpotQA dataset, (b) it can be deterministically converted to a pseudo-SQL formal language, which can alleviate annotation in semantic parsing applications. Last, we use Break to train a sequence-to-sequence model with copying that parses questions into QDMR structures, and show that it substantially outperforms several natural baselines.
Submitted 31 January, 2020;
originally announced January 2020.
-
Optimal Dispatch of Electrified Autonomous Mobility on Demand Vehicles during Power Outages
Authors:
Colin Sheppard,
Laurel N. Dunn,
Sangjae Bae,
Max Gardner
Abstract:
The era of fully autonomous, electrified taxi fleets is rapidly approaching, and with it the opportunity to innovate myriad on-demand services that extend beyond the realm of human mobility. This project envisions a future where autonomous plug-in electric vehicle (PEV) fleets can be dispatched as both a taxi service and a source of on-demand power serving customers during power outages. We develop a PDE-based scheme to manage the optimal dispatch of an autonomous fleet to serve passengers and electric power demand during outages as an additional stream of revenue. We use real world power outage and taxi data from San Francisco for our case study, modeling the optimal dispatch of several fleet sizes over the course of one day; we examine both moderate and extreme outage scenarios. In the moderate scenario, the revenue earned serving power demand is negligible compared with revenue earned serving passenger trips. In the extreme scenario, supplying power accounts for between $1 and $2 million, amounting to between 32% and 40% more revenue than is earned serving mobility only, depending on fleet size. While the overall value of providing on-demand power depends on the frequency and severity of power outages, our results show that serving power demand during large-scale outages can provide a substantial value stream, comparable to the value to be earned providing grid services.
Submitted 21 January, 2020;
originally announced January 2020.
-
ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension
Authors:
Dheeru Dua,
Ananth Gottumukkala,
Alon Talmor,
Sameer Singh,
Matt Gardner
Abstract:
Reading comprehension is one of the crucial tasks for furthering research in natural language understanding. A lot of diverse reading comprehension datasets have recently been introduced to study various phenomena in natural language, ranging from simple paraphrase matching and entity typing to entity tracking and understanding the implications of the context. Given the availability of many such datasets, comprehensive and reliable evaluation is tedious and time-consuming for researchers working on this problem. We present an evaluation server, ORB, that reports performance on seven diverse reading comprehension datasets, encouraging and facilitating testing a single model's capability in understanding a wide variety of reading phenomena. The evaluation server places no restrictions on how models are trained, so it is a suitable test bed for exploring training paradigms and representation learning for general reading facility. As more suitable datasets are released, they will be added to the evaluation server. We also collect and include synthetic augmentations for these datasets, testing how well models can handle out-of-domain questions.
Submitted 29 December, 2019;
originally announced December 2019.
-
Neural Module Networks for Reasoning over Text
Authors:
Nitish Gupta,
Kevin Lin,
Dan Roth,
Sameer Singh,
Matt Gardner
Abstract:
Answering compositional questions that require multiple steps of reasoning against text is challenging, especially when they involve discrete, symbolic operations. Neural module networks (NMNs) learn to parse such questions as executable programs composed of learnable modules, performing well on synthetic visual QA domains. However, we find that it is challenging to learn these models for non-synthetic questions on open-domain text, where a model needs to deal with the diversity of natural language and perform a broader range of reasoning. We extend NMNs by: (a) introducing modules that reason over a paragraph of text, performing symbolic reasoning (such as arithmetic, sorting, counting) over numbers and dates in a probabilistic and differentiable manner; and (b) proposing an unsupervised auxiliary loss to help extract arguments associated with the events in text. Additionally, we show that a limited amount of heuristically-obtained question program and intermediate module output supervision provides sufficient inductive bias for accurate learning. Our proposed model significantly outperforms state-of-the-art models on a subset of the DROP dataset that poses a variety of reasoning challenges that are covered by our modules.
Submitted 15 February, 2020; v1 submitted 10 December, 2019;
originally announced December 2019.
-
Deep Parametric Indoor Lighting Estimation
Authors:
Marc-André Gardner,
Yannick Hold-Geoffroy,
Kalyan Sunkavalli,
Christian Gagné,
Jean-François Lalonde
Abstract:
We present a method to estimate lighting from a single image of an indoor scene. Previous work has used an environment map representation that does not account for the localized nature of indoor lighting. Instead, we represent lighting as a set of discrete 3D lights with geometric and photometric parameters. We train a deep neural network to regress these parameters from a single image, on a dataset of environment maps annotated with depth. We propose a differentiable layer to convert these parameters to an environment map to compute our loss; this bypasses the challenge of establishing correspondences between estimated and ground truth lights. We demonstrate, via quantitative and qualitative evaluations, that our representation and training scheme lead to more accurate results compared to previous work, while allowing for more realistic 3D object compositing with spatially-varying lighting.
Submitted 19 October, 2019;
originally announced October 2019.