Showing 1–16 of 16 results for author: Vania, C

Searching in archive cs.
  1. arXiv:2305.14293

    cs.CL

    WebIE: Faithful and Robust Information Extraction on the Web

    Authors: Chenxi Whitehouse, Clara Vania, Alham Fikri Aji, Christos Christodoulopoulos, Andrea Pierleoni

    Abstract: Extracting structured and grounded fact triples from raw text is a fundamental task in Information Extraction (IE). Existing IE datasets are typically collected from Wikipedia articles, using hyperlinks to link entities to the Wikidata knowledge base. However, models trained only on Wikipedia have limitations when applied to web domains, which often contain noisy text or text that does not have an…

    Submitted 15 June, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: ACL 2023 Main Conference

  2. arXiv:2206.13163

    cs.CL cs.AI

    Endowing Language Models with Multimodal Knowledge Graph Representations

    Authors: Ningyuan Huang, Yash R. Deshpande, Yibo Liu, Houda Alberts, Kyunghyun Cho, Clara Vania, Iacer Calixto

    Abstract: We propose a method to make natural language understanding models more parameter efficient by storing knowledge in an external knowledge graph (KG) and retrieving from this KG using a dense index. Given (possibly multilingual) downstream task data, e.g., sentences in German, we retrieve entities from the KG and use their multimodal representations to improve downstream task performance. We use the…

    Submitted 27 June, 2022; originally announced June 2022.

    Comments: 14 pages with appendix, 2 figures, 15 tables

    MSC Class: 68T50 ACM Class: I.2.7; I.2.10; I.2.4

  3. arXiv:2205.03608

    cs.CL

    UniMorph 4.0: Universal Morphology

    Authors: Khuyagbaatar Batsuren, Omer Goldman, Salam Khalifa, Nizar Habash, Witold Kieraś, Gábor Bella, Brian Leonard, Garrett Nicolai, Kyle Gorman, Yustinus Ghanggo Ate, Maria Ryskina, Sabrina J. Mielke, Elena Budianskaya, Charbel El-Khaissi, Tiago Pimentel, Michael Gasser, William Lane, Mohit Raj, Matt Coler, Jaime Rafael Montoya Samame, Delio Siticonatzi Camaiteri, Benoît Sagot, Esaú Zumaeta Rojas, Didier López Francis, Arturo Oncevay, et al. (71 additional authors not shown)

    Abstract: The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This pa…

    Submitted 19 June, 2022; v1 submitted 7 May, 2022; originally announced May 2022.

    Comments: LREC 2022; The first two authors made equal contributions

  4. IndoNLI: A Natural Language Inference Dataset for Indonesian

    Authors: Rahmad Mahendra, Alham Fikri Aji, Samuel Louvan, Fahrurrozi Rahman, Clara Vania

    Abstract: We present IndoNLI, the first human-elicited NLI dataset for Indonesian. We adapt the data collection protocol for MNLI and collect nearly 18K sentence pairs annotated by crowd workers and experts. The expert-annotated data is used exclusively as a test set. It is designed to provide a challenging test-bed for Indonesian NLI by explicitly incorporating various linguistic phenomena such as numerica…

    Submitted 27 October, 2021; originally announced October 2021.

    Comments: Accepted at EMNLP 2021 main conference

    Journal ref: https://aclanthology.org/2021.emnlp-main.821/

  5. arXiv:2106.00840

    cs.CL

    Comparing Test Sets with Item Response Theory

    Authors: Clara Vania, Phu Mon Htut, William Huang, Dhara Mungra, Richard Yuanzhe Pang, Jason Phang, Haokun Liu, Kyunghyun Cho, Samuel R. Bowman

    Abstract: Recent years have seen numerous NLP datasets introduced to evaluate the performance of fine-tuned models on natural language understanding tasks. Recent results from large pretrained models, though, show that many of these datasets are largely saturated and unlikely to be able to detect further progress. What kind of datasets are still effective at discriminating among strong models, and what kind…

    Submitted 1 June, 2021; originally announced June 2021.

    Comments: ACL 2021

  6. arXiv:2106.00794

    cs.CL cs.AI cs.HC

    What Ingredients Make for an Effective Crowdsourcing Protocol for Difficult NLU Data Collection Tasks?

    Authors: Nikita Nangia, Saku Sugawara, Harsh Trivedi, Alex Warstadt, Clara Vania, Samuel R. Bowman

    Abstract: Crowdsourcing is widely used to create data for common natural language understanding tasks. Despite the importance of these datasets for measuring and refining model understanding of language, there has been little focus on the crowdsourcing methods used for collecting the datasets. In this paper, we compare the efficacy of interventions that have been proposed in prior work as ways of improving…

    Submitted 1 June, 2021; originally announced June 2021.

    Comments: ACL 2021

  7. arXiv:2010.06122

    cs.CL

    Asking Crowdworkers to Write Entailment Examples: The Best of Bad Options

    Authors: Clara Vania, Ruijie Chen, Samuel R. Bowman

    Abstract: Large-scale natural language inference (NLI) datasets such as SNLI or MNLI have been created by asking crowdworkers to read a premise and write three new hypotheses, one for each possible semantic relationship (entailment, contradiction, and neutral). While this protocol has been used to create useful benchmark data, it remains unclear whether the writing-based annotation protocol is optimal for…

    Submitted 12 October, 2020; originally announced October 2020.

    Comments: AACL 2020

  8. arXiv:2010.00133

    cs.CL cs.AI

    CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models

    Authors: Nikita Nangia, Clara Vania, Rasika Bhalerao, Samuel R. Bowman

    Abstract: Pretrained language models, especially masked language models (MLMs), have seen success across many NLP tasks. However, there is ample evidence that they use the cultural biases that are undoubtedly present in the corpora they are trained on, implicitly creating harm with biased representations. To measure some forms of social bias in language models against protected demographic groups in the US,…

    Submitted 30 September, 2020; originally announced October 2020.

    Comments: EMNLP 2020

  9. arXiv:2008.09150

    cs.CL cs.AI cs.CV

    VisualSem: A High-quality Knowledge Graph for Vision and Language

    Authors: Houda Alberts, Teresa Huang, Yash Deshpande, Yibo Liu, Kyunghyun Cho, Clara Vania, Iacer Calixto

    Abstract: An exciting frontier in natural language understanding (NLU) and generation (NLG) calls for (vision-and-) language models that can efficiently access external structured knowledge repositories. However, many existing knowledge bases only cover limited domains, or suffer from noisy data, and most of all are typically hard to integrate into neural language pipelines. To fill this gap, we release Vis…

    Submitted 20 October, 2021; v1 submitted 20 August, 2020; originally announced August 2020.

    Comments: Accepted for publication at the 1st Multilingual Representation Learning workshop (MRL 2021) co-located with EMNLP 2021. 15 pages, 8 figures, 6 tables

    ACM Class: E.0; E.2

  10. arXiv:2005.13013

    cs.CL

    English Intermediate-Task Training Improves Zero-Shot Cross-Lingual Transfer Too

    Authors: Jason Phang, Iacer Calixto, Phu Mon Htut, Yada Pruksachatkun, Haokun Liu, Clara Vania, Katharina Kann, Samuel R. Bowman

    Abstract: Intermediate-task training (fine-tuning a pretrained model on an intermediate task before fine-tuning again on the target task) often improves model performance substantially on language understanding tasks in monolingual English settings. We investigate whether English intermediate-task training is still helpful on non-English target tasks. Using nine intermediate language-understanding tasks,…

    Submitted 30 September, 2020; v1 submitted 26 May, 2020; originally announced May 2020.

  11. arXiv:2005.00628

    cs.CL

    Intermediate-Task Transfer Learning with Pretrained Models for Natural Language Understanding: When and Why Does It Work?

    Authors: Yada Pruksachatkun, Jason Phang, Haokun Liu, Phu Mon Htut, Xiaoyi Zhang, Richard Yuanzhe Pang, Clara Vania, Katharina Kann, Samuel R. Bowman

    Abstract: While pretrained models such as BERT have shown large gains across natural language understanding tasks, their performance can be improved by further training the model on a data-rich intermediate task, before fine-tuning it on a target task. However, it is still poorly understood when and why intermediate-task training is beneficial for a given target task. To investigate this, we perform a large…

    Submitted 9 May, 2020; v1 submitted 1 May, 2020; originally announced May 2020.

    Comments: ACL 2020

  12. arXiv:1909.02857

    cs.CL

    A systematic comparison of methods for low-resource dependency parsing on genuinely low-resource languages

    Authors: Clara Vania, Yova Kementchedjhieva, Anders Søgaard, Adam Lopez

    Abstract: Parsers are available for only a handful of the world's languages, since they require lots of training data. How far can we get with just a small amount of training data? We systematically compare a set of simple strategies for improving low-resource parsers: data augmentation, which has not been tested before; cross-lingual training; and transliteration. Experimenting on three typologically diver…

    Submitted 6 September, 2019; originally announced September 2019.

    Comments: EMNLP 2019

  13. arXiv:1903.09442

    cs.CL

    LINSPECTOR: Multilingual Probing Tasks for Word Representations

    Authors: Gözde Gül Şahin, Clara Vania, Ilia Kuznetsov, Iryna Gurevych

    Abstract: Despite an ever-growing number of word representation models introduced for a large number of languages, there is a lack of a standardized technique to provide insights into what is captured by these models. Such insights would help the community to get an estimate of the downstream task performance, as well as to design more informed neural architectures, while avoiding extensive experimentation…

    Submitted 11 December, 2019; v1 submitted 22 March, 2019; originally announced March 2019.

    Comments: Demo is available from: https://linspector.ukp.informatik.tu-darmstadt.de/

  14. arXiv:1808.09180

    cs.CL

    What do character-level models learn about morphology? The case of dependency parsing

    Authors: Clara Vania, Andreas Grivas, Adam Lopez

    Abstract: When parsing morphologically-rich languages with neural models, it is beneficial to model input at the character level, and it has been claimed that this is because character-level models learn morphology. We test these claims by comparing character-level models to an oracle with access to explicit morphological analysis on twelve languages with varying morphological typologies. Our results highli…

    Submitted 28 August, 2018; originally announced August 2018.

    Comments: EMNLP 2018

  15. arXiv:1704.08352

    cs.CL

    From Characters to Words to in Between: Do We Capture Morphology?

    Authors: Clara Vania, Adam Lopez

    Abstract: Words can be represented by composing the representations of subword units such as word segments, characters, and/or character n-grams. While such representations are effective and may capture the morphological regularities of words, they have not been systematically compared, and it is not understood how they interact with different morphological typologies. On a language modeling task, we presen…

    Submitted 26 April, 2017; originally announced April 2017.

    Comments: Accepted at ACL 2017

  16. arXiv:1606.09042

    cs.CR

    Bayesian Attack Model for Dynamic Risk Assessment

    Authors: Aguessy François-Xavier, Bettan Olivier, Blanc Grégory, Conan Vania, Debar Hervé

    Abstract: Because of the threat of advanced multi-step attacks, it is often difficult for security operators to completely cover all vulnerabilities when deploying remediations. Deploying sensors to monitor attacks exploiting residual vulnerabilities is not sufficient, and new tools are needed to assess the risk associated with the security events produced by these sensors. Although attack graphs were proposed…

    Submitted 29 June, 2016; originally announced June 2016.