Showing 1–12 of 12 results for author: Mayhew, S

  1. arXiv:2406.03030  [pdf, other]

    cs.CL cs.LG

    From Tarzan to Tolkien: Controlling the Language Proficiency Level of LLMs for Content Generation

    Authors: Ali Malik, Stephen Mayhew, Chris Piech, Klinton Bicknell

    Abstract: We study the problem of controlling the difficulty level of text generated by Large Language Models (LLMs) for contexts where end-users are not fully proficient, such as language learners. Using a novel framework, we evaluate the effectiveness of several key approaches for this task, including few-shot prompting, supervised finetuning, and reinforcement learning (RL), utilising both GPT-4 and open…

    Submitted 5 June, 2024; originally announced June 2024.

    Journal ref: In Findings of the Association for Computational Linguistics (ACL 2024)

  2. arXiv:2311.09122  [pdf, other]

    cs.CL

    Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark

    Authors: Stephen Mayhew, Terra Blevins, Shuheng Liu, Marek Šuppa, Hila Gonen, Joseph Marvin Imperial, Börje F. Karlsson, Peiqin Lin, Nikola Ljubešić, LJ Miranda, Barbara Plank, Arij Riabi, Yuval Pinter

    Abstract: We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages. The overarching goal of UNER is to provide high-quality, cross-lingually consistent annotations to facilitate and standardize multilingual NER research. UNER v1 contains 18 datasets annotated with named entities in a cross-lingual consistent schema across 12 diverse langu…

    Submitted 29 June, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

    Comments: NAACL 2024 Camera-ready

  3. arXiv:2103.11811  [pdf]

    cs.CL cs.AI

    MasakhaNER: Named Entity Recognition for African Languages

    Authors: David Ifeoluwa Adelani, Jade Abbott, Graham Neubig, Daniel D'souza, Julia Kreutzer, Constantine Lignos, Chester Palen-Michel, Happy Buzaaba, Shruti Rijhwani, Sebastian Ruder, Stephen Mayhew, Israel Abebe Azime, Shamsuddeen Muhammad, Chris Chinenye Emezue, Joyce Nakatumba-Nabende, Perez Ogayo, Anuoluwapo Aremu, Catherine Gitau, Derguene Mbaye, Jesujoba Alabi, Seid Muhie Yimam, Tajuddeen Gwadabe, Ignatius Ezeani, Rubungo Andre Niyongabo, Jonathan Mukiibi, et al. (36 additional authors not shown)

    Abstract: We take a step towards addressing the under-representation of the African continent in NLP research by creating the first large publicly available high-quality dataset for named entity recognition (NER) in ten African languages, bringing together a variety of stakeholders. We detail characteristics of the languages to help researchers understand the challenges that these languages pose for NER. We…

    Submitted 5 July, 2021; v1 submitted 22 March, 2021; originally announced March 2021.

    Comments: Accepted to TACL 2021, pre-MIT Press publication version

  4. arXiv:2006.09627  [pdf]

    cs.CL

    Building Low-Resource NER Models Using Non-Speaker Annotation

    Authors: Tatiana Tsygankova, Francesca Marini, Stephen Mayhew, Dan Roth

    Abstract: In low-resource natural language processing (NLP), the key problems are a lack of target language training data, and a lack of native speakers to create it. Cross-lingual methods have had notable success in addressing these concerns, but in certain common circumstances, such as insufficient pre-training corpora or languages far from the source language, their performance suffers. In this work we p…

    Submitted 26 April, 2021; v1 submitted 16 June, 2020; originally announced June 2020.

    Comments: Accepted to DASH-LA 2021, workshop at NAACL 2021

  5. arXiv:2004.13640  [pdf, other]

    cs.CL

    Extending Multilingual BERT to Low-Resource Languages

    Authors: Zihan Wang, Karthikeyan K, Stephen Mayhew, Dan Roth

    Abstract: Multilingual BERT (M-BERT) has been a huge success in both supervised and zero-shot cross-lingual transfer learning. However, this success has focused only on the top 104 languages in Wikipedia that it was trained on. In this paper, we propose a simple but effective approach to extend M-BERT (E-BERT) so that it can benefit any new language, and show that our approach benefits languages that are al…

    Submitted 28 April, 2020; originally announced April 2020.

  6. arXiv:1912.07840  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Cross-Lingual Ability of Multilingual BERT: An Empirical Study

    Authors: Karthikeyan K, Zihan Wang, Stephen Mayhew, Dan Roth

    Abstract: Recent work has exhibited the surprising cross-lingual abilities of multilingual BERT (M-BERT) -- surprising since it is trained without any cross-lingual objective and with no aligned data. In this work, we provide a comprehensive study of the contribution of different components in M-BERT to its cross-lingual ability. We study the impact of linguistic properties of the languages, the architectur…

    Submitted 15 February, 2020; v1 submitted 17 December, 2019; originally announced December 2019.

  7. arXiv:1912.07095  [pdf, other]

    cs.CL

    Robust Named Entity Recognition with Truecasing Pretraining

    Authors: Stephen Mayhew, Nitish Gupta, Dan Roth

    Abstract: Although modern named entity recognition (NER) systems show impressive performance on standard datasets, they perform poorly when presented with noisy data. In particular, capitalization is a strong signal for entities in many languages, and even state of the art models overfit to this feature, with drastically lower performance on uncapitalized text. In this work, we address the problem of robust…

    Submitted 15 December, 2019; originally announced December 2019.

    Comments: Accepted to AAAI 2020

  8. arXiv:1909.09270  [pdf, other]

    cs.CL cs.LG

    Named Entity Recognition with Partially Annotated Training Data

    Authors: Stephen Mayhew, Snigdha Chaturvedi, Chen-Tse Tsai, Dan Roth

    Abstract: Supervised machine learning assumes the availability of fully-labeled data, but in many cases, such as low-resource languages, the only data available is partially annotated. We study the problem of Named Entity Recognition (NER) with partially annotated training data in which a fraction of the named entities are labeled, and all other tokens, entities or otherwise, are labeled as non-entity by de…

    Submitted 19 September, 2019; originally announced September 2019.

    Comments: Accepted to CoNLL 2019

  9. arXiv:1903.11222  [pdf, other]

    cs.CL

    ner and pos when nothing is capitalized

    Authors: Stephen Mayhew, Tatiana Tsygankova, Dan Roth

    Abstract: For those languages which use it, capitalization is an important signal for the fundamental NLP tasks of Named Entity Recognition (NER) and Part of Speech (POS) tagging. In fact, it is such a strong signal that model performance on these tasks drops sharply in common lowercased scenarios, such as noisy web text or machine translation outputs. In this work, we perform a systematic analysis of solut…

    Submitted 31 August, 2019; v1 submitted 26 March, 2019; originally announced March 2019.

    Comments: Accepted to EMNLP 2019

  10. arXiv:1809.05157  [pdf, other]

    cs.CL cs.IR

    On the Strength of Character Language Models for Multilingual Named Entity Recognition

    Authors: Xiaodong Yu, Stephen Mayhew, Mark Sammons, Dan Roth

    Abstract: Character-level patterns have been widely used as features in English Named Entity Recognition (NER) systems. However, to date there has been no direct investigation of the inherent differences between name and non-name tokens in text, nor whether this property holds across multiple languages. This paper analyzes the capabilities of corpus-agnostic Character-level Language Models (CLMs) in the bin…

    Submitted 20 September, 2018; v1 submitted 13 September, 2018; originally announced September 2018.

    Comments: 5 pages, EMNLP 2018 short paper

    Journal ref: EMNLP 2018

  11. arXiv:1611.04122  [pdf, ps, other]

    cs.CL

    Cross-lingual Dataless Classification for Languages with Small Wikipedia Presence

    Authors: Yangqiu Song, Stephen Mayhew, Dan Roth

    Abstract: This paper presents an approach to classify documents in any language into an English topical label space, without any text categorization training data. The approach, Cross-Lingual Dataless Document Classification (CLDDC) relies on mapping the English labels or short category description into a Wikipedia-based semantic representation, and on the use of the target language Wikipedia. Consequently,…

    Submitted 13 November, 2016; originally announced November 2016.

  12. arXiv:1609.04325  [pdf, other]

    cs.CL

    Transliteration in Any Language with Surrogate Languages

    Authors: Stephen Mayhew, Christos Christodoulopoulos, Dan Roth

    Abstract: We introduce a method for transliteration generation that can produce transliterations in every language. Where previous results are only as multilingual as Wikipedia, we show how to use training data from Wikipedia as surrogate training for any language. Thus, the problem becomes one of ranking Wikipedia languages in order of suitability with respect to a target language. We introduce several tas…

    Submitted 14 September, 2016; originally announced September 2016.