Showing 1–9 of 9 results for author: Gow-Smith, E

  1. arXiv:2405.09279  [pdf, other]

    cs.CL cs.AI

    Sign of the Times: Evaluating the use of Large Language Models for Idiomaticity Detection

    Authors: Dylan Phelps, Thomas Pickard, Maggie Mi, Edward Gow-Smith, Aline Villavicencio

    Abstract: Despite the recent ubiquity of large language models and their high zero-shot prompted performance across a wide range of tasks, it is still not known how well they perform on tasks which require processing of potentially idiomatic language. In particular, how well do such models perform in comparison to encoder-only models fine-tuned specifically for idiomaticity tasks? In this work, we attempt t…

    Submitted 15 May, 2024; originally announced May 2024.

    Comments: Presented at the MWE-UD Workshop at LREC-COLING 2024

  2. arXiv:2401.07923  [pdf, other]

    cs.CL

    Word Boundary Information Isn't Useful for Encoder Language Models

    Authors: Edward Gow-Smith, Dylan Phelps, Harish Tayyar Madabushi, Carolina Scarton, Aline Villavicencio

    Abstract: All existing transformer-based approaches to NLP using subword tokenisation algorithms encode whitespace (word boundary information) through the use of special space symbols (such as ## or _) forming part of tokens. These symbols have been shown to a) lead to reduced morphological validity of tokenisations, and b) give substantial vocabulary redundancy. As such, removing these symbols has been…

    Submitted 15 January, 2024; originally announced January 2024.

    Comments: Preprint

  3. arXiv:2306.09830  [pdf, other]

    cs.CL

    Sheffield's Submission to the AmericasNLP Shared Task on Machine Translation into Indigenous Languages

    Authors: Edward Gow-Smith, Danae Sánchez Villegas

    Abstract: In this paper we describe the University of Sheffield's submission to the AmericasNLP 2023 Shared Task on Machine Translation into Indigenous Languages which comprises the translation from Spanish to eleven indigenous languages. Our approach consists of extending, training, and ensembling different variations of NLLB-200. We use data provided by the organizers and data from various other sources s…

    Submitted 16 June, 2023; originally announced June 2023.

    Comments: Best-performing submission overall to the AmericasNLP 2023 Shared Task. Code and models available here: https://github.com/edwardgowsmith/americasnlp-2023-sheffield

  4. arXiv:2306.07763  [pdf, other]

    cs.CL

    NAVER LABS Europe's Multilingual Speech Translation Systems for the IWSLT 2023 Low-Resource Track

    Authors: Edward Gow-Smith, Alexandre Berard, Marcely Zanon Boito, Ioan Calapodescu

    Abstract: This paper presents NAVER LABS Europe's systems for Tamasheq-French and Quechua-Spanish speech translation in the IWSLT 2023 Low-Resource track. Our work attempts to maximize translation quality in low-resource settings using multilingual parameter-efficient solutions that leverage strong pre-trained models. Our primary submission for Tamasheq outperforms the previous state of the art by 7.5 BLEU…

    Submitted 13 June, 2023; originally announced June 2023.

    Comments: IWSLT 2023: Tamasheq-French and Quechua-Spanish challenge winner

  5. arXiv:2205.11370  [pdf, other]

    cs.CL

    Use of Transformer-Based Models for Word-Level Transliteration of the Book of the Dean of Lismore

    Authors: Edward Gow-Smith, Mark McConville, William Gillies, Jade Scott, Roibeard Ó Maolalaigh

    Abstract: The Book of the Dean of Lismore (BDL) is a 16th-century Scottish Gaelic manuscript written in a non-standard orthography. In this work, we outline the problem of transliterating the text of the BDL into a standardised orthography, and perform exploratory experiments using Transformer-based models for this task. In particular, we focus on the task of word-level transliteration, and achieve a charac…

    Submitted 31 May, 2022; v1 submitted 23 May, 2022; originally announced May 2022.

    Comments: 4th Celtic Language Technology Workshop

  6. arXiv:2205.11306  [pdf, ps, other]

    cs.CL

    Sample Efficient Approaches for Idiomaticity Detection

    Authors: Dylan Phelps, Xuan-Rui Fan, Edward Gow-Smith, Harish Tayyar Madabushi, Carolina Scarton, Aline Villavicencio

    Abstract: Deep neural models, in particular Transformer-based pre-trained language models, require a significant amount of data to train. This need for data tends to lead to problems when dealing with idiomatic multiword expressions (MWEs), which are inherently less frequent in natural text. As such, this work explores sample efficient methods of idiomaticity detection. In particular we study the impact of…

    Submitted 23 May, 2022; originally announced May 2022.

  7. arXiv:2204.10050  [pdf, other]

    cs.CL

    SemEval-2022 Task 2: Multilingual Idiomaticity Detection and Sentence Embedding

    Authors: Harish Tayyar Madabushi, Edward Gow-Smith, Marcos Garcia, Carolina Scarton, Marco Idiart, Aline Villavicencio

    Abstract: This paper presents the shared task on Multilingual Idiomaticity Detection and Sentence Embedding, which consists of two subtasks: (a) a binary classification task aimed at identifying whether a sentence contains an idiomatic expression, and (b) a task based on semantic text similarity which requires the model to adequately represent potentially idiomatic expressions in context. Each subtask inclu…

    Submitted 30 May, 2022; v1 submitted 21 April, 2022; originally announced April 2022.

    Comments: Data available at https://github.com/H-TayyarMadabushi/SemEval_2022_Task2-idiomaticity and competition website at https://sites.google.com/view/semeval2022task2-idiomaticity

  8. arXiv:2204.04058  [pdf, other]

    cs.CL

    Improving Tokenisation by Alternative Treatment of Spaces

    Authors: Edward Gow-Smith, Harish Tayyar Madabushi, Carolina Scarton, Aline Villavicencio

    Abstract: Tokenisation is the first step in almost all NLP tasks, and state-of-the-art transformer-based language models all use subword tokenisation algorithms to process input text. Existing algorithms have problems, often producing tokenisations of limited linguistic validity, and representing equivalent strings differently depending on their position within a word. We hypothesise that these problems hin…

    Submitted 22 October, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

    Comments: EMNLP 2022

  9. arXiv:2109.04413  [pdf, other]

    cs.CL

    AStitchInLanguageModels: Dataset and Methods for the Exploration of Idiomaticity in Pre-Trained Language Models

    Authors: Harish Tayyar Madabushi, Edward Gow-Smith, Carolina Scarton, Aline Villavicencio

    Abstract: Despite their success in a variety of NLP tasks, pre-trained language models, due to their heavy reliance on compositionality, fail in effectively capturing the meanings of multiword expressions (MWEs), especially idioms. Therefore, datasets and methods to improve the representation of MWEs are urgently needed. Existing datasets are limited to providing the degree of idiomaticity of expressions al…

    Submitted 9 September, 2021; originally announced September 2021.

    Comments: Findings of EMNLP 2021. Code available at: https://github.com/H-TayyarMadabushi/AStitchInLanguageModels