Showing 1–4 of 4 results for author: Chimoto, E A

  1. arXiv:2405.19462

    cs.CL

    Critical Learning Periods: Leveraging Early Training Dynamics for Efficient Data Pruning

    Authors: Everlyn Asiko Chimoto, Jay Gala, Orevaoghene Ahia, Julia Kreutzer, Bruce A. Bassett, Sara Hooker

    Abstract: Neural Machine Translation models are extremely data and compute-hungry. However, not all data points contribute equally to model training and generalization. Data pruning to remove the low-value data points has the benefit of drastically reducing the compute budget without significant drop in model performance. In this paper, we propose a new data pruning technique: Checkpoints Across Time (CAT),…

    Submitted 21 June, 2024; v1 submitted 29 May, 2024; originally announced May 2024.

    Comments: Accepted to ACL 2024 Findings

  2. arXiv:2306.00410

    cs.CL cs.SD eess.AS

    Towards hate speech detection in low-resource languages: Comparing ASR to acoustic word embeddings on Wolof and Swahili

    Authors: Christiaan Jacobs, Nathanaël Carraz Rakotonirina, Everlyn Asiko Chimoto, Bruce A. Bassett, Herman Kamper

    Abstract: We consider hate speech detection through keyword spotting on radio broadcasts. One approach is to build an automatic speech recognition (ASR) system for the target low-resource language. We compare this to using acoustic word embedding (AWE) models that map speech segments to a space where matching words have similar vectors. We specifically use a multilingual AWE model trained on labelled data f…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: Accepted to Interspeech 2023

  3. arXiv:2211.00046

    cs.CL

    Very Low Resource Sentence Alignment: Luhya and Swahili

    Authors: Everlyn Asiko Chimoto, Bruce A. Bassett

    Abstract: Language-agnostic sentence embeddings generated by pre-trained models such as LASER and LaBSE are attractive options for mining large datasets to produce parallel corpora for low-resource machine translation. We test LASER and LaBSE in extracting bitext for two related low-resource African languages: Luhya and Swahili. For this work, we created a new parallel set of nearly 8000 Luhya-English sente…

    Submitted 31 October, 2022; originally announced November 2022.

    Comments: Accepted to LoResMT 2022

  4. arXiv:2210.15696

    cs.CL

    COMET-QE and Active Learning for Low-Resource Machine Translation

    Authors: Everlyn Asiko Chimoto, Bruce A. Bassett

    Abstract: Active learning aims to deliver maximum benefit when resources are scarce. We use COMET-QE, a reference-free evaluation metric, to select sentences for low-resource neural machine translation. Using Swahili, Kinyarwanda and Spanish for our experiments, we show that COMET-QE significantly outperforms two variants of Round Trip Translation Likelihood (RTTL) and random sentence selection by up to 5 B…

    Submitted 27 October, 2022; originally announced October 2022.

    Comments: Accepted to Findings of EMNLP 2022