Showing 1–9 of 9 results for author: Alam, M M i

  1. arXiv:2403.01983  [pdf]

    cs.CL

    Language and Speech Technology for Central Kurdish Varieties

    Authors: Sina Ahmadi, Daban Q. Jaff, Md Mahfuz Ibn Alam, Antonios Anastasopoulos

    Abstract: Kurdish, an Indo-European language spoken by over 30 million speakers, is considered a dialect continuum and known for its diversity in language varieties. Previous studies addressing language and speech technology for Kurdish handle it in a monolithic way as a macro-language, resulting in disparities for dialects and varieties for which there are few resources and tools available. In this paper,…

    Submitted 4 March, 2024; originally announced March 2024.

    Comments: Accepted to LREC-COLING 2024

  2. arXiv:2402.01945  [pdf, other]

    cs.CL

    A Case Study on Filtering for End-to-End Speech Translation

    Authors: Md Mahfuz Ibn Alam, Antonios Anastasopoulos

    Abstract: It is relatively easy to mine a large parallel corpus for any machine learning task, such as speech-to-text or speech-to-speech translation. Although these mined corpora are large in volume, their quality is questionable. This work shows that the simplest filtering technique can trim down these big, noisy datasets to a more manageable, clean dataset. We also show that using this clean dataset can…

    Submitted 2 February, 2024; originally announced February 2024.

  3. arXiv:2402.01939  [pdf, other]

    cs.CL

    A Morphologically-Aware Dictionary-based Data Augmentation Technique for Machine Translation of Under-Represented Languages

    Authors: Md Mahfuz Ibn Alam, Sina Ahmadi, Antonios Anastasopoulos

    Abstract: The availability of parallel texts is crucial to the performance of machine translation models. However, most of the world's languages face the predominant challenge of data scarcity. In this paper, we propose strategies to synthesize parallel data relying on morpho-syntactic information and using bilingual lexicons along with a small amount of seed parallel data. Our methodology adheres to a real…

    Submitted 2 February, 2024; originally announced February 2024.

  4. arXiv:2305.17267  [pdf, other]

    cs.CL

    CODET: A Benchmark for Contrastive Dialectal Evaluation of Machine Translation

    Authors: Md Mahfuz Ibn Alam, Sina Ahmadi, Antonios Anastasopoulos

    Abstract: Neural machine translation (NMT) systems exhibit limited robustness in handling source-side linguistic variations. Their performance tends to degrade when faced with even slight deviations in language usage, such as different domains or variations introduced by second-language speakers. It is intuitive to extend this observation to encompass dialectal variations as well, but the work allowing the…

    Submitted 2 February, 2024; v1 submitted 26 May, 2023; originally announced May 2023.

  5. arXiv:2305.17202  [pdf, other]

    cs.CL

    BIG-C: a Multimodal Multi-Purpose Dataset for Bemba

    Authors: Claytone Sikasote, Eunice Mukonde, Md Mahfuz Ibn Alam, Antonios Anastasopoulos

    Abstract: We present BIG-C (Bemba Image Grounded Conversations), a large multimodal dataset for Bemba. While Bemba is the most populous language of Zambia, it exhibits a dearth of resources which render the development of language technologies or language processing research almost impossible. The dataset is comprised of multi-turn dialogues between Bemba speakers based on images, transcribed and translated…

    Submitted 26 May, 2023; originally announced May 2023.

    Comments: accepted to ACL 2023

  6. arXiv:2305.14263  [pdf, other]

    cs.CL cs.AI

    LIMIT: Language Identification, Misidentification, and Translation using Hierarchical Models in 350+ Languages

    Authors: Milind Agarwal, Md Mahfuz Ibn Alam, Antonios Anastasopoulos

    Abstract: Knowing the language of an input text/audio is a necessary first step for using almost every NLP tool such as taggers, parsers, or translation systems. Language identification is a well-studied problem, sometimes even considered solved; in reality, due to lack of data and computational challenges, current systems cannot accurately identify most of the world's 7000 languages. To tackle this bottlen…

    Submitted 6 November, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: To appear at EMNLP 2023. 24 pages, 2 figures, 12 tables

  7. arXiv:2304.12979  [pdf, other]

    cs.CL cs.LG

    GMNLP at SemEval-2023 Task 12: Sentiment Analysis with Phylogeny-Based Adapters

    Authors: Md Mahfuz Ibn Alam, Ruoyu Xie, Fahim Faisal, Antonios Anastasopoulos

    Abstract: This report describes GMU's sentiment analysis system for the SemEval-2023 shared task AfriSenti-SemEval. We participated in all three sub-tasks: Monolingual, Multilingual, and Zero-Shot. Our approach uses models initialized with AfroXLMR-large, a pre-trained multilingual language model trained on African languages and fine-tuned correspondingly. We also introduce augmented training data along wit…

    Submitted 25 April, 2023; originally announced April 2023.

    Comments: Accepted at SemEval Workshop at ACL 2023

  8. arXiv:2109.12072  [pdf]

    cs.CL

    SD-QA: Spoken Dialectal Question Answering for the Real World

    Authors: Fahim Faisal, Sharlina Keshava, Md Mahfuz ibn Alam, Antonios Anastasopoulos

    Abstract: Question answering (QA) systems are now available through numerous commercial applications for a wide variety of domains, serving millions of users that interact with them via speech interfaces. However, current benchmarks in QA research do not account for the errors that speech recognition models might introduce, nor do they consider the language variations (dialects) of the users. To address thi…

    Submitted 24 September, 2021; originally announced September 2021.

    Comments: EMNLP 2021 Findings

  9. arXiv:2106.11891  [pdf, other]

    cs.CL

    On the Evaluation of Machine Translation for Terminology Consistency

    Authors: Md Mahfuz ibn Alam, Antonios Anastasopoulos, Laurent Besacier, James Cross, Matthias Gallé, Philipp Koehn, Vassilina Nikoulina

    Abstract: As neural machine translation (NMT) systems become an important part of professional translator pipelines, a growing body of work focuses on combining NMT with terminologies. In many scenarios and particularly in cases of domain adaptation, one expects the MT output to adhere to the constraints provided by a terminology. In this work, we propose metrics to measure the consistency of MT output with…

    Submitted 24 June, 2021; v1 submitted 22 June, 2021; originally announced June 2021.

    Comments: preprint