Showing 1–15 of 15 results for author: Khanuja, S

  1. arXiv:2404.01247  [pdf, other]

    cs.CL cs.CV

    An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance

    Authors: Simran Khanuja, Sathyanarayanan Ramamoorthy, Yueqi Song, Graham Neubig

    Abstract: Given the rise of multimedia content, human translators increasingly focus on culturally adapting not only words but also other modalities such as images to convey the same meaning. While several applications stand to benefit from this, machine translation systems remain confined to dealing with language in speech and text. In this work, we take a first step towards translating images to make them…

    Submitted 19 June, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

  2. arXiv:2403.01404  [pdf, other]

    cs.CL

    What Is Missing in Multilingual Visual Reasoning and How to Fix It

    Authors: Yueqi Song, Simran Khanuja, Graham Neubig

    Abstract: NLP models today strive for supporting multiple languages and modalities, improving accessibility for diverse users. In this paper, we evaluate their multilingual, multimodal capabilities by testing on a visual reasoning task. We observe that proprietary systems like GPT-4V obtain the best performance on this task now, but open models lag in comparison. Surprisingly, GPT-4V exhibits similar perfor…

    Submitted 3 March, 2024; originally announced March 2024.

  3. arXiv:2311.06379  [pdf, other]

    cs.CL

    DeMuX: Data-efficient Multilingual Learning

    Authors: Simran Khanuja, Srinivas Gowriraj, Lucio Dery, Graham Neubig

    Abstract: We consider the task of optimally fine-tuning pre-trained multilingual models, given small amounts of unlabelled target data and an annotation budget. In this paper, we introduce DEMUX, a framework that prescribes the exact data-points to label from vast amounts of unlabelled multilingual data, having unknown degrees of overlap with the target set. Unlike most prior works, our end-to-end framework…

    Submitted 10 November, 2023; originally announced November 2023.

  4. arXiv:2305.16171  [pdf]

    cs.CL

    Multi-lingual and Multi-cultural Figurative Language Understanding

    Authors: Anubha Kabra, Emmy Liu, Simran Khanuja, Alham Fikri Aji, Genta Indra Winata, Samuel Cahyawijaya, Anuoluwapo Aremu, Perez Ogayo, Graham Neubig

    Abstract: Figurative language permeates human communication, but at the same time is relatively understudied in NLP. Datasets have been created in English to accelerate progress towards measuring and improving figurative language processing in language models (LMs). However, the use of figurative language is an expression of our cultural and societal experiences, making it difficult for these phrases to be…

    Submitted 25 May, 2023; originally announced May 2023.

    Comments: ACL 2023 Findings

  5. arXiv:2305.14716  [pdf, other]

    cs.CL

    GlobalBench: A Benchmark for Global Progress in Natural Language Processing

    Authors: Yueqi Song, Catherine Cui, Simran Khanuja, Pengfei Liu, Fahim Faisal, Alissa Ostapenko, Genta Indra Winata, Alham Fikri Aji, Samuel Cahyawijaya, Yulia Tsvetkov, Antonios Anastasopoulos, Graham Neubig

    Abstract: Despite the major advances in NLP, significant disparities in NLP system performance across languages still exist. Arguably, these are due to uneven resource allocation and sub-optimal incentives to work on less resourced languages. To track and further incentivize the global development of equitable language technology, we introduce GlobalBench. Prior multilingual benchmarks are static and have f…

    Submitted 24 May, 2023; originally announced May 2023.

    Comments: Preprint, 9 pages

  6. arXiv:2205.12676  [pdf, other]

    cs.CL

    Evaluating the Diversity, Equity and Inclusion of NLP Technology: A Case Study for Indian Languages

    Authors: Simran Khanuja, Sebastian Ruder, Partha Talukdar

    Abstract: In order for NLP technology to be widely applicable, fair, and useful, it needs to serve a diverse set of speakers across the world's languages, be equitable, i.e., not unduly biased towards any particular language, and be inclusive of all users, particularly in low-resource settings where compute constraints are common. In this paper, we propose an evaluation paradigm that assesses NLP technologi…

    Submitted 12 April, 2023; v1 submitted 25 May, 2022; originally announced May 2022.

    Comments: Accepted to EACL Findings, 2023

  7. arXiv:2205.12446  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech

    Authors: Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, Ankur Bapna

    Abstract: We introduce FLEURS, the Few-shot Learning Evaluation of Universal Representations of Speech benchmark. FLEURS is an n-way parallel speech dataset in 102 languages built on top of the machine translation FLoRes-101 benchmark, with approximately 12 hours of speech supervision per language. FLEURS can be used for a variety of speech tasks, including Automatic Speech Recognition (ASR), Speech Languag…

    Submitted 24 May, 2022; originally announced May 2022.

  8. arXiv:2203.10752  [pdf, other]

    cs.CL

    XTREME-S: Evaluating Cross-lingual Speech Representations

    Authors: Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, Daan Van Esch, Vera Axelrod, Simran Khanuja, Jonathan H. Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, Melvin Johnson

    Abstract: We introduce XTREME-S, a new benchmark to evaluate universal cross-lingual speech representations in many languages. XTREME-S covers four task families: speech recognition, classification, speech-to-text translation and retrieval. Covering 102 languages from 10+ language families, 3 different domains and 4 task families, XTREME-S aims to simplify multilingual speech representation evaluation, as w…

    Submitted 13 April, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

    Comments: Minor fix: language code for Filipino (Tagalog), "tg" -> "tl"

  9. arXiv:2202.01374  [pdf, other]

    cs.CL cs.LG

    mSLAM: Massively multilingual joint pre-training for speech and text

    Authors: Ankur Bapna, Colin Cherry, Yu Zhang, Ye Jia, Melvin Johnson, Yong Cheng, Simran Khanuja, Jason Riesa, Alexis Conneau

    Abstract: We present mSLAM, a multilingual Speech and LAnguage Model that learns cross-lingual cross-modal representations of speech and text by pre-training jointly on large amounts of unlabeled speech and text in multiple languages. mSLAM combines w2v-BERT pre-training on speech with SpanBERT pre-training on character-level text, along with Connectionist Temporal Classification (CTC) losses on paired spee…

    Submitted 2 February, 2022; originally announced February 2022.

  10. arXiv:2106.02834  [pdf, other]

    cs.CL

    MergeDistill: Merging Pre-trained Language Models using Distillation

    Authors: Simran Khanuja, Melvin Johnson, Partha Talukdar

    Abstract: Pre-trained multilingual language models (LMs) have achieved state-of-the-art results in cross-lingual transfer, but they often lead to an inequitable representation of languages due to limited capacity, skewed pre-training data, and sub-optimal vocabularies. This has prompted the creation of an ever-growing pre-trained model universe, where each model is trained on large amounts of language or do…

    Submitted 5 June, 2021; originally announced June 2021.

    Comments: ACL 2021 Findings

  11. arXiv:2103.10730  [pdf, other]

    cs.CL

    MuRIL: Multilingual Representations for Indian Languages

    Authors: Simran Khanuja, Diksha Bansal, Sarvesh Mehtani, Savya Khosla, Atreyee Dey, Balaji Gopalan, Dilip Kumar Margam, Pooja Aggarwal, Rajiv Teja Nagipogu, Shachi Dave, Shruti Gupta, Subhash Chandra Bose Gali, Vish Subramanian, Partha Talukdar

    Abstract: India is a multilingual society with 1369 rationalized languages and dialects being spoken across the country (INDIA, 2011). Of these, the 22 scheduled languages have a staggering total of 1.17 billion speakers and 121 languages have more than 10,000 speakers (INDIA, 2011). India also has the second largest (and an ever growing) digital footprint (Statista, 2020). Despite this, today's state-of-th…

    Submitted 2 April, 2021; v1 submitted 19 March, 2021; originally announced March 2021.

  12. arXiv:2011.06226  [pdf, other]

    cs.CL

    Cross-lingual and Multilingual Spoken Term Detection for Low-Resource Indian Languages

    Authors: Sanket Shah, Satarupa Guha, Simran Khanuja, Sunayana Sitaram

    Abstract: Spoken Term Detection (STD) is the task of searching for words or phrases within audio, given either text or spoken input as a query. In this work, we use state-of-the-art Hindi, Tamil and Telugu ASR systems cross-lingually for lexical Spoken Term Detection in ten low-resource Indian languages. Since no publicly available dataset exists for Spoken Term Detection in these languages, we create a new…

    Submitted 12 November, 2020; originally announced November 2020.

    Comments: 5 pages, 2 figures, 6 tables, 17 references

  13. arXiv:2004.12376  [pdf, other]

    cs.CL

    GLUECoS: An Evaluation Benchmark for Code-Switched NLP

    Authors: Simran Khanuja, Sandipan Dandapat, Anirudh Srinivasan, Sunayana Sitaram, Monojit Choudhury

    Abstract: Code-switching is the use of more than one language in the same conversation or utterance. Recently, multilingual contextual embedding models, trained on multiple monolingual corpora, have shown promising results on cross-lingual and multilingual tasks. We present an evaluation benchmark, GLUECoS, for code-switched languages, that spans several NLP tasks in English-Hindi and English-Spanish. Speci…

    Submitted 14 May, 2020; v1 submitted 26 April, 2020; originally announced April 2020.

    Comments: To appear at ACL 2020

  14. arXiv:2004.05051  [pdf, other]

    cs.CL

    A New Dataset for Natural Language Inference from Code-mixed Conversations

    Authors: Simran Khanuja, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury

    Abstract: Natural Language Inference (NLI) is the task of inferring the logical relationship, typically entailment or contradiction, between a premise and hypothesis. Code-mixing is the use of more than one language in the same conversation or utterance, and is prevalent in multilingual communities all over the world. In this paper, we present the first dataset for code-mixed NLI, in which both the premises…

    Submitted 13 April, 2020; v1 submitted 10 April, 2020; originally announced April 2020.

    Comments: To appear in CALCS, LREC 2020

  15. arXiv:1912.03457  [pdf, other]

    cs.CL cs.CY

    Unsung Challenges of Building and Deploying Language Technologies for Low Resource Language Communities

    Authors: Pratik Joshi, Christain Barnes, Sebastin Santy, Simran Khanuja, Sanket Shah, Anirudh Srinivasan, Satwik Bhattamishra, Sunayana Sitaram, Monojit Choudhury, Kalika Bali

    Abstract: In this paper, we examine and analyze the challenges associated with developing and introducing language technologies to low-resource language communities. While doing so, we bring to light the successes and failures of past work in this area, challenges being faced in doing so, and what they have achieved. Throughout this paper, we take a problem-facing approach and describe essential factors whi…

    Submitted 7 December, 2019; originally announced December 2019.

    Comments: Accepted at ICON 2019; 9 pages