Showing 1–2 of 2 results for author: Madhani, Y

  1. arXiv:2305.15814  [pdf, other]

    cs.CL

    Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages

    Authors: Yash Madhani, Mitesh M. Khapra, Anoop Kunchukuttan

    Abstract: We create publicly available language identification (LID) datasets and models in all 22 Indian languages listed in the Indian constitution in both native-script and romanized text. First, we create Bhasha-Abhijnaanam, a language identification test set for native-script as well as romanized text which spans all 22 Indic languages. We also train IndicLID, a language identifier for all the above-me…

    Submitted 26 October, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

    Comments: Accepted to ACL 2023

  2. arXiv:2205.03018  [pdf]

    cs.CL

    Aksharantar: Open Indic-language Transliteration datasets and models for the Next Billion Users

    Authors: Yash Madhani, Sushane Parthan, Priyanka Bedekar, Gokul NC, Ruchi Khapra, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra

    Abstract: Transliteration is very important in the Indian language context due to the usage of multiple scripts and the widespread use of romanized inputs. However, few training and evaluation sets are publicly available. We introduce Aksharantar, the largest publicly available transliteration dataset for Indian languages created by mining from monolingual and parallel corpora, as well as collecting data fr…

    Submitted 26 October, 2023; v1 submitted 6 May, 2022; originally announced May 2022.

    Comments: This manuscript is an extended version of the paper accepted to EMNLP Findings 2023. You can find the EMNLP Findings version at https://anoopkunchukuttan.gitlab.io/publications/emnlp_findings_2023_aksharantar.pdf