Showing 1–1 of 1 results for author: C., G N

  1. arXiv:2005.00085  [pdf, ps, other]

    cs.CL

    AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages

    Authors: Anoop Kunchukuttan, Divyanshu Kakwani, Satish Golla, Gokul N. C., Avik Bhattacharyya, Mitesh M. Khapra, Pratyush Kumar

    Abstract: We present the IndicNLP corpus, a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families. We share pre-trained word embeddings trained on these corpora. We create news article category classification datasets for 9 languages to evaluate the embeddings. We show that the IndicNLP embeddings significantly outperform publicly available pre-tr…

    Submitted 30 April, 2020; originally announced May 2020.

    Comments: 7 pages, 8 tables, https://github.com/ai4bharat-indicnlp/indicnlp_corpus