Showing 1–4 of 4 results for author: Kakwani, D

  1. arXiv:2104.05596  [pdf]

    cs.CL

    Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages

    Authors: Gowtham Ramesh, Sumanth Doddapaneni, Aravinth Bheemaraj, Mayank Jobanputra, Raghavan AK, Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Mahalakshmi J, Divyanshu Kakwani, Navneet Kumar, Aswin Pradeep, Srihari Nagaraj, Kumar Deepak, Vivek Raghavan, Anoop Kunchukuttan, Pratyush Kumar, Mitesh Shantadevi Khapra

    Abstract: We present Samanantar, the largest publicly available parallel corpora collection for Indic languages. The collection contains a total of 49.7 million sentence pairs between English and 11 Indic languages (from two language families). Specifically, we compile 12.4 million sentence pairs from existing, publicly-available parallel corpora, and additionally mine 37.4 million sentence pairs from the w…

    Submitted 12 June, 2023; v1 submitted 12 April, 2021; originally announced April 2021.

    Comments: Accepted to the Transactions of the Association for Computational Linguistics (TACL)

  2. arXiv:2103.02830  [pdf, other]

    cs.PL

    MonkeyDB: Effectively Testing Correctness against Weak Isolation Levels

    Authors: Ranadeep Biswas, Diptanshu Kakwani, Jyothi Vedurada, Constantin Enea, Akash Lal

    Abstract: Modern applications, such as social networking systems and e-commerce platforms, are centered around using large-scale storage systems for storing and retrieving data. In the presence of concurrent accesses, these storage systems trade off isolation for performance. The weaker the isolation level, the more behaviors a storage system is allowed to exhibit and it is up to the developer to ensure that…

    Submitted 3 March, 2021; originally announced March 2021.

  3. arXiv:2005.00085  [pdf, ps, other]

    cs.CL

    AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages

    Authors: Anoop Kunchukuttan, Divyanshu Kakwani, Satish Golla, Gokul N. C., Avik Bhattacharyya, Mitesh M. Khapra, Pratyush Kumar

    Abstract: We present the IndicNLP corpus, a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families. We share pre-trained word embeddings trained on these corpora. We create news article category classification datasets for 9 languages to evaluate the embeddings. We show that the IndicNLP embeddings significantly outperform publicly available pre-tr…

    Submitted 30 April, 2020; originally announced May 2020.

    Comments: 7 pages, 8 tables, https://github.com/ai4bharat-indicnlp/indicnlp_corpus

  4. arXiv:1905.08051  [pdf, other]

    cs.DC

    Distributed Algorithms for Subgraph-Centric Graph Platforms

    Authors: Diptanshu Kakwani, Yogesh Simmhan

    Abstract: Graph analytics for large scale graphs has gained interest in recent years. Many graph algorithms have been designed for vertex-centric distributed graph processing frameworks to operate on large graphs with 100 M vertices and edges, using commodity clusters and Clouds. Subgraph-centric programming models have shown additional performance benefits than vertex-centric models. But direct mapping of…

    Submitted 20 May, 2019; originally announced May 2019.

    Journal ref: HiPC 2016 SRS