-
Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages
Authors:
Gowtham Ramesh,
Sumanth Doddapaneni,
Aravinth Bheemaraj,
Mayank Jobanputra,
Raghavan AK,
Ajitesh Sharma,
Sujit Sahoo,
Harshita Diddee,
Mahalakshmi J,
Divyanshu Kakwani,
Navneet Kumar,
Aswin Pradeep,
Srihari Nagaraj,
Kumar Deepak,
Vivek Raghavan,
Anoop Kunchukuttan,
Pratyush Kumar,
Mitesh Shantadevi Khapra
Abstract:
We present Samanantar, the largest publicly available parallel corpora collection for Indic languages. The collection contains a total of 49.7 million sentence pairs between English and 11 Indic languages (from two language families). Specifically, we compile 12.4 million sentence pairs from existing, publicly available parallel corpora, and additionally mine 37.4 million sentence pairs from the web, resulting in a 4x increase. We mine the parallel sentences from the web by combining many corpora, tools, and methods: (a) web-crawled monolingual corpora, (b) document OCR for extracting sentences from scanned documents, (c) multilingual representation models for aligning sentences, and (d) approximate nearest neighbor search for searching in a large collection of sentences. Human evaluation of samples from the newly mined corpora validates the high quality of the parallel sentences across 11 languages. Further, we extract 83.4 million sentence pairs between all 55 Indic language pairs from the English-centric parallel corpus using English as the pivot language. We trained multilingual NMT models spanning all these languages on Samanantar, which outperform existing models and baselines on publicly available benchmarks, such as FLORES, establishing the utility of Samanantar. Our data and models are available publicly at https://ai4bharat.iitm.ac.in/samanantar and we hope they will help advance research in NMT and multilingual NLP for Indic languages.
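The pivoting step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that two English-centric corpora are joined on exactly matching English sentences, which may differ from the procedure the authors actually used; the function name `pivot_pairs` and the sample sentences are hypothetical. With 11 Indic languages there are 11 × 10 / 2 = 55 unordered language pairs, which matches the count in the abstract.

```python
from collections import defaultdict


def pivot_pairs(en_to_a, en_to_b):
    """Join two English-centric corpora on the English side.

    en_to_a and en_to_b are lists of (english, translation) tuples.
    Returns (translation_a, translation_b) pairs that share the same
    English sentence. Exact string matching is an assumption made for
    this sketch, not a detail taken from the paper.
    """
    by_english = defaultdict(list)
    for english, sentence_a in en_to_a:
        by_english[english].append(sentence_a)

    pairs = []
    for english, sentence_b in en_to_b:
        # .get avoids creating empty entries for unmatched sentences
        for sentence_a in by_english.get(english, []):
            pairs.append((sentence_a, sentence_b))
    return pairs


# Hypothetical toy data (romanized for readability)
en_hi = [("hello", "namaste"), ("thank you", "dhanyavaad")]
en_ta = [("hello", "vanakkam"), ("good night", "iravu vanakkam")]
hi_ta = pivot_pairs(en_hi, en_ta)
```

Only English sentences present in both corpora yield a pivoted pair, so the size of the result depends on how much the English sides overlap.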
Submitted 12 June, 2023; v1 submitted 12 April, 2021;
originally announced April 2021.
-
MonkeyDB: Effectively Testing Correctness against Weak Isolation Levels
Authors:
Ranadeep Biswas,
Diptanshu Kakwani,
Jyothi Vedurada,
Constantin Enea,
Akash Lal
Abstract:
Modern applications, such as social networking systems and e-commerce platforms, are centered around using large-scale storage systems for storing and retrieving data. In the presence of concurrent accesses, these storage systems trade off isolation for performance. The weaker the isolation level, the more behaviors a storage system is allowed to exhibit, and it is up to the developer to ensure that their application can tolerate those behaviors. However, these weak behaviors occur only rarely in practice, and outside the control of the application, making it difficult for developers to test the robustness of their code against weak isolation levels.
This paper presents MonkeyDB, a mock storage system for testing storage-backed applications. MonkeyDB supports a Key-Value interface as well as SQL queries under multiple isolation levels. It uses a logical specification of the isolation level to compute, on a read operation, the set of all possible return values. MonkeyDB then returns a value randomly from this set. We show that MonkeyDB provides good coverage of weak behaviors, which is complete in the limit. We test a variety of applications for assertions that fail only under weak isolation. MonkeyDB is able to break each of those assertions in a small number of attempts.
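The idea of answering a read with a random member of the set of admissible values can be shown with a toy sketch. This is not MonkeyDB's implementation: the class `ToyWeakStore`, its methods, and the rule that "weak" reads may return any previously written value are simplifying assumptions for illustration, whereas the paper derives the admissible set from a logical specification of the isolation level.

```python
import random


class ToyWeakStore:
    """Toy key-value store that randomizes reads (hypothetical sketch)."""

    def __init__(self, seed=None):
        self._history = {}  # key -> list of all values written, in order
        self._rng = random.Random(seed)

    def put(self, key, value):
        self._history.setdefault(key, []).append(value)

    def candidates(self, key, weak=True):
        """Return the values a read of `key` is allowed to observe.

        Assumption for this sketch: a weak read may see any past write,
        while a strong read sees only the latest write.
        """
        writes = self._history.get(key, [])
        if not writes:
            return [None]
        return list(writes) if weak else [writes[-1]]

    def get(self, key, weak=True):
        # Pick uniformly at random among the admissible values.
        return self._rng.choice(self.candidates(key, weak))


store = ToyWeakStore(seed=0)
store.put("x", 1)
store.put("x", 2)
observed = {store.get("x") for _ in range(200)}
```

An application test that asserts "a read of x always returns 2" would pass against a store that always returns the latest write, but can fail here once a read happens to return the stale value 1.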
Submitted 3 March, 2021;
originally announced March 2021.
-
AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages
Authors:
Anoop Kunchukuttan,
Divyanshu Kakwani,
Satish Golla,
Gokul N. C.,
Avik Bhattacharyya,
Mitesh M. Khapra,
Pratyush Kumar
Abstract:
We present the IndicNLP corpus, a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families. We share pre-trained word embeddings trained on these corpora. We create news article category classification datasets for 9 languages to evaluate the embeddings. We show that the IndicNLP embeddings significantly outperform publicly available pre-trained embeddings on multiple evaluation tasks. We hope that the availability of the corpus will accelerate Indic NLP research. The resources are available at https://github.com/ai4bharat-indicnlp/indicnlp_corpus.
Submitted 30 April, 2020;
originally announced May 2020.
-
Distributed Algorithms for Subgraph-Centric Graph Platforms
Authors:
Diptanshu Kakwani,
Yogesh Simmhan
Abstract:
Graph analytics for large-scale graphs has gained interest in recent years. Many graph algorithms have been designed for vertex-centric distributed graph processing frameworks to operate on large graphs with 100 M vertices and edges, using commodity clusters and Clouds. Subgraph-centric programming models have shown additional performance benefits over vertex-centric models. But direct mapping of vertex-centric and shared-memory algorithms to subgraph-centric frameworks is either not possible or leads to inefficient algorithms. In this paper, we present three subgraph-centric distributed graph algorithms for triangle counting, clustering and minimum spanning forest, using variations of shared and vertex-centric models. These augment existing subgraph-centric algorithms in the literature, and allow a broader evaluation of these three classes of graph processing algorithms and platforms.
Submitted 20 May, 2019;
originally announced May 2019.