Showing 1–50 of 104 results for author: Ramakrishnan, G

Searching in archive cs.
  1. arXiv:2406.18135  [pdf]

    cs.CL cs.SD eess.AS

    Automatic Speech Recognition for Hindi

    Authors: Anish Saha, A. G. Ramakrishnan

    Abstract: Automatic speech recognition (ASR) is a key area in computational linguistics, focusing on developing technologies that enable computers to convert spoken language into text. This field combines linguistics and machine learning. ASR models, which map speech audio to transcripts through supervised learning, require handling real and unrestricted text. Text-to-speech systems directly work with real…

    Submitted 26 June, 2024; originally announced June 2024.

  2. arXiv:2406.17377  [pdf, other]

    cs.CL

    A Three-Pronged Approach to Cross-Lingual Adaptation with Multilingual LLMs

    Authors: Vaibhav Singh, Amrith Krishna, Karthika NJ, Ganesh Ramakrishnan

    Abstract: Low-resource languages, by their very definition, tend to be underrepresented in the pre-training corpora of Large Language Models. In this work, we investigate three low-resource cross-lingual approaches that enable an LLM to adapt to tasks in previously unseen languages. Llama-2 is an LLM where Indic languages, among many other language families, contribute less than $0.005\%$ of the total $2$ tr…

    Submitted 25 June, 2024; originally announced June 2024.

  3. arXiv:2405.11200  [pdf, other]

    cs.CL

    LexGen: Domain-aware Multilingual Lexicon Generation

    Authors: Karthika NJ, Ayush Maheshwari, Atul Kumar Singh, Preethi Jyothi, Ganesh Ramakrishnan, Krishnakant Bhatt

    Abstract: Lexicon or dictionary generation across domains is of significant societal importance, as it can potentially enhance information accessibility for a diverse user base while preserving language identity. Prior work in the field primarily focuses on bilingual lexical induction, which deals with word alignments using mapping-based or corpora-based approaches. Though initiated by researchers, the rese…

    Submitted 18 May, 2024; originally announced May 2024.

  4. arXiv:2403.08370  [pdf, other]

    cs.CL cs.AI cs.LG

    SMART: Submodular Data Mixture Strategy for Instruction Tuning

    Authors: H S V N S Kowndinya Renduchintala, Sumit Bhatia, Ganesh Ramakrishnan

    Abstract: Instruction Tuning involves finetuning a language model on a collection of instruction-formatted datasets in order to enhance the generalizability of the model to unseen tasks. Studies have shown the importance of balancing different task proportions during finetuning, but finding the right balance remains challenging. Unfortunately, there's currently no systematic method beyond manual tuning or r…

    Submitted 13 March, 2024; originally announced March 2024.

  5. arXiv:2403.04890  [pdf, other]

    cs.CL

    Few shot chain-of-thought driven reasoning to prompt LLMs for open ended medical question answering

    Authors: Ojas Gramopadhye, Saeel Sandeep Nachane, Prateek Chanda, Ganesh Ramakrishnan, Kshitij Sharad Jadhav, Yatin Nandwani, Dinesh Raghu, Sachindra Joshi

    Abstract: Large language models (LLMs) have demonstrated significant potential in transforming healthcare by automating tasks such as clinical documentation, information retrieval, and decision support. In this respect, carefully engineered prompts have emerged as a powerful tool for using LLMs for medical scenarios, e.g., patient clinical scenarios. In this paper, we propose a modified version of the MedQA-…

    Submitted 7 March, 2024; originally announced March 2024.

  6. arXiv:2402.15472  [pdf, other]

    cs.LG

    FAIR: Filtering of Automatically Induced Rules

    Authors: Divya Jyoti Bajpai, Ayush Maheshwari, Manjesh Kumar Hanawal, Ganesh Ramakrishnan

    Abstract: The availability of large annotated data can be a critical bottleneck in training machine learning algorithms successfully, especially when applied to diverse domains. Weak supervision offers a promising alternative by accelerating the creation of labeled training data using domain-specific rules. However, it requires users to write a diverse set of high-quality rules to assign labels to the unlab…

    Submitted 23 February, 2024; originally announced February 2024.

    Comments: Published at EACL 2024

  7. arXiv:2402.09811  [pdf, other]

    cs.CV

    TEXTRON: Weakly Supervised Multilingual Text Detection through Data Programming

    Authors: Dhruv Kudale, Badri Vishal Kasuba, Venkatapathy Subramanian, Parag Chaudhuri, Ganesh Ramakrishnan

    Abstract: Several recent deep learning (DL) based techniques perform considerably well on image-based multilingual text detection. However, their performance relies heavily on the availability and quality of training data. There are numerous types of page-level document images consisting of information in several modalities, languages, fonts, and layouts. This makes text detection a challenging problem in t…

    Submitted 15 February, 2024; originally announced February 2024.

    Comments: Accepted at the WACV 2024 Conference

  8. arXiv:2402.07173  [pdf, other]

    cs.CV

    INSITE: labelling medical images using submodular functions and semi-supervised data programming

    Authors: Akshat Gautam, Anurag Shandilya, Akshit Srivastava, Venkatapathy Subramanian, Ganesh Ramakrishnan, Kshitij Jadhav

    Abstract: The necessity of large amounts of labeled data to train deep models, especially in medical imaging, creates an implementation bottleneck in resource-constrained settings. In Insite (labelINg medical imageS usIng submodular funcTions and sEmi-supervised data programming), we apply informed subset selection to identify a small number of most representative or diverse images from a huge pool of unlabel…

    Submitted 11 February, 2024; originally announced February 2024.

  9. arXiv:2401.06989  [pdf, other]

    cs.LG

    Gradient Coreset for Federated Learning

    Authors: Durga Sivasubramanian, Lokesh Nagalapatti, Rishabh Iyer, Ganesh Ramakrishnan

    Abstract: Federated Learning (FL) is used to learn machine learning models with data that is partitioned across multiple clients, including resource-constrained edge devices. It is therefore important to devise solutions that are efficient in terms of compute, communication, and energy consumption, while ensuring compliance with the FL framework's privacy requirements. Conventional approaches to these probl…

    Submitted 13 January, 2024; originally announced January 2024.

    Comments: Accepted at WACV-24

  10. arXiv:2311.13993  [pdf, other]

    cs.CV

    EIGEN: Expert-Informed Joint Learning Aggregation for High-Fidelity Information Extraction from Document Images

    Authors: Abhishek Singh, Venkatapathy Subramanian, Ayush Maheshwari, Pradeep Narayan, Devi Prasad Shetty, Ganesh Ramakrishnan

    Abstract: Information Extraction (IE) from document images is challenging due to the high variability of layout formats. Deep models such as LayoutLM and BROS have been proposed to address this problem and have shown promising results. However, they still require a large amount of field-level annotations for training these models. Other approaches using rule-based methods have also been proposed based on th…

    Submitted 23 November, 2023; originally announced November 2023.

    Comments: In Proceedings of ML for Health Conference, 2023 (co-located with NeurIPS)

  11. arXiv:2310.18590  [pdf, other]

    cs.LG cs.AI

    Using Early Readouts to Mediate Featural Bias in Distillation

    Authors: Rishabh Tiwari, Durga Sivasubramanian, Anmol Mekala, Ganesh Ramakrishnan, Pradeep Shenoy

    Abstract: Deep networks tend to learn spurious feature-label correlations in real-world supervised learning tasks. This vulnerability is aggravated in distillation, where a student model may have lower representational capacity than the corresponding teacher model. Often, knowledge of specific spurious correlations is used to reweight instances and rebalance the learning process. We propose a novel early rea…

    Submitted 8 November, 2023; v1 submitted 28 October, 2023; originally announced October 2023.

  12. arXiv:2310.17138  [pdf, other]

    cs.CV

    A Classifier Using Global Character Level and Local Sub-unit Level Features for Hindi Online Handwritten Character Recognition

    Authors: Anand Sharma, A. G. Ramakrishnan

    Abstract: A classifier is developed that defines a joint distribution of global character features, number of sub-units and local sub-unit features to model Hindi online handwritten characters. The classifier uses latent variables to model the structure of sub-units. The classifier uses histograms of points, orientations, and dynamics of orientations (HPOD) features to represent characters at global charact…

    Submitted 26 October, 2023; originally announced October 2023.

    Comments: 23 pages, 8 jpg figures. arXiv admin note: text overlap with arXiv:2310.08222

  13. arXiv:2310.08222  [pdf, other]

    cs.CV

    Structural analysis of Hindi online handwritten characters for character recognition

    Authors: Anand Sharma, A. G. Ramakrishnan

    Abstract: Direction properties of online strokes are used to analyze them in terms of homogeneous regions or sub-strokes with points satisfying common geometric properties. Such sub-strokes are called sub-units. These properties are used to extract sub-units from Hindi ideal online characters. These properties along with some heuristics are used to extract sub-units from Hindi online handwritten characters.…

    Submitted 12 October, 2023; originally announced October 2023.

    Comments: 34 pages, 36 jpg figures

  14. arXiv:2310.06702  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Temporally Aligning Long Audio Interviews with Questions: A Case Study in Multimodal Data Integration

    Authors: Piyush Singh Pasi, Karthikeya Battepati, Preethi Jyothi, Ganesh Ramakrishnan, Tanmay Mahapatra, Manoj Singh

    Abstract: The problem of audio-to-text alignment has seen a significant amount of research using complete supervision during training. However, this is typically not in the context of long audio recordings wherein the text being queried does not appear verbatim within the audio file. This work is a collaboration with a non-governmental organization called CARE India that collects long audio health surveys fro…

    Submitted 10 October, 2023; originally announced October 2023.

    Comments: Accepted at IJCAI-23, AI and Social Good Track

  15. arXiv:2309.02067  [pdf, other]

    cs.CV eess.SP

    Histograms of Points, Orientations, and Dynamics of Orientations Features for Hindi Online Handwritten Character Recognition

    Authors: Anand Sharma, A. G. Ramakrishnan

    Abstract: A set of features independent of character stroke direction and order variations is proposed for online handwritten character recognition. A method is developed that maps features like co-ordinates of points, orientations of strokes at points, and dynamics of orientations of strokes at points spatially as a function of co-ordinate values of the points and computes histograms of these features from…

    Submitted 5 September, 2023; originally announced September 2023.

    Comments: 21 pages, 12 jpg figures

  16. arXiv:2305.14004  [pdf]

    cs.CL

    Sāmayik: A Benchmark and Dataset for English-Sanskrit Translation

    Authors: Ayush Maheshwari, Ashim Gupta, Amrith Krishna, Atul Kumar Singh, Ganesh Ramakrishnan, G. Anil Kumar, Jitin Singla

    Abstract: We release Sāmayik, a dataset of around 53,000 parallel English-Sanskrit sentences, written in contemporary prose. Sanskrit is a classical language still in sustenance and has a rich documented heritage. However, due to the limited availability of digitized content, it still remains a low-resource language. Existing Sanskrit corpora, whether monolingual or bilingual, have predominantly focused on…

    Submitted 29 March, 2024; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: LREC-COLING, 2024

  17. arXiv:2305.06677  [pdf, other]

    cs.CL cs.AI cs.LG

    INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Language Models

    Authors: H S V N S Kowndinya Renduchintala, Krishnateja Killamsetty, Sumit Bhatia, Milan Aggarwal, Ganesh Ramakrishnan, Rishabh Iyer, Balaji Krishnamurthy

    Abstract: A salient characteristic of pre-trained language models (PTLMs) is a remarkable improvement in their generalization capability and emergence of new capabilities with increasing model capacity and pre-training dataset size. Consequently, we are witnessing the development of enormous models pushing the state-of-the-art. It is, however, imperative to realize that this inevitably leads to prohibitivel…

    Submitted 19 October, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

  18. arXiv:2305.02997  [pdf, other]

    cs.LG cs.AI stat.ML

    When Do Neural Nets Outperform Boosted Trees on Tabular Data?

    Authors: Duncan McElfresh, Sujay Khandagale, Jonathan Valverde, Vishak Prasad C, Benjamin Feuer, Chinmay Hegde, Ganesh Ramakrishnan, Micah Goldblum, Colin White

    Abstract: Tabular data is one of the most commonly used types of data in machine learning. Despite recent advances in neural nets (NNs) for tabular data, there is still an active discussion on whether or not NNs generally outperform gradient-boosted decision trees (GBDTs) on tabular data, with several recent works arguing either that GBDTs consistently outperform NNs on tabular data, or vice versa. In this…

    Submitted 30 October, 2023; v1 submitted 4 May, 2023; originally announced May 2023.

    Comments: NeurIPS Datasets and Benchmarks Track 2023

  19. arXiv:2211.07980  [pdf, other]

    cs.CL

    A Benchmark and Dataset for Post-OCR text correction in Sanskrit

    Authors: Ayush Maheshwari, Nikhil Singh, Amrith Krishna, Ganesh Ramakrishnan

    Abstract: Sanskrit is a classical language with about 30 million extant manuscripts fit for digitisation, available in written, printed or scanned image forms. However, it is still considered to be a low-resource language when it comes to available digital resources. In this work, we release a post-OCR text correction dataset containing around 218,000 sentences, with 1.5 million words, from 30 different book…

    Submitted 15 November, 2022; originally announced November 2022.

    Comments: Findings of EMNLP, 2022. Code and Data: https://github.com/ayushbits/pe-ocr-sanskrit

  20. arXiv:2211.01454  [pdf, other]

    cs.LG

    Speeding up NAS with Adaptive Subset Selection

    Authors: Vishak Prasad C, Colin White, Paarth Jain, Sibasis Nayak, Ganesh Ramakrishnan

    Abstract: A majority of recent developments in neural architecture search (NAS) have been aimed at decreasing the computational cost of various techniques without affecting their final performance. Towards this goal, several low-fidelity and performance prediction methods have been considered, including those that train only on subsets of the training data. In this work, we present an adaptive subset select…

    Submitted 2 November, 2022; originally announced November 2022.

  21. arXiv:2210.16892  [pdf, other]

    cs.LG

    Partitioned Gradient Matching-based Data Subset Selection for Compute-Efficient Robust ASR Training

    Authors: Ashish Mittal, Durga Sivasubramanian, Rishabh Iyer, Preethi Jyothi, Ganesh Ramakrishnan

    Abstract: Training state-of-the-art ASR systems such as RNN-T often has a high associated financial and environmental cost. Training with a subset of training data could mitigate this problem if the subset selected could achieve on-par performance with training with the entire dataset. Although there are many data subset selection (DSS) algorithms, direct application to the RNN-T is difficult, especially the…

    Submitted 30 October, 2022; originally announced October 2022.

  22. arXiv:2210.06996  [pdf, other]

    cs.CL cs.LG

    DICTDIS: Dictionary Constrained Disambiguation for Improved NMT

    Authors: Ayush Maheshwari, Piyush Sharma, Preethi Jyothi, Ganesh Ramakrishnan

    Abstract: Domain-specific neural machine translation (NMT) systems (e.g., in educational applications) are socially significant with the potential to help make information accessible to a diverse set of users in multilingual societies. It is desirable that such NMT systems be lexically constrained and draw from domain-specific dictionaries. Dictionaries could present multiple candidate translations for a sou…

    Submitted 21 May, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

  23. arXiv:2210.03324  [pdf, other]

    cs.LG cs.AI stat.ML

    AutoML for Climate Change: A Call to Action

    Authors: Renbo Tu, Nicholas Roberts, Vishak Prasad, Sibasis Nayak, Paarth Jain, Frederic Sala, Ganesh Ramakrishnan, Ameet Talwalkar, Willie Neiswanger, Colin White

    Abstract: The challenge that climate change poses to humanity has spurred a rapidly developing field of artificial intelligence research focused on climate change applications. The climate change AI (CCAI) community works on a diverse, challenging set of problems which often involve physics-constrained ML or heterogeneous spatiotemporal data. It would be desirable to use automated machine learning (AutoML)…

    Submitted 7 October, 2022; originally announced October 2022.

  24. arXiv:2210.01526  [pdf, other]

    cs.CV

    DIAGNOSE: Avoiding Out-of-distribution Data using Submodular Information Measures

    Authors: Suraj Kothawade, Akshit Srivastava, Venkat Iyer, Ganesh Ramakrishnan, Rishabh Iyer

    Abstract: Avoiding out-of-distribution (OOD) data is critical for training supervised machine learning models in the medical imaging domain. Furthermore, obtaining labeled medical data is difficult and expensive since it requires expert annotators like doctors, radiologists, etc. Active learning (AL) is a well-known method to mitigate labeling costs by selecting the most diverse or uncertain samples. Howeve…

    Submitted 4 October, 2022; originally announced October 2022.

    Comments: Accepted to MICCAI 2022 MILLanD Workshop

  25. arXiv:2210.01520  [pdf, other]

    cs.CV

    CLINICAL: Targeted Active Learning for Imbalanced Medical Image Classification

    Authors: Suraj Kothawade, Atharv Savarkar, Venkat Iyer, Lakshman Tamil, Ganesh Ramakrishnan, Rishabh Iyer

    Abstract: Training deep learning models on medical datasets that perform well for all classes is a challenging task. It is often the case that a suboptimal performance is obtained on some classes due to the natural class imbalance issue that comes with medical data. An effective way to tackle this problem is by using targeted active learning, where we iteratively add data points to the training data that be…

    Submitted 4 October, 2022; originally announced October 2022.

    Comments: Accepted to MICCAI 2022 MILLanD Workshop

  26. arXiv:2204.04653  [pdf, other]

    cs.CV cs.MM

    Counting in the 2020s: Binned Representations and Inclusive Performance Measures for Deep Crowd Counting Approaches

    Authors: Sravya Vardhani Shivapuja, Ashwin Gopinath, Ayush Gupta, Ganesh Ramakrishnan, Ravi Kiran Sarvadevabhatla

    Abstract: The data distribution in popular crowd counting datasets is typically heavy tailed and discontinuous. This skew affects all stages within the pipelines of deep crowd counting approaches. Specifically, the approaches exhibit unacceptably large standard deviation with respect to statistical measures (MSE, MAE). To address such concerns in a holistic manner, we make two fundamental contributions. Firstly, we mod…

    Submitted 10 April, 2022; originally announced April 2022.

    Comments: Extended version of arXiv:2108.08784. In review

  27. arXiv:2203.16860  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS eess.IV

    Investigating Modality Bias in Audio Visual Video Parsing

    Authors: Piyush Singh Pasi, Shubham Nemani, Preethi Jyothi, Ganesh Ramakrishnan

    Abstract: We focus on the audio-visual video parsing (AVVP) problem that involves detecting audio and visual event labels with temporal boundaries. The task is especially challenging since it is weakly supervised with only event labels available as a bag of labels for each video. An existing state-of-the-art model for AVVP uses a hybrid attention network (HAN) to generate cross-modal features for both audio…

    Submitted 11 November, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

    Comments: Work under review for ICASSP 2023

  28. arXiv:2203.08212  [pdf, other]

    cs.LG

    AUTOMATA: Gradient Based Data Subset Selection for Compute-Efficient Hyper-parameter Tuning

    Authors: Krishnateja Killamsetty, Guttu Sai Abhishek, Aakriti, Alexandre V. Evfimievski, Lucian Popa, Ganesh Ramakrishnan, Rishabh Iyer

    Abstract: Deep neural networks have seen great success in recent years; however, training a deep model is often challenging as its performance heavily depends on the hyper-parameters used. In addition, finding the optimal hyper-parameter configuration, even with state-of-the-art (SOTA) hyper-parameter optimization (HPO) algorithms, can be time-consuming, requiring multiple training runs over the entire data…

    Submitted 15 March, 2022; originally announced March 2022.

  29. arXiv:2203.05651  [pdf, other]

    cs.LG

    BASIL: Balanced Active Semi-supervised Learning for Class Imbalanced Datasets

    Authors: Suraj Kothawade, Pavan Kumar Reddy, Ganesh Ramakrishnan, Rishabh Iyer

    Abstract: Current semi-supervised learning (SSL) methods assume a balance between the number of data points available for each class in both the labeled and the unlabeled data sets. However, there naturally exists a class imbalance in most real-world datasets. It is known that training models on such imbalanced datasets leads to biased models, which in turn lead to biased predictions towards the more freque…

    Submitted 10 March, 2022; originally announced March 2022.

  30. arXiv:2203.01644  [pdf, other]

    cs.CL

    UDAAN: Machine Learning based Post-Editing tool for Document Translation

    Authors: Ayush Maheshwari, Ajay Ravindran, Venkatapathy Subramanian, Ganesh Ramakrishnan

    Abstract: We introduce UDAAN, an open-source post-editing tool that can reduce manual editing efforts to quickly produce publishable-standard documents in several Indic languages. UDAAN has an end-to-end Machine Translation (MT) plus post-editing pipeline wherein users can upload a document to obtain raw MT output. Further, users can edit the raw translations using our tool. UDAAN offers several advantages:…

    Submitted 21 November, 2022; v1 submitted 3 March, 2022; originally announced March 2022.

    Comments: Demo paper at CoDS-COMAD 2023. Visit the project website at https://udaanproject.org

  31. arXiv:2202.10680  [pdf, other]

    cs.LG cs.IR

    Submodlib: A Submodular Optimization Library

    Authors: Vishal Kaushal, Ganesh Ramakrishnan, Rishabh Iyer

    Abstract: Submodular functions are a special class of set functions which naturally model the notion of representativeness, diversity, coverage, etc., and have been shown to be computationally very efficient. A lot of past work has applied submodular optimization to find optimal subsets in various contexts. Some examples include data summarization for efficient human consumption, finding effective smaller sub…

    Submitted 23 February, 2022; v1 submitted 22 February, 2022; originally announced February 2022.

    Comments: 23 pages with references, 10 figures, 5 tables

  32. arXiv:2202.03250  [pdf, other]

    cs.LG

    Adaptive Mixing of Auxiliary Losses in Supervised Learning

    Authors: Durga Sivasubramanian, Ayush Maheshwari, Pradeep Shenoy, Prathosh AP, Ganesh Ramakrishnan

    Abstract: In several supervised learning scenarios, auxiliary losses are used in order to introduce additional information or constraints into the supervised learning objective. For instance, knowledge distillation aims to mimic outputs of a powerful teacher model; similarly, in rule-based approaches, weak labeling information is provided by labeling functions which may be noisy rule-based approximations to…

    Submitted 7 December, 2022; v1 submitted 7 February, 2022; originally announced February 2022.

  33. arXiv:2202.01157  [pdf, other]

    cs.CL cs.LG

    Error Correction in ASR using Sequence-to-Sequence Models

    Authors: Samrat Dutta, Shreyansh Jain, Ayush Maheshwari, Souvik Pal, Ganesh Ramakrishnan, Preethi Jyothi

    Abstract: Post-editing in Automatic Speech Recognition (ASR) entails automatically correcting common and systematic errors produced by the ASR system. The outputs of an ASR system are largely prone to phonetic and spelling errors. In this paper, we propose to use a powerful pre-trained sequence-to-sequence model, BART, further adaptively trained to serve as a denoising model, to correct errors of such types…

    Submitted 23 August, 2022; v1 submitted 2 February, 2022; originally announced February 2022.

  34. arXiv:2110.04908  [pdf, other]

    eess.AS cs.SD

    DITTO: Data-efficient and Fair Targeted Subset Selection for ASR Accent Adaptation

    Authors: Suraj Kothawade, Anmol Mekala, Chandra Sekhara D, Mayank Kothyari, Rishabh Iyer, Ganesh Ramakrishnan, Preethi Jyothi

    Abstract: State-of-the-art Automatic Speech Recognition (ASR) systems are known to exhibit disparate performance on varying speech accents. To improve performance on a specific target accent, a commonly adopted solution is to finetune the ASR model using accent-specific labeled speech. However, acquiring large amounts of labeled speech for specific target accents is challenging. Choosing an informative subs…

    Submitted 5 June, 2023; v1 submitted 10 October, 2021; originally announced October 2021.

    Comments: ACL 2023

  35. arXiv:2109.11410  [pdf, other]

    cs.LG

    Learning to Robustly Aggregate Labeling Functions for Semi-supervised Data Programming

    Authors: Ayush Maheshwari, Krishnateja Killamsetty, Ganesh Ramakrishnan, Rishabh Iyer, Marina Danilevsky, Lucian Popa

    Abstract: A critical bottleneck in supervised machine learning is the need for large amounts of labeled data, which is expensive and time-consuming to obtain. However, it has been shown that a small amount of labeled data, while insufficient to re-train a model, can be effectively used to generate human-interpretable labeling functions (LFs). These LFs, in turn, have been used to generate a large amount of a…

    Submitted 10 March, 2022; v1 submitted 23 September, 2021; originally announced September 2021.

    Comments: Findings of ACL, 2022

  36. arXiv:2109.05494  [pdf, other]

    cs.CL cs.SD eess.AS

    Unsupervised Domain Adaptation Schemes for Building ASR in Low-resource Languages

    Authors: Anoop C S, Prathosh A P, A G Ramakrishnan

    Abstract: Building an automatic speech recognition (ASR) system from scratch requires a large amount of annotated speech data, which is difficult to collect in many languages. However, there are cases where the low-resource language shares a common acoustic space with a high-resource language having enough annotated data to build an ASR. In such cases, we show that the domain-independent acoustic models lea…

    Submitted 16 September, 2021; v1 submitted 12 September, 2021; originally announced September 2021.

    Comments: Submitted to ASRU 2021

  37. Wisdom of (Binned) Crowds: A Bayesian Stratification Paradigm for Crowd Counting

    Authors: Sravya Vardhani Shivapuja, Mansi Pradeep Khamkar, Divij Bajaj, Ganesh Ramakrishnan, Ravi Kiran Sarvadevabhatla

    Abstract: Datasets for training crowd counting deep networks are typically heavy-tailed in count distribution and exhibit discontinuities across the count range. As a result, the de facto statistical measures (MSE, MAE) exhibit large variance and tend to be unreliable indicators of performance across the count range. To address these concerns in a holistic manner, we revise processes at various stages of th…

    Submitted 19 August, 2021; originally announced August 2021.

    Comments: Accepted at ACM Multimedia (ACMMM) 2021. Code, pretrained models and interactive visualizations can be viewed at our project page https://deepcount.iiit.ac.in/

  38. arXiv:2108.00373  [pdf, other]

    cs.LG

    SPEAR : Semi-supervised Data Programming in Python

    Authors: Guttu Sai Abhishek, Harshad Ingole, Parth Laturia, Vineeth Dorna, Ayush Maheshwari, Rishabh Iyer, Ganesh Ramakrishnan

    Abstract: We present SPEAR, an open-source Python library for data programming with semi-supervision. The package implements several recent data programming approaches, including a facility to programmatically label and build training data. SPEAR facilitates weak supervision in the form of heuristics (or rules) and association of noisy labels to the training dataset. These noisy labels are aggregated to assign…

    Submitted 5 October, 2022; v1 submitted 1 August, 2021; originally announced August 2021.

    Comments: EMNLP Demonstrations - 2022

  39. arXiv:2106.15324  [pdf, other]

    cs.CV cs.AI cs.LG

    Effective Evaluation of Deep Active Learning on Image Classification Tasks

    Authors: Nathan Beck, Durga Sivasubramanian, Apurva Dani, Ganesh Ramakrishnan, Rishabh Iyer

    Abstract: With the goal of making deep learning more label-efficient, a growing number of papers have been studying active learning (AL) for deep models. However, there are a number of issues in the prevalent experimental settings, mainly stemming from a lack of unified implementation and benchmarking. Issues in the current literature include sometimes contradictory observations on the performance of differ…

    Submitted 2 November, 2021; v1 submitted 16 June, 2021; originally announced June 2021.

    Comments: 10 pages in main paper, 6 figures in main paper, 2 tables in main paper. 24 pages in total, 15 figures in total, 13 tables in total

  40. arXiv:2106.12491  [pdf, other]

    cs.LG stat.ML

    Training Data Subset Selection for Regression with Controlled Generalization Error

    Authors: Durga Sivasubramanian, Rishabh Iyer, Ganesh Ramakrishnan, Abir De

    Abstract: Data subset selection from a large number of training instances has been a successful approach toward efficient and cost-effective machine learning. However, models trained on a smaller subset may show poor generalization ability. In this paper, our goal is to design an algorithm for selecting a subset of the training data, so that the model can be trained quickly, without significantly sacrificin…

    Submitted 23 June, 2021; originally announced June 2021.

    Journal ref: ICML 2021

  41. arXiv:2106.05852  [pdf]

    eess.AS cs.CL cs.LG cs.SD

    Automatic Speech Recognition in Sanskrit: A New Speech Corpus and Modelling Insights

    Authors: Devaraja Adiga, Rishabh Kumar, Amrith Krishna, Preethi Jyothi, Ganesh Ramakrishnan, Pawan Goyal

    Abstract: Automatic speech recognition (ASR) in Sanskrit is interesting, owing to the various linguistic peculiarities present in the language. The Sanskrit language is lexically productive, undergoes euphonic assimilation of phones at the word boundaries and exhibits variations in spelling conventions and in pronunciations. In this work, we propose the first large scale study of automatic speech recognitio…

    Submitted 23 July, 2021; v1 submitted 2 June, 2021; originally announced June 2021.

    Comments: Accepted paper at the 59th Annual Meeting of the Association for Computational Linguistics (ACL 2021 Findings)

  42. arXiv:2105.10193  [pdf, other]

    cs.CL cs.LG

    Rule Augmented Unsupervised Constituency Parsing

    Authors: Atul Sahay, Anshul Nasery, Ayush Maheshwari, Ganesh Ramakrishnan, Rishabh Iyer

    Abstract: Recently, unsupervised parsing of syntactic trees has gained considerable attention. A prototypical approach to such unsupervised parsing employs reinforcement learning and auto-encoders. However, no mechanism ensures that the learnt model leverages the well-understood language grammar. We propose an approach that utilizes very generic linguistic knowledge of the language present in the form of sy…

    Submitted 21 May, 2021; originally announced May 2021.

    Comments: Accepted at Findings of ACL 2021. 10 Pages, 5 Tables, 2 Figures

  43. arXiv:2105.00043  [pdf, other]

    cs.LG cs.CV

    Submodular Mutual Information for Targeted Data Subset Selection

    Authors: Suraj Kothawade, Vishal Kaushal, Ganesh Ramakrishnan, Jeff Bilmes, Rishabh Iyer

    Abstract: With the rapid growth of data, it is becoming increasingly difficult to train or improve deep learning models with the right subset of data. We show that this problem can be effectively solved at an additional labeling cost by targeted data subset selection (TSS) where a subset of unlabeled data points similar to an auxiliary set are added to the training data. We do so by using a rich class of Sub…

    Submitted 30 April, 2021; originally announced May 2021.

    Comments: Accepted to ICLR 2021 S2D-OLAD Workshop; https://s2d-olad.github.io/. arXiv admin note: substantial text overlap with arXiv:2103.00128

  44. arXiv:2104.06722  [pdf, other]

    cs.CL cs.LG

    WARM: A Weakly (+Semi) Supervised Model for Solving Math word Problems

    Authors: Oishik Chatterjee, Isha Pandey, Aashish Waikar, Vishwajeet Kumar, Ganesh Ramakrishnan

    Abstract: Solving math word problems (MWPs) is an important and challenging problem in natural language processing. Existing approaches to solve MWPs require full supervision in the form of intermediate equations. However, labeling every MWP with its corresponding equations is a time-consuming and expensive task. In order to address this challenge of equation annotation, we propose a weakly supervised model…

    Submitted 13 June, 2023; v1 submitted 14 April, 2021; originally announced April 2021.

    Comments: Accepted in COLING'22

  45. arXiv:2104.04998  [pdf, other]

    cs.CL cs.LG

    Unsupervised Learning of Explainable Parse Trees for Improved Generalisation

    Authors: Atul Sahay, Ayush Maheshwari, Ritesh Kumar, Ganesh Ramakrishnan, Manjesh Kumar Hanawal, Kavi Arya

    Abstract: Recursive neural networks (RvNN) have been shown to be useful for learning sentence representations and have helped achieve competitive performance on several natural language inference tasks. However, recent RvNN-based models fail to learn simple grammar and meaningful semantics in their intermediate tree representation. In this work, we propose an attention mechanism over Tree-LSTMs to learn more meaningfu…

    Submitted 11 April, 2021; originally announced April 2021.

    Comments: 8 Pages, 5 Tables, 4 Figures. To appear at IJCNN 2021

  46. arXiv:2104.04598  [pdf, other]

    cs.SD cs.CV cs.LG eess.AS eess.IV

    Cross-Modal learning for Audio-Visual Video Parsing

    Authors: Jatin Lamba, Abhishek, Jayaprakash Akula, Rishabh Dabral, Preethi Jyothi, Ganesh Ramakrishnan

    Abstract: In this paper, we present a novel approach to the audio-visual video parsing (AVVP) task that demarcates events from a video separately for audio and visual modalities. The proposed parsing approach simultaneously detects the temporal boundaries in terms of start and end times of such events. We show how AVVP can benefit from the following techniques geared towards effective cross-modal learning:…

    Submitted 21 June, 2021; v1 submitted 3 April, 2021; originally announced April 2021.

    Comments: Work accepted at Interspeech 2021

  47. Select, Substitute, Search: A New Benchmark for Knowledge-Augmented Visual Question Answering

    Authors: Aman Jain, Mayank Kothyari, Vishwajeet Kumar, Preethi Jyothi, Ganesh Ramakrishnan, Soumen Chakrabarti

    Abstract: Multimodal IR, spanning text corpus, knowledge graph and images, called outside knowledge visual question answering (OKVQA), is of much recent interest. However, the popular data set has serious limitations. A surprisingly large fraction of queries do not assess the ability to integrate cross-modal information. Instead, some are independent of the image, some depend on speculation, some require OC…

    Submitted 10 August, 2021; v1 submitted 9 March, 2021; originally announced March 2021.

    Comments: Accepted at SIGIR 2021

  48. arXiv:2103.05457  [pdf, other]

    cs.IR

    Rudder: A Cross Lingual Video and Text Retrieval Dataset

    Authors: Jayaprakash A, Abhishek, Rishabh Dabral, Ganesh Ramakrishnan, Preethi Jyothi

    Abstract: Video retrieval using natural language queries requires learning semantically meaningful joint embeddings between the text and the audio-visual input. Often, such joint embeddings are learnt using pairwise (or triplet) contrastive loss objectives which cannot give enough attention to 'difficult-to-retrieve' samples during training. This problem is especially pronounced in data-scarce settings wher…

    Submitted 9 March, 2021; originally announced March 2021.

  49. arXiv:2103.00128  [pdf, other]

    cs.CV

    PRISM: A Rich Class of Parameterized Submodular Information Measures for Guided Subset Selection

    Authors: Suraj Kothawade, Vishal Kaushal, Ganesh Ramakrishnan, Jeff Bilmes, Rishabh Iyer

    Abstract: With ever-increasing dataset sizes, subset selection techniques are becoming increasingly important for a plethora of tasks. It is often necessary to guide the subset selection to achieve certain desiderata, which include focusing or targeting certain data points, while avoiding others. Examples of such problems include: i) targeted learning, where the goal is to find subsets with rare classes or…

    Submitted 8 March, 2022; v1 submitted 26 February, 2021; originally announced March 2021.

    Comments: To Appear In 36th AAAI Conference on Artificial Intelligence, AAAI 2022

  50. arXiv:2103.00123  [pdf, other]

    cs.LG

    GRAD-MATCH: Gradient Matching based Data Subset Selection for Efficient Deep Model Training

    Authors: Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, Abir De, Rishabh Iyer

    Abstract: The great success of modern machine learning models on large datasets is contingent on extensive computational resources with high financial and environmental costs. One way to address this is by extracting subsets that generalize on par with the full data. In this work, we propose a general framework, GRAD-MATCH, which finds subsets that closely match the gradient of the training or validation se…

    Submitted 11 June, 2021; v1 submitted 26 February, 2021; originally announced March 2021.

    Comments: To appear in Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021