Search | arXiv e-print repository

doi 10.1109/SLT54892.2023.10022475

Streaming Bilingual End-to-End ASR model using Attention over Multiple Softmax

Authors: Aditya Patil, Vikas Joshi, Purvi Agrawal, Rupesh Mehta

Abstract: Even with several advancements in multilingual modeling, it is challenging to recognize multiple languages using a single neural model without knowing the input language, and most multilingual models assume that the input language is known. In this work, we propose a novel bilingual end-to-end (E2E) modeling approach, where a single neural model can recognize both languages and also support switching between the languages, without any language input from the user. The proposed model has shared encoder and prediction networks, with language-specific joint networks that are combined via a self-attention mechanism. Combining the language-specific posteriors produces a single posterior probability distribution over all the output symbols, enabling a single beam search decoding and also allowing dynamic switching between the languages. The proposed approach outperforms the conventional bilingual baseline, achieving relative word error rate reductions of 13.3%, 8.23% and 1.3% on the Hindi, English and code-mixed test sets, respectively.
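A minimal sketch of the posterior combination step, assuming each language-specific joint network emits logits over its own output vocabulary and that the self-attention mechanism has already been reduced to one scalar score per language. The function and argument names, the toy symbols (`na_hi`, `ma_hi`) and all numbers are hypothetical; the point is only that a convex mixture of per-language softmax outputs is itself a single distribution over the union of output symbols, which one beam search can consume.

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_language_posteriors(
    joint_logits: Dict[str, List[float]],
    vocabularies: Dict[str, List[str]],
    attention_scores: Dict[str, float],
) -> Dict[str, float]:
    """Mix per-language softmax outputs into one posterior over all symbols.

    joint_logits: logits from each (hypothetical) language-specific joint network.
    vocabularies: output symbols of each language, aligned with joint_logits.
    attention_scores: one unnormalised score per language; in the paper these
        come from a self-attention mechanism, here they are simply given.
    """
    languages = list(joint_logits)
    weights = softmax([attention_scores[lang] for lang in languages])
    posterior: Dict[str, float] = {}
    for weight, lang in zip(weights, languages):
        for symbol, prob in zip(vocabularies[lang], softmax(joint_logits[lang])):
            # Symbols shared by both vocabularies (e.g. blank) accumulate mass.
            posterior[symbol] = posterior.get(symbol, 0.0) + weight * prob
    return posterior


posterior = combine_language_posteriors(
    joint_logits={"hi": [2.0, 0.5, 0.1], "en": [0.3, 1.5, 0.2]},
    vocabularies={"hi": ["<blank>", "na_hi", "ma_hi"], "en": ["<blank>", "n", "m"]},
    attention_scores={"hi": 0.2, "en": 1.0},
)
print(round(sum(posterior.values()), 6))  # 1.0
```

Because the attention weights sum to one and each softmax sums to one, the mixture needs no renormalization.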

Submitted 21 January, 2024; originally announced January 2024.

Comments: Published in IEEE's Spoken Language Technology (SLT) 2022, 8 pages (6 + 2 for references), 5 figures

Journal ref: 2022 IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar, 2023, pp. 252-259

arXiv:2309.11097

Evaluating Mental Stress Among College Students Using Heart Rate and Hand Acceleration Data Collected from Wearable Sensors

Authors: Moein Razavi, Anthony McDonald, Ranjana Mehta, Farzan Sasangohar

Abstract: Stress is associated with various mental health disorders, including depression and anxiety, among college students. Early stress diagnosis and intervention may lower the risk of developing mental illnesses. We examined a machine learning-based method for identifying stress using data collected in a naturalistic study, with self-reported stress as ground truth and physiological data such as heart rate and hand acceleration. The study involved 54 college students from a large campus who used wearable wrist-worn sensors and a mobile health (mHealth) application continuously for 40 days. The app gathered physiological data, including heart rate and hand acceleration, at a sampling frequency of 1 Hz. The application also enabled users to self-report stress by tapping on the watch face, resulting in a time-stamped record of self-reported stress. We created, evaluated, and analyzed machine learning algorithms for identifying stress episodes among college students using heart rate and accelerometer data. The XGBoost method was the most reliable model, with an AUC of 0.64 and an accuracy of 84.5%. The standard deviation of hand acceleration, the standard deviation of heart rate, and the minimum heart rate were the most important features for stress detection. This evidence may support the efficacy of identifying patterns in physiological reactions to stress using smartwatch sensors and may inform the design of future tools for real-time stress detection.
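The three most important features named above can be computed per window roughly as follows. This is a sketch under stated assumptions: both streams are sampled at 1 Hz and aligned sample by sample, and the 60-second non-overlapping window and the feature names are hypothetical choices, not taken from the study. The resulting rows could then be passed to a gradient-boosted classifier such as XGBoost.

```python
import statistics
from typing import Dict, List


def window_features(
    heart_rate: List[float],
    hand_accel: List[float],
    window_s: int = 60,
) -> List[Dict[str, float]]:
    """Per-window features from 1 Hz heart-rate and hand-acceleration streams.

    window_s is a hypothetical window length; one sample per second is assumed,
    so a window of window_s samples spans window_s seconds.
    """
    features = []
    n = min(len(heart_rate), len(hand_accel))
    # Non-overlapping windows; a trailing partial window is dropped.
    for start in range(0, n - window_s + 1, window_s):
        hr = heart_rate[start:start + window_s]
        acc = hand_accel[start:start + window_s]
        features.append({
            "hr_std": statistics.pstdev(hr),
            "hr_min": min(hr),
            "acc_std": statistics.pstdev(acc),
        })
    return features
```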

Submitted 20 September, 2023; originally announced September 2023.

arXiv:2308.10984

Debiasing Counterfactuals In the Presence of Spurious Correlations

Authors: Amar Kumar, Nima Fathi, Raghav Mehta, Brennan Nichyporuk, Jean-Pierre R. Falet, Sotirios Tsaftaris, Tal Arbel

Abstract: Deep learning models can perform well in complex medical imaging classification tasks even when they base their conclusions on spurious correlations (i.e., confounders) that are prevalent in the training dataset, rather than on the causal image markers of interest. This limits their ability to generalize across the population. Explainability based on counterfactual image generation can be used to expose the confounders but does not provide a strategy to mitigate the bias. In this work, we introduce the first end-to-end training framework that integrates both (i) popular debiasing classifiers (e.g., distributionally robust optimization (DRO)) to avoid latching onto the spurious correlations and (ii) counterfactual image generation to unveil generalizable imaging markers of relevance to the task. Additionally, we propose a novel metric, Spurious Correlation Latching Score (SCLS), to quantify the extent of the classifier's reliance on the spurious correlation as exposed by the counterfactual images. Through comprehensive experiments on two public datasets (with simulated and real visual artifacts), we demonstrate that the debiasing method: (i) learns generalizable markers across the population, and (ii) successfully ignores spurious correlations and focuses on the underlying disease pathology.
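DRO is named above as one example of a debiasing classifier. A minimal sketch of a common online group-DRO update follows, assuming the training data can be split into groups (for instance, images with and without the confounding artifact). Group names and the step size are hypothetical; this illustrates DRO in general, not the authors' full framework with counterfactual generation or SCLS.

```python
import math
from typing import Dict


def group_dro_step(
    group_losses: Dict[str, float],
    group_weights: Dict[str, float],
    step_size: float = 0.1,
) -> Dict[str, float]:
    """One exponentiated-gradient update of adversarial group weights.

    Groups with higher loss are up-weighted, so the classifier is pushed to do
    well on the worst group instead of only on the easy majority.
    """
    unnorm = {g: group_weights[g] * math.exp(step_size * group_losses[g]) for g in group_weights}
    total = sum(unnorm.values())
    return {g: w / total for g, w in unnorm.items()}


def robust_loss(group_losses: Dict[str, float], group_weights: Dict[str, float]) -> float:
    """Weighted training loss that the classifier would minimise."""
    return sum(group_weights[g] * group_losses[g] for g in group_weights)
```

In training, the weights and the classifier are updated alternately, so the loss being minimised tracks the currently worst-performing group.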

Submitted 21 August, 2023; originally announced August 2023.

Comments: Accepted to the FAIMI (Fairness of AI in Medical Imaging) workshop at MICCAI 2023

arXiv:2307.01738

Mitigating Calibration Bias Without Fixed Attribute Grouping for Improved Fairness in Medical Imaging Analysis

Authors: Changjian Shui, Justin Szeto, Raghav Mehta, Douglas L. Arnold, Tal Arbel

Abstract: Trustworthy deployment of deep learning medical imaging models into real-world clinical practice requires that they be calibrated. However, models that are well calibrated overall can still be poorly calibrated for a sub-population, potentially resulting in a clinician unwittingly making poor decisions for this group based on the recommendations of the model. Although methods have been shown to successfully mitigate biases across subgroups in terms of model accuracy, this work focuses on the open problem of mitigating calibration biases in the context of medical image analysis. Our method does not require subgroup attributes during training, permitting the flexibility to mitigate biases for different choices of sensitive attributes without re-training. To this end, we propose a novel two-stage method, Cluster-Focal, which first identifies poorly calibrated samples and clusters them into groups, and then introduces a group-wise focal loss to reduce calibration bias. We evaluate our method on skin lesion classification with the public HAM10000 dataset, and on predicting future lesional activity for multiple sclerosis (MS) patients. In addition to considering traditional sensitive attributes (e.g., age, sex) with demographic subgroups, we also consider biases among groups with different image-derived attributes, such as lesion load, which are required in medical image analysis. Our results demonstrate that our method effectively controls calibration error in the worst-performing subgroups while preserving prediction performance, and outperforms recent baselines.
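Cluster-Focal itself is not reproduced here. As a minimal sketch of the quantity it targets, the code below measures calibration error per subgroup and reports the worst group, assuming equal-width-bin expected calibration error (ECE) as the calibration metric; the group labels (for example, lesion-load buckets) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: bin-size-weighted mean of |confidence - accuracy|."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin rather than overflowing.
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


def worst_group_ece(
    confidences: Sequence[float],
    correct: Sequence[bool],
    groups: Sequence[str],
    n_bins: int = 10,
) -> Tuple[str, float]:
    """Return the group with the highest ECE and that ECE value."""
    per_group: Dict[str, Tuple[List[float], List[bool]]] = defaultdict(lambda: ([], []))
    for conf, ok, grp in zip(confidences, correct, groups):
        per_group[grp][0].append(conf)
        per_group[grp][1].append(ok)
    scores = {g: expected_calibration_error(c, k, n_bins) for g, (c, k) in per_group.items()}
    worst = max(scores, key=scores.get)
    return worst, scores[worst]
```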

Submitted 20 July, 2023; v1 submitted 4 July, 2023; originally announced July 2023.

arXiv:2210.17398

Rethinking Generalization: The Impact of Annotation Style on Medical Image Segmentation

Authors: Brennan Nichyporuk, Jillian Cardinell, Justin Szeto, Raghav Mehta, Jean-Pierre R. Falet, Douglas L. Arnold, Sotirios A. Tsaftaris, Tal Arbel

Abstract: Generalization is an important attribute of machine learning models, particularly for those that are to be deployed in a medical context, where unreliable predictions can have real-world consequences. While the failure of models to generalize across datasets is typically attributed to a mismatch in the data distributions, performance gaps are often a consequence of biases in the 'ground-truth' label annotations. This is particularly important in the context of medical image segmentation of pathological structures (e.g., lesions), where the annotation process is much more subjective and affected by a number of underlying factors, including the annotation protocol, rater education/experience, and clinical aims, among others. In this paper, we show that modeling annotation biases, rather than ignoring them, is a promising way of accounting for differences in annotation style across datasets. To this end, we propose a generalized conditioning framework to (1) learn and account for different annotation styles across multiple datasets using a single model, (2) identify similar annotation styles across different datasets in order to permit their effective aggregation, and (3) fine-tune a fully trained model to a new annotation style with just a few samples. Next, we present an image-conditioning approach to model annotation styles that correlate with specific image features, potentially enabling detection biases to be identified more easily.

Submitted 13 December, 2022; v1 submitted 31 October, 2022; originally announced October 2022.

Comments: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://www.melba-journal.org/papers/2022:029.html

Journal ref: Machine Learning for Biomedical Imaging 1 (2022)

arXiv:2204.00348

WavFT: Acoustic model finetuning with labelled and unlabelled data

Authors: Utkarsh Chauhan, Vikas Joshi, Rupesh R. Mehta

Abstract: Unsupervised and self-supervised learning methods have leveraged unlabelled data to improve pretrained models. However, these methods need a significantly large amount of unlabelled data, and the computational cost of training models with such large amounts of data can be prohibitively high. We address this issue by using unlabelled data during finetuning instead of pretraining. We propose acoustic model finetuning (FT) using labelled and unlabelled data. The model is jointly trained to learn representations to classify senones, as well as to learn contextual acoustic representations. Our training objective is a combination of cross-entropy loss, suitable for the classification task, and contrastive loss, suitable for learning acoustic representations. The proposed approach outperforms conventional finetuning, achieving relative word error rate reductions (WERR) of 11.2% and 9.19% on Gujarati and Bengali, respectively.
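The abstract does not specify the exact form of the contrastive term. The sketch below assumes an InfoNCE-style loss over cosine similarities (one positive, several negatives) and a hypothetical weighting factor `alpha`, and lets unlabelled frames contribute only the contrastive term. Plain lists stand in for network outputs; in practice these would be tensors from the acoustic model.

```python
import math
from typing import Optional, Sequence


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """-log softmax(logits)[label], computed stably via log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def info_nce(anchor, positive, negatives, temperature: float = 0.1) -> float:
    """Contrastive loss: identify the positive among {positive} + negatives."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    return cross_entropy([s / temperature for s in sims], 0)


def finetune_loss(
    senone_logits: Sequence[float],
    senone_label: Optional[int],
    anchor, positive, negatives,
    alpha: float = 1.0,
) -> float:
    """Senone cross entropy on labelled frames plus a weighted contrastive term.

    Unlabelled frames (senone_label is None) contribute only the contrastive term.
    """
    contrastive = alpha * info_nce(anchor, positive, negatives)
    if senone_label is None:
        return contrastive
    return cross_entropy(senone_logits, senone_label) + contrastive
```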

Submitted 1 April, 2022; originally announced April 2022.

arXiv:2112.10074

doi 10.59275/j.melba.2022-354b

QU-BraTS: MICCAI BraTS 2020 Challenge on Quantifying Uncertainty in Brain Tumor Segmentation - Analysis of Ranking Scores and Benchmarking Results

Authors: Raghav Mehta, Angelos Filos, Ujjwal Baid, Chiharu Sako, Richard McKinley, Michael Rebsamen, Katrin Datwyler, Raphael Meier, Piotr Radojewski, Gowtham Krishnan Murugesan, Sahil Nalawade, Chandan Ganesh, Ben Wagner, Fang F. Yu, Baowei Fei, Ananth J. Madhuranthakam, Joseph A. Maldjian, Laura Daza, Catalina Gomez, Pablo Arbelaez, Chengliang Dai, Shuo Wang, Hadrien Reynaud, Yuan-han Mo, Elsa Angelini, et al. (67 additional authors not shown)

Abstract: Deep learning (DL) models have provided state-of-the-art performance in various medical imaging benchmarking challenges, including the Brain Tumor Segmentation (BraTS) challenges. However, the task of focal pathology multi-compartment segmentation (e.g., tumor and lesion sub-regions) is particularly challenging, and potential errors hinder translating DL models into clinical workflows. Quantifying the reliability of DL model predictions in the form of uncertainties could enable clinical review of the most uncertain regions, thereby building trust and paving the way toward clinical translation. Several uncertainty estimation methods have recently been introduced for DL medical image segmentation tasks. Developing scores to evaluate and compare the performance of uncertainty measures will assist the end-user in making more informed decisions. In this study, we explore and evaluate a score developed during the BraTS 2019 and BraTS 2020 task on uncertainty quantification (QU-BraTS) and designed to assess and rank uncertainty estimates for brain tumor multi-compartment segmentation. This score (1) rewards uncertainty estimates that produce high confidence in correct assertions and those that assign low confidence levels to incorrect assertions, and (2) penalizes uncertainty measures that lead to a higher percentage of under-confident correct assertions. We further benchmark the segmentation uncertainties generated by 14 independent participating teams of QU-BraTS 2020, all of which also participated in the main BraTS segmentation task.
Overall, our findings confirm the importance and complementary value that uncertainty estimates provide to segmentation algorithms, highlighting the need for uncertainty quantification in medical image analyses. Finally, in favor of transparency and reproducibility, our evaluation code is publicly available at: https://github.com/RagMeh11/QU-BraTS.

Submitted 23 August, 2022; v1 submitted 19 December, 2021; originally announced December 2021.

Comments: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA): https://www.melba-journal.org/papers/2022:026.html

Journal ref: Machine Learning for Biomedical Imaging 1 (2022)

arXiv:2111.01561

Sub-cortical structure segmentation database for young population

Authors: Jayanthi Sivaswamy, Alphin J Thottupattu, Mythri V, Raghav Mehta, R Sheelakumari, Chandrasekharan Kesavadas

Abstract: Segmentation of sub-cortical structures from MRI scans is of interest in many neurological diagnoses. Since this is a laborious task, machine learning and, specifically, deep learning (DL) methods have been explored. The structural complexity of the brain demands a large, high-quality segmentation dataset to develop good DL-based solutions for sub-cortical structure segmentation. Towards this, we are releasing a set of 114 T1 MRI scans, acquired at 1.5 Tesla, with manual delineations for 14 sub-cortical structures. The scans in the dataset were acquired from healthy young (21-30 years) subjects (58 male and 56 female), and all the structures are manually delineated by experienced radiology experts. Segmentation experiments have been conducted with this dataset, and the results demonstrate that deep learning methods can achieve accurate segmentation. Our sub-cortical structure segmentation dataset, the Indian Brain Segmentation Dataset (IBSD), is openly available at https://doi.org/10.5281/zenodo.5656776.

Submitted 9 November, 2021; v1 submitted 1 November, 2021; originally announced November 2021.

arXiv:2108.00713

Cohort Bias Adaptation in Aggregated Datasets for Lesion Segmentation

Authors: Brennan Nichyporuk, Jillian Cardinell, Justin Szeto, Raghav Mehta, Sotirios Tsaftaris, Douglas L. Arnold, Tal Arbel

Abstract: Many automatic machine learning models developed for focal pathology (e.g., lesions, tumours) detection and segmentation perform well, but do not generalize as well to new patient cohorts, impeding their widespread adoption into real clinical contexts. One strategy to create a more diverse, generalizable training set is to naively pool datasets from different cohorts. Surprisingly, training on this big data does not necessarily increase, and may even reduce, overall performance and model generalizability, due to the existence of cohort biases that affect label distributions. In this paper, we propose a generalized affine conditioning framework to learn and account for cohort biases across multi-source datasets, which we call Source-Conditioned Instance Normalization (SCIN). Through extensive experimentation on three different, large-scale, multi-scanner, multi-centre Multiple Sclerosis (MS) clinical trial MRI datasets, we show that our cohort bias adaptation method (1) improves performance of the network on pooled datasets relative to naively pooling datasets and (2) can quickly adapt to a new cohort by fine-tuning the instance normalization parameters, thus learning the new cohort bias with only 10 labelled samples.
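A minimal sketch of source-conditioned instance normalization as we read it from the description above: each channel is normalized per instance, then scaled and shifted by affine parameters looked up by the data source. Plain Python lists stand in for feature maps, and the source names and parameter values are hypothetical; in a real network, gamma and beta would be learned parameters, one pair per source.

```python
import math
from typing import Dict, List, Tuple

Affine = Tuple[List[float], List[float]]  # (gamma, beta), one entry per channel


def instance_norm(channels: List[List[float]], eps: float = 1e-5) -> List[List[float]]:
    """Normalise each channel of a single instance to zero mean, unit variance."""
    out = []
    for values in channels:
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        out.append([(v - mean) / math.sqrt(var + eps) for v in values])
    return out


def source_conditioned_instance_norm(
    channels: List[List[float]], source: str, affine: Dict[str, Affine]
) -> List[List[float]]:
    """Instance norm followed by an affine transform selected by the data source.

    Each cohort (source) owns its own gamma/beta; adapting to a new cohort
    amounts to learning a new (gamma, beta) entry while other weights stay fixed.
    """
    gamma, beta = affine[source]
    normed = instance_norm(channels)
    return [[g * v + b for v in values] for values, g, b in zip(normed, gamma, beta)]
```

Keeping the cohort-specific part down to two vectors per source is what would make few-shot adaptation cheap: only those vectors need fitting for a new cohort.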

Submitted 18 May, 2022; v1 submitted 2 August, 2021; originally announced August 2021.

Comments: Accepted at DART 2021

arXiv:2103.16617

HAD-Net: A Hierarchical Adversarial Knowledge Distillation Network for Improved Enhanced Tumour Segmentation Without Post-Contrast Images

Authors: Saverio Vadacchino, Raghav Mehta, Nazanin Mohammadi Sepahvand, Brennan Nichyporuk, James J. Clark, Tal Arbel

Abstract: Segmentation of enhancing tumours or lesions from MRI is important for detecting new disease activity in many clinical contexts. However, accurate segmentation requires the inclusion of medical images (e.g., T1 post-contrast MRI) acquired after injecting patients with a contrast agent (e.g., Gadolinium), a process no longer thought to be safe. Although a number of modality-agnostic segmentation networks have been developed over the past few years, they have met with limited success in the context of enhancing pathology segmentation. In this work, we present HAD-Net, a novel offline adversarial knowledge distillation (KD) technique, whereby a pre-trained teacher segmentation network, with access to all MRI sequences, teaches a student network, via hierarchical adversarial training, to better overcome the large domain shift presented when crucial images are absent during inference. In particular, we apply HAD-Net to the challenging task of enhancing tumour segmentation when access to post-contrast imaging is not available. The proposed network is trained and tested on the BraTS 2019 brain tumour segmentation challenge dataset, where it achieves performance improvements in the range of 16% to 26% over (a) recent modality-agnostic segmentation methods (U-HeMIS, U-HVED), (b) KD-Net adapted to this problem, (c) the pre-trained student network and (d) a non-hierarchical version of the network (AD-Net), in terms of Dice scores for enhancing tumour (ET). The network also shows improvements in tumour core (TC) Dice scores.
Finally, the network outperforms both the baseline student network and AD-Net in terms of uncertainty quantification for enhancing tumour segmentation based on the BraTS 2019 uncertainty challenge metrics. Our code is publicly available at: https://github.com/SaverioVad/HAD_Net

Submitted 12 May, 2021; v1 submitted 30 March, 2021; originally announced March 2021.

Comments: Accepted at Medical Imaging with Deep Learning (MIDL) 2021

arXiv:2008.06871

Attractive Ellipsoid Sliding Mode Observer Design for State of Charge Estimation of Lithium-ion Cells

Authors: Anirudh Nath, Raghvendra Gupta, Rohit Mehta, Supreet Singh Bahga, Amit Gupta, Shubhendu Bhasin

Abstract: This work investigates the real-time estimation of the state of charge (SoC) of lithium-ion (Li-ion) cells for reliable, safe and efficient utilization. A novel attractive ellipsoid based sliding-mode observer (AESMO) algorithm is designed to estimate the SoC in real time. The algorithm utilizes a standard equivalent circuit model of a Li-ion cell and provides a reliable and efficient SoC estimate in the presence of bounded uncertainties in the battery parameters as well as exogenous disturbances. The theoretical framework of the observer design is not limited to the SoC estimation problem of Li-ion cells but is applicable to a wider class of nonlinear systems with both matched and mismatched uncertainties. The main advantage of the proposed observer is that it provides a fast and optimal SoC estimate based on minimization over the uncertainty bound. The proposed method is experimentally tested and evaluated using hybrid pulse power characterization (HPPC) test and urban dynamometer driving schedule (UDDS) test data, which demonstrate its effectiveness and feasibility.
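The AESMO gain design (attractive-ellipsoid optimization over the uncertainty bound) is beyond a short example. The sketch below instead shows a basic first-order sliding-mode observer on a one-RC equivalent circuit model, to illustrate the general observer structure: a copy of the cell model plus a discontinuous correction driven by the sign of the terminal-voltage error. The linear OCV curve, all cell parameters and the gain are hypothetical, and the parameters are assumed perfectly known, so the bounded uncertainties handled by AESMO are not modelled.

```python
from typing import List, Tuple


def simulate_smo(
    steps: int = 600,
    dt: float = 1.0,               # time step [s]
    current: float = 1.0,          # constant discharge current [A]
    capacity_as: float = 7200.0,   # 2 Ah expressed in ampere-seconds
    r0: float = 0.05, r1: float = 0.02, c1: float = 2000.0,
    ocv_a: float = 3.2, ocv_b: float = 0.9,  # hypothetical OCV(SoC) = a + b*SoC [V]
    gain: float = 0.01,            # sliding-mode gain on the SoC estimate [1/s]
    soc0: float = 0.9, soc_hat0: float = 0.6,
) -> List[Tuple[float, float]]:
    """Euler simulation of a 1-RC cell and a basic sliding-mode SoC observer.

    Returns (true SoC, estimated SoC) after every step.
    """
    soc, v1 = soc0, 0.0
    soc_hat, v1_hat = soc_hat0, 0.0
    history = []
    for _ in range(steps):
        v_meas = ocv_a + ocv_b * soc - v1 - r0 * current
        v_est = ocv_a + ocv_b * soc_hat - v1_hat - r0 * current
        err = v_meas - v_est
        sign = (err > 0) - (err < 0)
        # Plant: Coulomb counting for SoC and RC-branch voltage dynamics.
        soc += dt * (-current / capacity_as)
        v1 += dt * (-v1 / (r1 * c1) + current / c1)
        # Observer: copy of the model plus a discontinuous correction term.
        soc_hat += dt * (-current / capacity_as + gain * sign)
        v1_hat += dt * (-v1_hat / (r1 * c1) + current / c1)
        history.append((soc, soc_hat))
    return history
```

With these values the estimate closes the initial 0.3 SoC gap in about 30 steps and then chatters within roughly `gain * dt` of the true SoC; the chattering band is the usual price of a pure sign-based correction.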

Submitted 16 August, 2020; originally announced August 2020.

Comments: 19 pages, 12 figures, 2 tables

arXiv:2008.05086

Transfer Learning Approaches for Streaming End-to-End Speech Recognition System

Authors: Vikas Joshi, Rui Zhao, Rupesh R. Mehta, Kshitiz Kumar, Jinyu Li

Abstract: Transfer learning (TL) is widely used in conventional hybrid automatic speech recognition (ASR) systems to transfer knowledge from a source to a target language. TL can be applied to end-to-end (E2E) ASR systems such as recurrent neural network transducer (RNN-T) models by initializing the encoder and/or prediction network of the target language with pre-trained models from the source language. In the hybrid ASR system, transfer learning is typically done by initializing the target language acoustic model (AM) with the source language AM. Several transfer learning strategies exist for the RNN-T framework, depending upon the choice of the initialization model for the encoder and prediction networks. This paper presents a comparative study of four different TL methods for the RNN-T framework. We show a 17% relative word error rate reduction with different TL methods over a randomly initialized RNN-T model. We also study the impact of TL with varying amounts of training data, ranging from 50 hours to 1000 hours, and show the efficacy of TL for languages with small amounts of training data.
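A minimal sketch of how the initialization strategies differ, assuming model parameters are stored in a flat dictionary keyed by hypothetical names such as `encoder.w`, `prediction.w` and `joint.out`: choosing which prefixes to copy from the source-language model selects the TL variant. The helper for relative word error rate reduction (WERR), the metric quoted in several abstracts on this page, uses the standard definition (baseline minus new, divided by baseline).

```python
import random
from typing import Dict, Iterable, List

Params = Dict[str, List[float]]


def init_target_model(
    source: Params,
    target_shapes: Dict[str, int],
    transfer_prefixes: Iterable[str],
    seed: int = 0,
) -> Params:
    """Copy selected source-language weights; randomly initialise the rest.

    A parameter is copied only if its name matches a transfer prefix and its
    size matches; the joint network's output layer usually differs in size
    because it depends on the target vocabulary.
    """
    rng = random.Random(seed)
    prefixes = tuple(transfer_prefixes)
    target: Params = {}
    for name, size in target_shapes.items():
        src = source.get(name)
        if name.startswith(prefixes) and src is not None and len(src) == size:
            target[name] = list(src)
        else:
            target[name] = [rng.gauss(0.0, 0.02) for _ in range(size)]
    return target


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """WERR in percent: (baseline - new) / baseline * 100."""
    return (baseline_wer - new_wer) / baseline_wer * 100.0
```

For example, a baseline WER of 20.0 reduced to 16.6 corresponds to a WERR of 17%; these numbers are illustrative only.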

Submitted 17 August, 2020; v1 submitted 11 August, 2020; originally announced August 2020.

arXiv:2005.14262

Uncertainty Evaluation Metric for Brain Tumour Segmentation

Authors: Raghav Mehta, Angelos Filos, Yarin Gal, Tal Arbel

Abstract: In this paper, we develop a metric designed to assess and rank uncertainty measures for the task of brain tumour sub-tissue segmentation in the BraTS 2019 sub-challenge on uncertainty quantification. The metric is designed to: (1) reward uncertainty measures where high confidence is assigned to correct assertions, and where incorrect assertions are assigned low confidence and (2) penalize measures that have higher percentages of under-confident correct assertions. Here, the workings of the components of the metric are explored based on a number of popular uncertainty measures evaluated on the BraTS 2019 dataset.
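A simplified sketch of how such a metric can be computed on a flat list of voxels, following our reading of the description above: voxels whose uncertainty exceeds a threshold are filtered out, Dice is computed on the remaining voxels, and the fractions of filtered true positives and true negatives are tracked; the areas under these three curves are then combined so that a high Dice area is rewarded and filtering of correct voxels is penalized. The uncertainty range of 0 to 100, the five thresholds and the equal-weight combination are assumptions; the official evaluation code linked in the QU-BraTS entry on this page is authoritative.

```python
from typing import List, Sequence, Tuple


def _counts(pred: Sequence[int], gt: Sequence[int]) -> Tuple[int, int]:
    """True positives and true negatives for binary masks."""
    tp = sum(1 for p, g in zip(pred, gt) if p and g)
    tn = sum(1 for p, g in zip(pred, gt) if not p and not g)
    return tp, tn


def _dice(pred: Sequence[int], gt: Sequence[int]) -> float:
    tp, _ = _counts(pred, gt)
    denom = sum(pred) + sum(gt)
    return 1.0 if denom == 0 else 2.0 * tp / denom  # empty vs empty counts as perfect


def _auc(xs: List[float], ys: List[float]) -> float:
    """Trapezoidal area under ys(xs); xs must be ascending."""
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2 for i in range(len(xs) - 1))


def uncertainty_score(pred, gt, uncertainty, thresholds=(0, 25, 50, 75, 100)) -> float:
    """Reward high Dice on confident voxels; penalise filtering correct voxels."""
    tp_all, tn_all = _counts(pred, gt)
    dices, ftp, ftn = [], [], []
    for t in thresholds:
        kept = [i for i, u in enumerate(uncertainty) if u <= t]  # drop voxels above t
        p = [pred[i] for i in kept]
        g = [gt[i] for i in kept]
        tp, tn = _counts(p, g)
        dices.append(_dice(p, g))
        ftp.append((tp_all - tp) / tp_all if tp_all else 0.0)
        ftn.append((tn_all - tn) / tn_all if tn_all else 0.0)
    xs = [t / 100 for t in thresholds]
    return (_auc(xs, dices) + (1 - _auc(xs, ftp)) + (1 - _auc(xs, ftn))) / 3
```

A perfectly segmented case with zero uncertainty everywhere scores 1.0, while marking the same correct voxels as maximally uncertain lowers the score because they are filtered at every threshold below 100.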

Submitted 28 May, 2020; originally announced May 2020.

Report number: MIDL/2019/ExtendedAbstract/H-PvDNIex

arXiv:1908.08074

DUAL-GLOW: Conditional Flow-Based Generative Model for Modality Transfer

Authors: Haoliang Sun, Ronak Mehta, Hao H. Zhou, Zhichun Huang, Sterling C. Johnson, Vivek Prabhakaran, Vikas Singh

Abstract: Positron emission tomography (PET) is an imaging modality for diagnosing a number of neurological diseases. In contrast to Magnetic Resonance Imaging (MRI), PET is costly and involves injecting a radioactive substance into the patient. Motivated by developments in modality transfer in vision, we study the generation of certain types of PET images from MRI data. We derive new flow-based generative models which we show perform well in this small sample size regime (much smaller than dataset sizes available in standard vision tasks). Our formulation, DUAL-GLOW, is based on two invertible networks and a relation network that maps the latent spaces to each other. We discuss how, given the prior distribution, learning the conditional distribution of PET given the MRI image reduces to obtaining the conditional distribution between the two latent codes w.r.t. the two image types. We also extend our framework to leverage 'side' information (or attributes) when available. By controlling the PET generation through 'conditioning' on age, our model is also able to capture brain FDG-PET (hypometabolism) changes as a function of age. We present experiments on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset with 826 subjects, and obtain good performance in PET image synthesis, qualitatively and quantitatively better than recent works.

Submitted 21 August, 2019; originally announced August 2019.

Journal ref: ICCV 2019

arXiv:1906.09426

End-to-End ASR for Code-switched Hindi-English Speech

Authors: Brij Mohan Lal Srivastava, Basil Abraham, Sunayana Sitaram, Rupesh Mehta, Preethi Jyothi

Abstract: End-to-end (E2E) models have been explored for large speech corpora and have been found to match or outperform traditional pipeline-based systems in some languages. However, most prior work on end-to-end models uses speech corpora exceeding hundreds or thousands of hours. In this study, we explore end-to-end models for code-switched Hindi-English speech with less than 50 hours of data. We utilize two specific measures to improve network performance in the low-resource setting, namely multi-task learning (MTL) and balancing the corpus to deal with the inherent class imbalance problem, i.e., the skewed frequency distribution over graphemes. We compare the results of the proposed approaches with traditional, cascaded ASR systems. While the lack of data adversely affects the performance of end-to-end models, we see promising improvements with MTL and balancing the corpus.
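The abstract does not say how the corpus was balanced. One common alternative way to counter a skewed grapheme distribution, shown here instead of whatever resampling scheme the authors may have used, is to weight each output symbol's loss inversely to its frequency. The smoothing constant and the normalization are hypothetical choices, and individual Unicode code points stand in for graphemes.

```python
from collections import Counter
from typing import Dict, Iterable


def inverse_frequency_weights(transcripts: Iterable[str], smoothing: float = 1.0) -> Dict[str, float]:
    """Per-symbol loss weights proportional to 1 / (count + smoothing).

    Weights are rescaled so that their count-weighted mean over the corpus is 1,
    which leaves the overall loss scale roughly unchanged.
    """
    counts = Counter(ch for text in transcripts for ch in text if not ch.isspace())
    raw = {ch: 1.0 / (c + smoothing) for ch, c in counts.items()}
    total = sum(counts.values())
    mean = sum(raw[ch] * c for ch, c in counts.items()) / total
    return {ch: w / mean for ch, w in raw.items()}
```

Rare symbols receive weights above 1 and frequent ones below 1, so each rare-symbol error costs more in the training loss.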

Submitted 22 June, 2019; originally announced June 2019.

Showing 1–15 of 15 results for author: Mehta, R