Showing 1–21 of 21 results for author: Majumdar, S

Searching in archive eess.

  1. arXiv:2406.12946  [pdf]

    eess.AS cs.AI cs.CL cs.LG

    Instruction Data Generation and Unsupervised Adaptation for Speech Language Models

    Authors: Vahid Noroozi, Zhehuai Chen, Somshubra Majumdar, Steve Huang, Jagadeesh Balam, Boris Ginsburg

    Abstract: In this paper, we propose three methods for generating synthetic samples to train and evaluate multimodal large language models capable of processing both text and speech inputs. Addressing the scarcity of samples containing both modalities, synthetic data generation emerges as a crucial strategy to enhance the performance of such systems and facilitate the modeling of cross-modal relationships be…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: Accepted for Interspeech 2024

  2. arXiv:2312.17279  [pdf, other]

    cs.CL eess.AS

    Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition

    Authors: Vahid Noroozi, Somshubra Majumdar, Ankur Kumar, Jagadeesh Balam, Boris Ginsburg

    Abstract: In this paper, we propose an efficient and accurate streaming speech recognition model based on the FastConformer architecture. We adapted the FastConformer architecture for streaming applications through: (1) constraining both the look-ahead and past contexts in the encoder, and (2) introducing an activation caching mechanism to enable the non-autoregressive encoder to operate autoregressively du…

    Submitted 2 May, 2024; v1 submitted 27 December, 2023; originally announced December 2023.

    Comments: Shorter version accepted to ICASSP 2024

  3. arXiv:2309.09950  [pdf, other]

    eess.AS cs.SD

    Investigating End-to-End ASR Architectures for Long Form Audio Transcription

    Authors: Nithin Rao Koluguri, Samuel Kriman, Georgy Zelenfroind, Somshubra Majumdar, Dima Rekesh, Vahid Noroozi, Jagadeesh Balam, Boris Ginsburg

    Abstract: This paper presents an overview and evaluation of some of the end-to-end ASR models on long-form audio. We study three categories of Automatic Speech Recognition (ASR) models based on their core architecture: (1) convolutional, (2) convolutional with squeeze-and-excitation and (3) convolutional models with attention. We selected one ASR model from each category and evaluated Word Error Rate, maxim…

    Submitted 20 September, 2023; v1 submitted 18 September, 2023; originally announced September 2023.

    Comments: PrePrint. Submitted to ICASSP 2024

  4. arXiv:2305.05084  [pdf, other]

    eess.AS cs.SD

    Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition

    Authors: Dima Rekesh, Nithin Rao Koluguri, Samuel Kriman, Somshubra Majumdar, Vahid Noroozi, He Huang, Oleksii Hrinchuk, Krishna Puvvada, Ankur Kumar, Jagadeesh Balam, Boris Ginsburg

    Abstract: Conformer-based models have become the dominant end-to-end architecture for speech processing tasks. With the objective of enhancing the Conformer architecture for efficient training and inference, we carefully redesigned Conformer with a novel downsampling schema. The proposed model, named Fast Conformer (FC), is 2.8x faster than the original Conformer, supports scaling to billion parameters witho…

    Submitted 30 September, 2023; v1 submitted 8 May, 2023; originally announced May 2023.

    Comments: Accepted at ASRU 2023

  5. arXiv:2304.06795  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Efficient Sequence Transduction by Jointly Predicting Tokens and Durations

    Authors: Hainan Xu, Fei Jia, Somshubra Majumdar, He Huang, Shinji Watanabe, Boris Ginsburg

    Abstract: This paper introduces a novel Token-and-Duration Transducer (TDT) architecture for sequence-to-sequence tasks. TDT extends conventional RNN-Transducer architectures by jointly predicting both a token and its duration, i.e. the number of input frames covered by the emitted token. This is achieved by using a joint network with two outputs which are independently normalized to generate distributions…

    Submitted 29 May, 2023; v1 submitted 13 April, 2023; originally announced April 2023.

  6. arXiv:2211.03541  [pdf, other]

    eess.AS cs.LG cs.SD

    Multi-blank Transducers for Speech Recognition

    Authors: Hainan Xu, Fei Jia, Somshubra Majumdar, Shinji Watanabe, Boris Ginsburg

    Abstract: This paper proposes a modification to RNN-Transducer (RNN-T) models for automatic speech recognition (ASR). In standard RNN-T, the emission of a blank symbol consumes exactly one input frame; in our proposed method, we introduce additional blank symbols, which consume two or more input frames when emitted. We refer to the added symbols as big blanks, and the method multi-blank RNN-T. For training…

    Submitted 11 April, 2024; v1 submitted 4 November, 2022; originally announced November 2022.

    Journal ref: ICASSP 2023

  7. arXiv:2210.03255  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Damage Control During Domain Adaptation for Transducer Based Automatic Speech Recognition

    Authors: Somshubra Majumdar, Shantanu Acharya, Vitaly Lavrukhin, Boris Ginsburg

    Abstract: Automatic speech recognition models are often adapted to improve their accuracy in a new domain. A potential drawback of model adaptation to new domains is catastrophic forgetting, where the Word Error Rate on the original domain is significantly degraded. This paper addresses the situation when we want to simultaneously adapt automatic speech recognition models to a new domain and limit the degra…

    Submitted 6 October, 2022; originally announced October 2022.

    Comments: To appear in Proc. SLT 2022, Jan 09-12, 2023, Doha, Qatar

  8. CTC Variations Through New WFST Topologies

    Authors: Aleksandr Laptev, Somshubra Majumdar, Boris Ginsburg

    Abstract: This paper presents novel Weighted Finite-State Transducer (WFST) topologies to implement Connectionist Temporal Classification (CTC)-like algorithms for automatic speech recognition. Three new CTC variants are proposed: (1) the "compact-CTC", in which direct transitions between units are replaced with <epsilon> back-off transitions; (2) the "minimal-CTC", that only adds <blank> self-loops when us…

    Submitted 26 June, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

    Comments: Accepted to Interspeech 2022, 5 pages, 2 figures, 7 tables

  9. arXiv:2107.10708  [pdf, other]

    eess.AS cs.SD

    CarneliNet: Neural Mixture Model for Automatic Speech Recognition

    Authors: Aleksei Kalinov, Somshubra Majumdar, Jagadeesh Balam, Boris Ginsburg

    Abstract: End-to-end automatic speech recognition systems have achieved great accuracy by using deeper and deeper models. However, the increased depth comes with a larger receptive field that can negatively impact model performance in streaming scenarios. We propose an alternative approach that we call Neural Mixture Model. The basic idea is to introduce a parallel mixture of shallow networks instead of a v…

    Submitted 22 July, 2021; originally announced July 2021.

    Comments: Submitted to ASRU 2021

  10. arXiv:2104.02014  [pdf, other]

    cs.CL eess.AS

    SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition

    Authors: Patrick K. O'Neill, Vitaly Lavrukhin, Somshubra Majumdar, Vahid Noroozi, Yuekai Zhang, Oleksii Kuchaiev, Jagadeesh Balam, Yuliya Dovzhenko, Keenan Freyberg, Michael D. Shulman, Boris Ginsburg, Shinji Watanabe, Georg Kucsko

    Abstract: In the English speech-to-text (STT) machine learning task, acoustic models are conventionally trained on uncased Latin characters, and any necessary orthography (such as capitalization, punctuation, and denormalization of non-standard words) is imputed by separate post-processing models. This adds complexity and limits performance, as many formatting tasks benefit from semantic information present…

    Submitted 6 April, 2021; v1 submitted 5 April, 2021; originally announced April 2021.

    Comments: 5 pages, 1 figure. Submitted to INTERSPEECH 2021

  11. arXiv:2104.01721  [pdf, other]

    eess.AS

    Citrinet: Closing the Gap between Non-Autoregressive and Autoregressive End-to-End Models for Automatic Speech Recognition

    Authors: Somshubra Majumdar, Jagadeesh Balam, Oleksii Hrinchuk, Vitaly Lavrukhin, Vahid Noroozi, Boris Ginsburg

    Abstract: We propose Citrinet, a new end-to-end convolutional Connectionist Temporal Classification (CTC) based automatic speech recognition (ASR) model. Citrinet is a deep residual neural model which uses 1D time-channel separable convolutions combined with sub-word encoding and squeeze-and-excitation. The resulting architecture significantly reduces the gap between non-autoregressive and sequence-to-sequen…

    Submitted 4 April, 2021; originally announced April 2021.

  12. arXiv:2010.13886  [pdf, other]

    eess.AS cs.SD

    MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection

    Authors: Fei Jia, Somshubra Majumdar, Boris Ginsburg

    Abstract: We present MarbleNet, an end-to-end neural network for Voice Activity Detection (VAD). MarbleNet is a deep residual network composed from blocks of 1D time-channel separable convolution, batch-normalization, ReLU and dropout layers. When compared to a state-of-the-art VAD model, MarbleNet is able to achieve similar performance with roughly 1/10-th the parameter cost. We further conduct extensive a…

    Submitted 11 February, 2021; v1 submitted 26 October, 2020; originally announced October 2020.

    Comments: Accepted to ICASSP 2021

  13. arXiv:2010.12715  [pdf, other]

    eess.AS

    Improving Noise Robustness of an End-to-End Neural Model for Automatic Speech Recognition

    Authors: Jagadeesh Balam, Jocelyn Huang, Vitaly Lavrukhin, Slyne Deng, Somshubra Majumdar, Boris Ginsburg

    Abstract: We present our experiments in training an end-to-end automatic speech recognition (ASR) model to be robust to noise using intensive data augmentation. We explore the efficacy of fine-tuning a pre-trained model to improve noise robustness, and we find it to be a very efficient way to train for various noisy conditions, especially when the conditions in which the model will be used are unknown. Starting…

    Submitted 23 October, 2020; originally announced October 2020.

  14. arXiv:2004.14003  [pdf, other]

    eess.IV cs.CV

    The International Workshop on Osteoarthritis Imaging Knee MRI Segmentation Challenge: A Multi-Institute Evaluation and Analysis Framework on a Standardized Dataset

    Authors: Arjun D. Desai, Francesco Caliva, Claudia Iriondo, Naji Khosravan, Aliasghar Mortazi, Sachin Jambawalikar, Drew Torigian, Jutta Ellermann, Mehmet Akcakaya, Ulas Bagci, Radhika Tibrewala, Io Flament, Matthew O'Brien, Sharmila Majumdar, Mathias Perslev, Akshay Pai, Christian Igel, Erik B. Dam, Sibaji Gaj, Mingrui Yang, Kunio Nakamura, Xiaojuan Li, Cem M. Deniz, Vladimir Juras, Ravinder Regatte, et al. (4 additional authors not shown)

    Abstract: Purpose: To organize a knee MRI segmentation challenge for characterizing the semantic and clinical efficacy of automatic segmentation methods relevant for monitoring osteoarthritis progression. Methods: A dataset partition consisting of 3D knee MRI from 88 subjects at two timepoints with ground-truth articular (femoral, tibial, patellar) cartilage and meniscus segmentations was standardized. Ch…

    Submitted 26 May, 2020; v1 submitted 29 April, 2020; originally announced April 2020.

    Comments: Submitted to Radiology: Artificial Intelligence; Fixed typos

  15. MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition

    Authors: Somshubra Majumdar, Boris Ginsburg

    Abstract: We present MatchboxNet, an end-to-end neural network for speech command recognition. MatchboxNet is a deep residual network composed from blocks of 1D time-channel separable convolution, batch-normalization, ReLU and dropout layers. MatchboxNet reaches state-of-the-art accuracy on the Google Speech Commands dataset while having significantly fewer parameters than similar models. The small foot…

    Submitted 21 April, 2020; v1 submitted 18 April, 2020; originally announced April 2020.

  16. arXiv:2003.09089  [pdf, other]

    eess.IV cs.CV cs.LG

    Hierarchical Severity Staging of Anterior Cruciate Ligament Injuries using Deep Learning with MRI Images

    Authors: Nikan K. Namiri, Io Flament, Bruno Astuto, Rutwik Shah, Radhika Tibrewala, Francesco Caliva, Thomas M. Link, Valentina Pedoia, Sharmila Majumdar

    Abstract: Purpose: To evaluate the diagnostic utility of two convolutional neural networks (CNNs) for severity staging of anterior cruciate ligament (ACL) injuries. Materials and Methods: This retrospective analysis was conducted on 1243 knee MR images (1008 intact, 18 partially torn, 77 fully torn, and 140 reconstructed ACLs) from 224 patients (age 47 +/- 14 years, 54% women) acquired between 2011 and 20…

    Submitted 13 April, 2020; v1 submitted 19 March, 2020; originally announced March 2020.

  17. arXiv:2002.10591  [pdf, other]

    eess.IV cs.CV

    Deep learning predicts total knee replacement from magnetic resonance images

    Authors: Aniket A. Tolpadi, Jinhee J. Lee, Valentina Pedoia, Sharmila Majumdar

    Abstract: Knee Osteoarthritis (OA) is a common musculoskeletal disorder in the United States. When diagnosed at early stages, lifestyle interventions such as exercise and weight loss can slow OA progression, but at later stages, only an invasive option is available: total knee replacement (TKR). Though a generally successful procedure, only 2/3 of patients who undergo the procedure report their knees feelin…

    Submitted 24 February, 2020; originally announced February 2020.

    Comments: 18 pages, 5 figures (4 in main article, 1 supplemental), 8 tables (5 in main article, 3 supplemental). Submitted to Scientific Reports and currently in revision

    ACM Class: I.4.9

  18. arXiv:1909.06326  [pdf, other]

    q-bio.QM cs.CV cs.LG eess.IV physics.med-ph

    Automatic Hip Fracture Identification and Functional Subclassification with Deep Learning

    Authors: Justin D Krogue, Kaiyang V Cheng, Kevin M Hwang, Paul Toogood, Eric G Meinberg, Erik J Geiger, Musa Zaid, Kevin C McGill, Rina Patel, Jae Ho Sohn, Alexandra Wright, Bryan F Darger, Kevin A Padrez, Eugene Ozhinsky, Sharmila Majumdar, Valentina Pedoia

    Abstract: Purpose: Hip fractures are a common cause of morbidity and mortality. Automatic identification and classification of hip fractures using deep learning may improve outcomes by reducing diagnostic errors and decreasing time to operation. Methods: Hip and pelvic radiographs from 1118 studies were reviewed and 3034 hips were labeled via bounding boxes and classified as normal, displaced femoral neck f…

    Submitted 10 September, 2019; originally announced September 2019.

    Comments: Presented at Orthopaedic Research Society, Austin, TX, Feb 2, 2019, currently in submission for publication

  19. arXiv:1908.03679  [pdf, other]

    eess.IV cs.CV cs.LG

    Distance Map Loss Penalty Term for Semantic Segmentation

    Authors: Francesco Caliva, Claudia Iriondo, Alejandro Morales Martinez, Sharmila Majumdar, Valentina Pedoia

    Abstract: Convolutional neural networks for semantic segmentation suffer from low performance at object boundaries. In medical imaging, accurate representation of tissue surfaces and volumes is important for tracking of disease biomarkers such as tissue morphology and shape features. In this work, we propose a novel distance map derived loss penalty term for semantic segmentation. We propose to use distance…

    Submitted 9 August, 2019; originally announced August 2019.

    Comments: Medical Imaging with Deep Learning (MIDL2019) Conference [arXiv:1907.08612], Extended Abstract

    Report number: MIDL/2019/ExtendedAbstract/B1eIcvS45V

  20. arXiv:1812.07729  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Pathological Voice Classification Using Mel-Cepstrum Vectors and Support Vector Machine

    Authors: Maryam Pishgar, Fazle Karim, Somshubra Majumdar, Houshang Darabi

    Abstract: Vocal disorders have affected several patients all over the world. Due to the inherent difficulty of diagnosing vocal disorders without sophisticated equipment and trained personnel, a number of patients remain undiagnosed. To alleviate the monetary cost of diagnosis, there has been a recent growth in the use of data analysis to accurately detect and diagnose individuals for a fraction of the cost…

    Submitted 18 December, 2018; originally announced December 2018.

    Comments: Accepted at IEEE BigData 2018 Workshop - FEMH Voice Data Challenge

  21. arXiv:1707.01567  [pdf, other]

    eess.SY

    Adaptive Estimation for Nonlinear Systems using Reproducing Kernel Hilbert Spaces

    Authors: Parag Bobade, Suprotim Majumdar, Savio Pereira, Andrew J. Kurdila, John B. Ferris

    Abstract: This paper extends a conventional, general framework for online adaptive estimation problems for systems governed by unknown nonlinear ordinary differential equations. The central feature of the theory introduced in this paper represents the unknown function as a member of a reproducing kernel Hilbert space (RKHS) and defines a distributed parameter system (DPS) that governs state estimates and es…

    Submitted 10 July, 2017; v1 submitted 5 July, 2017; originally announced July 2017.

    Comments: 24 pages, Submitted to CMAME

    MSC Class: 68T05; 93C41; 93C15; 68T30; 93C40