Showing 1–21 of 21 results for author: Shah, R R

Searching in archive eess.

  1. arXiv:2406.08802

    eess.AS cs.SD

    DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing

    Authors: Neha Sahipjohn, Ashishkumar Gudmalwar, Nirmesh Shah, Pankaj Wasnik, Rajiv Ratn Shah

    Abstract: Audio-visual alignment after dubbing is a challenging research problem. To this end, we propose a novel method, DubWise Multi-modal Large Language Model (LLM)-based Text-to-Speech (TTS), which can control the speech duration of synthesized speech in such a way that it aligns well with the speaker's lip movements given in the reference video even when the spoken text is different or in a different l…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted at INTERSPEECH 2024

  2. arXiv:2406.08076

    eess.AS cs.SD

    VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech

    Authors: Ashishkumar Gudmalwar, Nirmesh Shah, Sai Akarsh, Pankaj Wasnik, Rajiv Ratn Shah

    Abstract: Despite the significant advancements in Text-to-Speech (TTS) systems, their full utilization in automatic dubbing remains limited. This task necessitates the extraction of voice identity and emotional style from a reference speech in a source language and subsequently transferring them to a target language using cross-lingual TTS techniques. While previous approaches have mainly concentrated on co…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted at INTERSPEECH 2024

  3. arXiv:2403.15469

    cs.CL cs.LG eess.AS

    Isometric Neural Machine Translation using Phoneme Count Ratio Reward-based Reinforcement Learning

    Authors: Shivam Ratnakant Mhaskar, Nirmesh J. Shah, Mohammadi Zaki, Ashishkumar P. Gudmalwar, Pankaj Wasnik, Rajiv Ratn Shah

    Abstract: The traditional Automatic Video Dubbing (AVD) pipeline consists of three key modules, namely, Automatic Speech Recognition (ASR), Neural Machine Translation (NMT), and Text-to-Speech (TTS). Within AVD pipelines, isometric-NMT algorithms are employed to regulate the length of the synthesized output text. This is done to guarantee synchronization with respect to the alignment of video and audio subseque…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: Accepted in NAACL 2024 Findings

  4. arXiv:2310.05078

    eess.AS cs.SD

    Partial Rank Similarity Minimization Method for Quality MOS Prediction of Unseen Speech Synthesis Systems in Zero-Shot and Semi-supervised setting

    Authors: Hemant Yadav, Erica Cooper, Junichi Yamagishi, Sunayana Sitaram, Rajiv Ratn Shah

    Abstract: This paper introduces a novel objective function for quality mean opinion score (MOS) prediction of unseen speech synthesis systems. The proposed function measures the similarity of relative positions of predicted MOS values, in a mini-batch, rather than the actual MOS values. That is, the partial rank similarity (PRS) is measured rather than the individual MOS values, as with the L1 loss. Our exper…

    Submitted 8 October, 2023; originally announced October 2023.

    Comments: Accepted to ASRU 2023

  5. arXiv:2303.06982

    cs.SD eess.AS

    Analysing the Masked predictive coding training criterion for pre-training a Speech Representation Model

    Authors: Hemant Yadav, Sunayana Sitaram, Rajiv Ratn Shah

    Abstract: Recent developments in pre-trained speech representation utilizing self-supervised learning (SSL) have yielded exceptional results on a variety of downstream tasks. One such technique, known as masked predictive coding (MPC), has been employed by some of the most high-performing models. In this study, we investigate the impact of MPC loss on the type of information learnt at various layers in the…

    Submitted 11 January, 2024; v1 submitted 13 March, 2023; originally announced March 2023.

  6. arXiv:2211.14700

    cs.CL eess.AS

    A novel multimodal dynamic fusion network for disfluency detection in spoken utterances

    Authors: Sreyan Ghosh, Utkarsh Tyagi, Sonal Kumar, Manan Suri, Rajiv Ratn Shah

    Abstract: Disfluency, though originating from human spoken utterances, is primarily studied as a uni-modal text-based Natural Language Processing (NLP) task. Based on early-fusion and self-attention-based multimodal interaction between text and acoustic modalities, in this paper, we propose a novel multimodal architecture for disfluency detection from individual utterances. Our architecture leverages a mult…

    Submitted 26 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023. arXiv admin note: text overlap with arXiv:2203.16794

  7. arXiv:2203.16028

    cs.CL cs.MM cs.SD eess.AS

    Span Classification with Structured Information for Disfluency Detection in Spoken Utterances

    Authors: Sreyan Ghosh, Sonal Kumar, Yaman Kumar Singla, Rajiv Ratn Shah, S. Umesh

    Abstract: Existing approaches in disfluency detection focus on solving a token-level classification task for identifying and removing disfluencies in text. Moreover, most works focus on leveraging only contextual information captured by the linear sequences in text, thus ignoring the structured information in text which is efficiently captured by dependency trees. In this paper, building on the span classif…

    Submitted 18 April, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

  8. arXiv:2111.15156

    cs.CL cs.SD eess.AS

    Automated Speech Scoring System Under The Lens: Evaluating and interpreting the linguistic cues for language proficiency

    Authors: Pakhi Bamdev, Manraj Singh Grover, Yaman Kumar Singla, Payman Vafaee, Mika Hama, Rajiv Ratn Shah

    Abstract: English proficiency assessments have become a necessary metric for filtering and selecting prospective candidates for both academia and industry. With the rise in demand for such assessments, it has become increasingly necessary to have automated human-interpretable results to prevent inconsistencies and ensure meaningful feedback to second language learners. Feature-based classical approa…

    Submitted 30 November, 2021; originally announced November 2021.

    Comments: Accepted for publication in the International Journal of Artificial Intelligence in Education (IJAIED)

  9. arXiv:2110.09264

    cs.CL cs.SD eess.AS

    Intent Classification Using Pre-trained Language Agnostic Embeddings For Low Resource Languages

    Authors: Hemant Yadav, Akshat Gupta, Sai Krishna Rallabandi, Alan W Black, Rajiv Ratn Shah

    Abstract: Building Spoken Language Understanding (SLU) systems that do not rely on language-specific Automatic Speech Recognition (ASR) is an important yet less explored problem in language processing. In this paper, we present a comparative study aimed at employing a pre-trained acoustic model to perform SLU in low-resource scenarios. Specifically, we use three different embeddings extracted using Allosaur…

    Submitted 18 April, 2022; v1 submitted 18 October, 2021; originally announced October 2021.

  10. arXiv:2110.07592

    cs.CL cs.SD eess.AS

    DeToxy: A Large-Scale Multimodal Dataset for Toxicity Classification in Spoken Utterances

    Authors: Sreyan Ghosh, Samden Lepcha, S Sakshi, Rajiv Ratn Shah, S. Umesh

    Abstract: Toxic speech, also known as hate speech, is regarded as one of the crucial issues plaguing online social media today. Most recent work on toxic speech detection is constrained to the modality of text and written conversations with very limited work on toxicity detection from spoken utterances or using the modality of speech. In this paper, we introduce a new dataset DeToxy, the first publicly avai…

    Submitted 4 April, 2022; v1 submitted 14 October, 2021; originally announced October 2021.

    Comments: Submitted to Interspeech 2022

  11. arXiv:2109.00928

    eess.AS cs.CL cs.LG cs.SD

    Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring

    Authors: Yaman Kumar Singla, Avykat Gupta, Shaurya Bagga, Changyou Chen, Balaji Krishnamurthy, Rajiv Ratn Shah

    Abstract: Automatic Speech Scoring (ASS) is the computer-assisted evaluation of a candidate's speaking proficiency in a language. ASS systems face many challenges like open grammar, variable pronunciations, and unstructured or semi-structured content. Recent deep learning approaches have shown some promise in this domain. However, most of these approaches focus on extracting features from a single audio, ma…

    Submitted 30 August, 2021; originally announced September 2021.

    Comments: Published in CIKM 2021

  12. arXiv:2101.00387

    cs.CL cs.LG cs.SD eess.AS

    What all do audio transformer models hear? Probing Acoustic Representations for Language Delivery and its Structure

    Authors: Jui Shah, Yaman Kumar Singla, Changyou Chen, Rajiv Ratn Shah

    Abstract: In recent times, BERT-based transformer models have become an inseparable part of the 'tech stack' of text processing models. Similar progress is being observed in the speech domain with a multitude of models observing state-of-the-art results by using audio transformer models to encode speech. This raises the question of what these audio transformer models are learning. Moreover, although the stand…

    Submitted 12 July, 2021; v1 submitted 2 January, 2021; originally announced January 2021.

  13. arXiv:2010.16078

    cs.CV eess.IV

    LIFI: Towards Linguistically Informed Frame Interpolation

    Authors: Aradhya Neeraj Mathur, Devansh Batra, Yaman Kumar, Rajiv Ratn Shah, Roger Zimmermann

    Abstract: In this work, we explore a new problem of frame interpolation for speech videos. Such content today forms the major form of online communication. We try to solve this problem by using several deep learning video generation algorithms to generate the missing frames. We also provide examples where computer vision models despite showing high performance on conventional non-linguistic metrics fail to…

    Submitted 2 December, 2020; v1 submitted 30 October, 2020; originally announced October 2020.

    Comments: 9 pages, 7 tables, 4 figures

  14. arXiv:2006.08599

    cs.CL cs.SD eess.AS

    "Notic My Speech" -- Blending Speech Patterns With Multimedia

    Authors: Dhruva Sahrawat, Yaman Kumar, Shashwat Aggarwal, Yifang Yin, Rajiv Ratn Shah, Roger Zimmermann

    Abstract: Speech as a natural signal is composed of three parts - visemes (visual part of speech), phonemes (spoken part of speech), and language (the imposed structure). However, video as a medium for the delivery of speech and a multimedia construct has mostly ignored the cognitive aspects of speech delivery. For example, video applications like transcoding and compression have till now ignored the fact h…

    Submitted 12 June, 2020; originally announced June 2020.

    Comments: Under Review

  15. arXiv:2006.05236

    cs.SD cs.CL eess.AS

    audino: A Modern Annotation Tool for Audio and Speech

    Authors: Manraj Singh Grover, Pakhi Bamdev, Ratin Kumar Brala, Yaman Kumar, Mika Hama, Rajiv Ratn Shah

    Abstract: In this paper, we introduce a collaborative and modern annotation tool for audio and speech: audino. The tool allows annotators to define and describe temporal segmentation in audios. These segments can be labelled and transcribed easily using a dynamically generated form. An admin can centrally control user roles and project assignment through the admin dashboard. The dashboard also enables descr…

    Submitted 28 November, 2021; v1 submitted 9 June, 2020; originally announced June 2020.

  16. arXiv:2005.11184

    cs.CL cs.LG cs.SD eess.AS

    End-to-end Named Entity Recognition from English Speech

    Authors: Hemant Yadav, Sreyan Ghosh, Yi Yu, Rajiv Ratn Shah

    Abstract: Named entity recognition (NER) from text has been a widely studied problem and usually extracts semantic information from text. Until now, NER from speech is mostly studied in a two-step pipeline process that includes first applying an automatic speech recognition (ASR) system on an audio sample and then passing the predicted transcript to a NER tagger. In such cases, the error does not propagate…

    Submitted 22 May, 2020; originally announced May 2020.

    Comments: Submitted to Interspeech 2020

  17. arXiv:2005.08182

    cs.CL cs.SD eess.AS

    Multi-modal Automated Speech Scoring using Attention Fusion

    Authors: Manraj Singh Grover, Yaman Kumar, Sumit Sarin, Payman Vafaee, Mika Hama, Rajiv Ratn Shah

    Abstract: In this study, we propose a novel multi-modal end-to-end neural approach for automated assessment of non-native English speakers' spontaneous speech using attention fusion. The pipeline employs Bi-directional Recurrent Convolutional Neural Networks and Bi-directional Long Short-Term Memory Neural Networks to encode acoustic and lexical cues from spectrograms and transcriptions, respectively. Atten…

    Submitted 28 November, 2021; v1 submitted 17 May, 2020; originally announced May 2020.

  18. arXiv:1911.12152

    eess.SP cs.LG

    Universal EEG Encoder for Learning Diverse Intelligent Tasks

    Authors: Baani Leen Kaur Jolly, Palash Aggrawal, Surabhi S Nath, Viresh Gupta, Manraj Singh Grover, Rajiv Ratn Shah

    Abstract: Brain Computer Interfaces (BCI) have become very popular with Electroencephalography (EEG) being one of the most commonly used signal acquisition techniques. A major challenge in BCI studies is the individualistic analysis required for each task. Thus, task-specific feature extraction and classification are performed, which fails to generalize to other tasks with similar time-series EEG input data…

    Submitted 26 November, 2019; originally announced November 2019.

  19. arXiv:1911.11378

    cs.LG cs.CV cs.MM eess.IV stat.ML

    Text2FaceGAN: Face Generation from Fine Grained Textual Descriptions

    Authors: Osaid Rehman Nasir, Shailesh Kumar Jha, Manraj Singh Grover, Yi Yu, Ajit Kumar, Rajiv Ratn Shah

    Abstract: Powerful generative adversarial networks (GAN) have been developed to automatically synthesize realistic images from text. However, most existing tasks are limited to generating simple images such as flowers from captions. In this work, we extend this problem to the less addressed domain of face generation from fine-grained textual descriptions of face, e.g., "A person has curly hair, oval face, a…

    Submitted 26 November, 2019; originally announced November 2019.

  20. arXiv:1907.01367

    eess.AS cs.LG cs.SD stat.ML

    Lipper: Synthesizing Thy Speech using Multi-View Lipreading

    Authors: Yaman Kumar, Rohit Jain, Khwaja Mohd. Salik, Rajiv Ratn Shah, Yifang Yin, Roger Zimmermann

    Abstract: Lipreading has a lot of potential applications such as in the domain of surveillance and video conferencing. Despite this, most of the work in building lipreading systems has been limited to classifying silent videos into classes representing text phrases. However, there are multiple problems associated with making lipreading a text-based classification task like its dependence on a particular lan…

    Submitted 28 June, 2019; originally announced July 2019.

    Comments: Accepted at AAAI 2019

  21. Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed

    Authors: Yaman Kumar, Mayank Aggarwal, Pratham Nawal, Shin'ichi Satoh, Rajiv Ratn Shah, Roger Zimmermann

    Abstract: Speechreading or lipreading is the technique of understanding and getting phonetic features from a speaker's visual features such as movement of lips, face, teeth and tongue. It has a wide range of multimedia applications such as in surveillance, Internet telephony, and as an aid to a person with hearing impairments. However, most of the work in speechreading has been limited to text generation fr…

    Submitted 12 August, 2018; v1 submitted 2 July, 2018; originally announced July 2018.

    Comments: 2018 ACM Multimedia Conference (MM '18), October 22–26, 2018, Seoul, Republic of Korea