Showing 1–28 of 28 results for author: Raj, D

Searching in archive eess.
  1. arXiv:2402.08932  [pdf, other]

    eess.AS cs.SD

    Listening to Multi-talker Conversations: Modular and End-to-end Perspectives

    Authors: Desh Raj

    Abstract: Since the first speech recognition systems were built more than 30 years ago, improvement in voice technology has enabled applications such as smart assistants and automated customer support. However, conversation intelligence of the future requires recognizing free-flowing multi-party conversations, which is a crucial and challenging component that still remains unsolved. In this dissertation, we…

    Submitted 13 February, 2024; originally announced February 2024.

    Comments: Ph.D. dissertation

  2. arXiv:2401.15676  [pdf, other]

    eess.AS cs.SD

    On Speaker Attribution with SURT

    Authors: Desh Raj, Matthew Wiesner, Matthew Maciejewski, Leibny Paola Garcia-Perera, Daniel Povey, Sanjeev Khudanpur

    Abstract: The Streaming Unmixing and Recognition Transducer (SURT) has recently become a popular framework for continuous, streaming, multi-talker speech recognition (ASR). With advances in architecture, objectives, and mixture simulation methods, it was demonstrated that SURT can be an efficient streaming method for speaker-agnostic transcription of real meetings. In this work, we push this framework furth…

    Submitted 28 January, 2024; originally announced January 2024.

    Comments: 8 pages, 6 figures, 6 tables. Submitted to Odyssey 2024

  3. arXiv:2309.15796  [pdf, other]

    eess.AS cs.CL cs.LG

    Learning from Flawed Data: Weakly Supervised Automatic Speech Recognition

    Authors: Dongji Gao, Hainan Xu, Desh Raj, Leibny Paola Garcia Perera, Daniel Povey, Sanjeev Khudanpur

    Abstract: Training automatic speech recognition (ASR) systems requires large amounts of well-curated paired data. However, human annotators usually perform "non-verbatim" transcription, which can result in poorly trained models. In this paper, we propose Omni-temporal Classification (OTC), a novel training criterion that explicitly incorporates label uncertainties originating from such weak supervision. Thi…

    Submitted 26 September, 2023; originally announced September 2023.

  4. arXiv:2309.15013  [pdf, other]

    cs.CL cs.SD eess.AS

    Updated Corpora and Benchmarks for Long-Form Speech Recognition

    Authors: Jennifer Drexler Fox, Desh Raj, Natalie Delworth, Quinn McNamara, Corey Miller, Migüel Jetté

    Abstract: The vast majority of ASR research uses corpora in which both the training and test data have been pre-segmented into utterances. In most real-world ASR use-cases, however, test audio is not segmented, leading to a mismatch between inference-time conditions and models trained on segmented utterances. In this paper, we re-release three standard ASR corpora - TED-LIUM 3, GigaSpeech, and VoxPopuli-en -…

    Submitted 26 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 2024

  5. arXiv:2309.09546  [pdf, other]

    eess.AS cs.CL cs.SD

    Training dynamic models using early exits for automatic speech recognition on resource-constrained devices

    Authors: George August Wright, Umberto Cappellazzo, Salah Zaiem, Desh Raj, Lucas Ondel Yang, Daniele Falavigna, Mohamed Nabih Ali, Alessio Brutti

    Abstract: The ability to dynamically adjust the computational load of neural models during inference is crucial for on-device processing scenarios characterised by limited and time-varying computational resources. A promising solution is presented by early-exit architectures, in which additional exit branches are appended to intermediate layers of the encoder. In self-attention models for automatic speech r…

    Submitted 22 February, 2024; v1 submitted 18 September, 2023; originally announced September 2023.

    Comments: Accepted at the ICASSP Workshop Self-supervision in Audio, Speech and Beyond 2024

  6. arXiv:2306.13734  [pdf, other]

    eess.AS cs.CL cs.SD

    The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multiple Devices in Diverse Scenarios

    Authors: Samuele Cornell, Matthew Wiesner, Shinji Watanabe, Desh Raj, Xuankai Chang, Paola Garcia, Matthew Maciejewski, Yoshiki Masuyama, Zhong-Qiu Wang, Stefano Squartini, Sanjeev Khudanpur

    Abstract: The CHiME challenges have played a significant role in the development and evaluation of robust automatic speech recognition (ASR) systems. We introduce the CHiME-7 distant ASR (DASR) task, within the 7th CHiME challenge. This task comprises joint ASR and diarization in far-field settings with multiple, and possibly heterogeneous, recording devices. Different from previous challenges, we evaluate…

    Submitted 14 July, 2023; v1 submitted 23 June, 2023; originally announced June 2023.

  7. arXiv:2306.10559  [pdf, other]

    eess.AS cs.SD

    SURT 2.0: Advances in Transducer-based Multi-talker Speech Recognition

    Authors: Desh Raj, Daniel Povey, Sanjeev Khudanpur

    Abstract: The Streaming Unmixing and Recognition Transducer (SURT) model was proposed recently as an end-to-end approach for continuous, streaming, multi-talker speech recognition (ASR). Despite impressive results on multi-turn meetings, SURT has notable limitations: (i) it suffers from leakage and omission related errors; (ii) it is computationally expensive, due to which it has not seen adoption in academ…

    Submitted 19 September, 2023; v1 submitted 18 June, 2023; originally announced June 2023.

    Comments: 13 pages, 7 figures. To appear in IEEE TASLP. Project webpage: https://sites.google.com/view/surt2

  8. arXiv:2212.05271  [pdf, other]

    eess.AS cs.SD

    GPU-accelerated Guided Source Separation for Meeting Transcription

    Authors: Desh Raj, Daniel Povey, Sanjeev Khudanpur

    Abstract: Guided source separation (GSS) is a type of target-speaker extraction method that relies on pre-computed speaker activities and blind source separation to perform front-end enhancement of overlapped speech signals. It was first proposed during the CHiME-5 challenge and provided significant improvements over the delay-and-sum beamforming baseline. Despite its strengths, however, the method has seen…

    Submitted 13 August, 2023; v1 submitted 10 December, 2022; originally announced December 2022.

    Comments: 7 pages, 4 figures. To appear at InterSpeech 2023. Code available at https://github.com/desh2608/gss

  9. arXiv:2211.00482  [pdf, other]

    eess.AS cs.SD

    Adapting self-supervised models to multi-talker speech recognition using speaker embeddings

    Authors: Zili Huang, Desh Raj, Paola García, Sanjeev Khudanpur

    Abstract: Self-supervised learning (SSL) methods which learn representations of data without explicit supervision have gained popularity in speech-processing tasks, particularly for single-talker applications. However, these models often have degraded performance for multi-talker scenarios -- possibly due to the domain mismatch -- which severely limits their use for such applications. In this paper, we inve…

    Submitted 1 November, 2022; originally announced November 2022.

    Comments: submitted to ICASSP 2023

  10. arXiv:2210.11588  [pdf, other]

    eess.AS cs.SD

    Anchored Speech Recognition with Neural Transducers

    Authors: Desh Raj, Junteng Jia, Jay Mahadeokar, Chunyang Wu, Niko Moritz, Xiaohui Zhang, Ozlem Kalinli

    Abstract: Neural transducers have achieved human level performance on standard speech recognition benchmarks. However, their performance significantly degrades in the presence of cross-talk, especially when the primary speaker has a low signal-to-noise ratio. Anchored speech recognition refers to a class of methods that use information from an anchor segment (e.g., wake-words) to recognize device-directed s…

    Submitted 29 March, 2023; v1 submitted 20 October, 2022; originally announced October 2022.

    Comments: To appear at IEEE ICASSP 2023

  11. arXiv:2208.10320  [pdf, other]

    eess.IV cs.CV cs.LG

    Optimising Chest X-Rays for Image Analysis by Identifying and Removing Confounding Factors

    Authors: Shahab Aslani, Watjana Lilaonitkul, Vaishnavi Gnanananthan, Divya Raj, Bojidar Rangelov, Alexandra L Young, Yipeng Hu, Paul Taylor, Daniel C Alexander, Joseph Jacob

    Abstract: During the COVID-19 pandemic, the sheer volume of imaging performed in an emergency setting for COVID-19 diagnosis has resulted in a wide variability of clinical CXR acquisitions. This variation is seen in the CXR projections used, image annotations added and in the inspiratory effort and degree of rotation of clinical images. The image analysis community has attempted to ease the burden on overst…

    Submitted 22 August, 2022; originally announced August 2022.

  12. arXiv:2204.02306  [pdf, other]

    eess.AS

    Low-Latency Speech Separation Guided Diarization for Telephone Conversations

    Authors: Giovanni Morrone, Samuele Cornell, Desh Raj, Luca Serafini, Enrico Zovato, Alessio Brutti, Stefano Squartini

    Abstract: In this paper, we carry out an analysis on the use of speech separation guided diarization (SSGD) in telephone conversations. SSGD performs diarization by separating the speakers' signals and then applying voice activity detection on each estimated speaker signal. In particular, we compare two low-latency speech separation models. Moreover, we show a post-processing algorithm that significantly red…

    Submitted 27 October, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

    Comments: Accepted for Presentation at IEEE Spoken Language Technology Workshop (SLT) 2022

  13. arXiv:2110.04863  [pdf, other]

    eess.AS cs.CL

    Injecting Text and Cross-lingual Supervision in Few-shot Learning from Self-Supervised Models

    Authors: Matthew Wiesner, Desh Raj, Sanjeev Khudanpur

    Abstract: Self-supervised model pre-training has recently garnered significant interest, but relatively few efforts have explored using additional resources in fine-tuning these models. We demonstrate how universal phoneset acoustic models can leverage cross-lingual supervision to improve transfer of pretrained self-supervised representations to new languages. We also show how target-language text can be us…

    Submitted 10 October, 2021; originally announced October 2021.

    Comments: © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

  14. arXiv:2109.08555  [pdf, other]

    eess.AS cs.SD

    Continuous Streaming Multi-Talker ASR with Dual-path Transducers

    Authors: Desh Raj, Liang Lu, Zhuo Chen, Yashesh Gaur, Jinyu Li

    Abstract: Streaming recognition of multi-talker conversations has so far been evaluated only for 2-speaker single-turn sessions. In this paper, we investigate it for multi-turn meetings containing multiple speakers using the Streaming Unmixing and Recognition Transducer (SURT) model, and show that naively extending the single-turn model to this harder setting incurs a performance penalty. As a solution, we…

    Submitted 22 January, 2022; v1 submitted 17 September, 2021; originally announced September 2021.

    Comments: Accepted for publication at IEEE ICASSP 2022

  15. arXiv:2108.03342  [pdf, other]

    eess.AS

    Target-speaker Voice Activity Detection with Improved I-Vector Estimation for Unknown Number of Speaker

    Authors: Maokui He, Desh Raj, Zili Huang, Jun Du, Zhuo Chen, Shinji Watanabe

    Abstract: Target-speaker voice activity detection (TS-VAD) has recently shown promising results for speaker diarization on highly overlapped speech. However, the original model requires a fixed (and known) number of speakers, which limits its application to real conversations. In this paper, we extend TS-VAD to speaker diarization with unknown numbers of speakers. This is achieved by two steps: first, an in…

    Submitted 6 August, 2021; originally announced August 2021.

  16. arXiv:2104.01954  [pdf, other]

    eess.AS cs.SD

    Reformulating DOVER-Lap Label Mapping as a Graph Partitioning Problem

    Authors: Desh Raj, Sanjeev Khudanpur

    Abstract: We recently proposed DOVER-Lap, a method for combining overlap-aware speaker diarization system outputs. DOVER-Lap improved upon its predecessor DOVER by using a label mapping method based on globally-informed greedy search. In this paper, we analyze this label mapping in the framework of a maximum orthogonal graph partitioning problem, and present three inferences. First, we show that DOVER-Lap l…

    Submitted 3 June, 2021; v1 submitted 5 April, 2021; originally announced April 2021.

    Comments: 5 pages, 3 figures. Accepted at INTERSPEECH 2021

  17. arXiv:2102.01363  [pdf, other]

    eess.AS cs.CL cs.SD

    The Hitachi-JHU DIHARD III System: Competitive End-to-End Neural Diarization and X-Vector Clustering Systems Combined by DOVER-Lap

    Authors: Shota Horiguchi, Nelson Yalta, Paola Garcia, Yuki Takashima, Yawen Xue, Desh Raj, Zili Huang, Yusuke Fujita, Shinji Watanabe, Sanjeev Khudanpur

    Abstract: This paper provides a detailed description of the Hitachi-JHU system that was submitted to the Third DIHARD Speech Diarization Challenge. The system outputs the ensemble results of the five subsystems: two x-vector-based subsystems, two end-to-end neural diarization-based subsystems, and one hybrid subsystem. We refine each system and all five subsystems become competitive and complementary. After…

    Submitted 2 February, 2021; originally announced February 2021.

  18. arXiv:2011.02900  [pdf, other]

    eess.AS cs.SD

    Multi-class Spectral Clustering with Overlaps for Speaker Diarization

    Authors: Desh Raj, Zili Huang, Sanjeev Khudanpur

    Abstract: This paper describes a method for overlap-aware speaker diarization. Given an overlap detector and a speaker embedding extractor, our method performs spectral clustering of segments informed by the output of the overlap detector. This is achieved by transforming the discrete clustering problem into a convex optimization problem which is solved by eigen-decomposition. Thereafter, we discretize the…

    Submitted 5 November, 2020; originally announced November 2020.

    Comments: Accepted at IEEE SLT 2021

  19. arXiv:2011.02090  [pdf, other]

    eess.AS cs.SD

    Frustratingly Easy Noise-aware Training of Acoustic Models

    Authors: Desh Raj, Jesus Villalba, Daniel Povey, Sanjeev Khudanpur

    Abstract: Environmental noises and reverberation have a detrimental effect on the performance of automatic speech recognition (ASR) systems. Multi-condition training of neural network-based acoustic models is used to deal with this problem, but it requires many-folds data augmentation, resulting in increased training time. In this paper, we propose utterance-level noise vectors for noise-aware training of a…

    Submitted 2 February, 2021; v1 submitted 3 November, 2020; originally announced November 2020.

    Comments: 6 + 3 (Appendix) pages

  20. arXiv:2011.02014  [pdf, other]

    eess.AS cs.SD

    Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis

    Authors: Desh Raj, Pavel Denisov, Zhuo Chen, Hakan Erdogan, Zili Huang, Maokui He, Shinji Watanabe, Jun Du, Takuya Yoshioka, Yi Luo, Naoyuki Kanda, Jinyu Li, Scott Wisdom, John R. Hershey

    Abstract: Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation. With technical advances in systems dealing with speech separation, speaker diarization, and automatic speech recognition (ASR) in the last decade, it has become possible to build pipelines that achieve reasonable error rates on this task. In this pape…

    Submitted 3 November, 2020; originally announced November 2020.

    Comments: Accepted to IEEE SLT 2021

  21. arXiv:2011.01997  [pdf, other]

    eess.AS cs.SD

    DOVER-Lap: A Method for Combining Overlap-aware Diarization Outputs

    Authors: Desh Raj, Leibny Paola Garcia-Perera, Zili Huang, Shinji Watanabe, Daniel Povey, Andreas Stolcke, Sanjeev Khudanpur

    Abstract: Several advances have been made recently towards handling overlapping speech for speaker diarization. Since speech and natural language tasks often benefit from ensemble techniques, we propose an algorithm for combining outputs from such diarization systems through majority voting. Our method, DOVER-Lap, is inspired from the recently proposed DOVER algorithm, but is designed to handle overlapping…

    Submitted 3 November, 2020; originally announced November 2020.

    Comments: Accepted to IEEE SLT 2021

  22. arXiv:2006.07898  [pdf, other]

    eess.AS cs.SD

    The JHU Multi-Microphone Multi-Speaker ASR System for the CHiME-6 Challenge

    Authors: Ashish Arora, Desh Raj, Aswin Shanmugam Subramanian, Ke Li, Bar Ben-Yair, Matthew Maciejewski, Piotr Żelasko, Paola García, Shinji Watanabe, Sanjeev Khudanpur

    Abstract: This paper summarizes the JHU team's efforts in tracks 1 and 2 of the CHiME-6 challenge for distant multi-microphone conversational speech diarization and recognition in everyday home environments. We explore multi-array processing techniques at each stage of the pipeline, such as multi-array guided source separation (GSS) for enhancement and acoustic model training data, posterior fusion for spee…

    Submitted 14 June, 2020; originally announced June 2020.

    Comments: Presented at the CHiME-6 workshop (colocated with ICASSP 2020)

  23. arXiv:2005.07796  [pdf, other]

    cs.CV cs.LG eess.IV

    FuSSI-Net: Fusion of Spatio-temporal Skeletons for Intention Prediction Network

    Authors: Francesco Piccoli, Rajarathnam Balakrishnan, Maria Jesus Perez, Moraldeepsingh Sachdeo, Carlos Nunez, Matthew Tang, Kajsa Andreasson, Kalle Bjurek, Ria Dass Raj, Ebba Davidsson, Colin Eriksson, Victor Hagman, Jonas Sjoberg, Ying Li, L. Srikar Muppirisetty, Sohini Roychowdhury

    Abstract: Pedestrian intention recognition is very important to develop robust and safe autonomous driving (AD) and advanced driver assistance systems (ADAS) functionalities for urban driving. In this work, we develop an end-to-end pedestrian intention framework that performs well on day- and night-time scenarios. Our framework relies on object detection bounding boxes combined with skeletal features of…

    Submitted 15 May, 2020; originally announced May 2020.

    Comments: 5 pages, 6 figures, 5 tables, IEEE Asilomar SSC

  24. arXiv:2004.09249  [pdf, other]

    cs.SD cs.CL eess.AS

    CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings

    Authors: Shinji Watanabe, Michael Mandel, Jon Barker, Emmanuel Vincent, Ashish Arora, Xuankai Chang, Sanjeev Khudanpur, Vimal Manohar, Daniel Povey, Desh Raj, David Snyder, Aswin Shanmugam Subramanian, Jan Trmal, Bar Ben Yair, Christoph Boeddeker, Zhaoheng Ni, Yusuke Fujita, Shota Horiguchi, Naoyuki Kanda, Takuya Yoshioka, Neville Ryant

    Abstract: Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6). The new challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition in everyday home environments. Speech material is the same as the previous C…

    Submitted 2 May, 2020; v1 submitted 20 April, 2020; originally announced April 2020.

  25. arXiv:1911.07953  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    Sequential Multi-Frame Neural Beamforming for Speech Separation and Enhancement

    Authors: Zhong-Qiu Wang, Hakan Erdogan, Scott Wisdom, Kevin Wilson, Desh Raj, Shinji Watanabe, Zhuo Chen, John R. Hershey

    Abstract: This work introduces sequential neural beamforming, which alternates between neural network based spectral separation and beamforming based spatial separation. Our neural networks for separation use an advanced convolutional architecture trained with a novel stabilized signal-to-noise ratio loss function. For beamforming, we explore multiple ways of computing time-varying covariance matrices, incl…

    Submitted 3 November, 2020; v1 submitted 18 November, 2019; originally announced November 2019.

    Comments: 7 pages, 7 figures, IEEE SLT 2021 (slt2020.org)

  26. arXiv:1910.06407  [pdf, other]

    cs.CV cs.LG eess.IV

    FireNet: Real-time Segmentation of Fire Perimeter from Aerial Video

    Authors: Jigar Doshi, Dominic Garcia, Cliff Massey, Pablo Llueca, Nicolas Borensztein, Michael Baird, Matthew Cook, Devaki Raj

    Abstract: In this paper, we share our approach to real-time segmentation of fire perimeter from aerial full-motion infrared video. We start by describing the problem from a humanitarian aid and disaster response perspective. Specifically, we explain the importance of the problem, how it is currently resolved, and how our machine learning approach improves it. To test our models we annotate a large-scale dat…

    Submitted 14 October, 2019; originally announced October 2019.

    Comments: Published at NeurIPS 2019; Workshop on Artificial Intelligence for Humanitarian Assistance and Disaster Response (AI+HADR 2019)

  27. Probing the Information Encoded in X-vectors

    Authors: Desh Raj, David Snyder, Daniel Povey, Sanjeev Khudanpur

    Abstract: Deep neural network based speaker embeddings, such as x-vectors, have been shown to perform well in text-independent speaker recognition/verification tasks. In this paper, we use simple classifiers to investigate the contents encoded by x-vector embeddings. We probe these embeddings for information related to the speaker, channel, transcription (sentence, words, phones), and meta information about…

    Submitted 30 September, 2019; v1 submitted 13 September, 2019; originally announced September 2019.

    Comments: Accepted at IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) 2019

    Journal ref: IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (2019): 726-733

  28. Educational Data Mining and Learning Analytics - Educational Assistance for Teaching and Learning

    Authors: Ms. Ganesan Kavitha, Dr. Lawrance Raj

    Abstract: Teaching and Learning process of an educational institution needs to be monitored and effectively analysed for enhancement. Teaching and Learning is a vital element for an educational institution. It is also one of the criteria set by majority of the Accreditation Agencies around the world. Learning analytics and Educational Data Mining are relatively new. Learning analytics refers to the collecti…

    Submitted 11 June, 2017; originally announced June 2017.

    Comments: 5 Pages, 5 Tables and 1 Figure, International Journal of Computer and Organization Trends (IJCOT), March 2017