Search | arXiv e-print repository

Implicit Self-supervised Language Representation for Spoken Language Diarization

Authors: Jagabandhu Mishra, S. R. Mahadeva Prasanna

Abstract: In a code-switched (CS) scenario, the use of spoken language diarization (LD) as a pre-possessing system is essential. Further, the use of implicit frameworks is preferable over the explicit framework, as it can be easily adapted to deal with low/zero resource languages. Inspired by speaker diarization (SD) literature, three frameworks based on (1) fixed segmentation, (2) change point-based segmen… ▽ More In a code-switched (CS) scenario, the use of spoken language diarization (LD) as a pre-possessing system is essential. Further, the use of implicit frameworks is preferable over the explicit framework, as it can be easily adapted to deal with low/zero resource languages. Inspired by speaker diarization (SD) literature, three frameworks based on (1) fixed segmentation, (2) change point-based segmentation and (3) E2E are proposed to perform LD. The initial exploration with synthetic TTSF-LD dataset shows, using x-vector as implicit language representation with appropriate analysis window length ($N$) can able to achieve at per performance with explicit LD. The best implicit LD performance of $6.38$ in terms of Jaccard error rate (JER) is achieved by using the E2E framework. However, considering the E2E framework the performance of implicit LD degrades to $60.4$ while using with practical Microsoft CS (MSCS) dataset. The difference in performance is mostly due to the distributional difference between the monolingual segment duration of secondary language in the MSCS and TTSF-LD datasets. Moreover, to avoid segment smoothing, the smaller duration of the monolingual segment suggests the use of a small value of $N$. At the same time with small $N$, the x-vector representation is unable to capture the required language discrimination due to the acoustic similarity, as the same speaker is speaking both languages. Therefore, to resolve the issue a self-supervised implicit language representation is proposed in this study. In comparison with the x-vector representation, the proposed representation provides a relative improvement of $63.9\%$ and achieved a JER of $21.8$ using the E2E framework. △ Less

Submitted 21 August, 2023; originally announced August 2023.

Comments: Planning to Submit in IEEE-JSTSP

arXiv:2306.12913 [pdf, other]

Implicit spoken language diarization

Authors: Jagabandhu Mishra, Amartya Chowdhury, S. R. Mahadeva Prasanna

Abstract: Spoken language diarization (LD) and related tasks are mostly explored using the phonotactic approach. Phonotactic approaches mostly use explicit way of language modeling, hence requiring intermediate phoneme modeling and transcribed data. Alternatively, the ability of deep learning approaches to model temporal dynamics may help for the implicit modeling of language information through deep embedd… ▽ More Spoken language diarization (LD) and related tasks are mostly explored using the phonotactic approach. Phonotactic approaches mostly use explicit way of language modeling, hence requiring intermediate phoneme modeling and transcribed data. Alternatively, the ability of deep learning approaches to model temporal dynamics may help for the implicit modeling of language information through deep embedding vectors. Hence this work initially explores the available speaker diarization frameworks that capture speaker information implicitly to perform LD tasks. The performance of the LD system on synthetic code-switch data using the end-to-end x-vector approach is 6.78% and 7.06%, and for practical data is 22.50% and 60.38%, in terms of diarization error rate and Jaccard error rate (JER), respectively. The performance degradation is due to the data imbalance and resolved to some extent by using pre-trained wave2vec embeddings that provide a relative improvement of 30.74% in terms of JER. △ Less

Submitted 22 June, 2023; originally announced June 2023.

arXiv:2302.13209 [pdf, other]

I-MSV 2022: Indic-Multilingual and Multi-sensor Speaker Verification Challenge

Authors: Jagabandhu Mishra, Mrinmoy Bhattacharjee, S. R. Mahadeva Prasanna

Abstract: Speaker Verification (SV) is a task to verify the claimed identity of the claimant using his/her voice sample. Though there exists an ample amount of research in SV technologies, the development concerning a multilingual conversation is limited. In a country like India, almost all the speakers are polyglot in nature. Consequently, the development of a Multilingual SV (MSV) system on the data colle… ▽ More Speaker Verification (SV) is a task to verify the claimed identity of the claimant using his/her voice sample. Though there exists an ample amount of research in SV technologies, the development concerning a multilingual conversation is limited. In a country like India, almost all the speakers are polyglot in nature. Consequently, the development of a Multilingual SV (MSV) system on the data collected in the Indian scenario is more challenging. With this motivation, the Indic- Multilingual Speaker Verification (I-MSV) Challenge 2022 has been designed for understanding and comparing the state-of-the-art SV techniques. For the challenge, approximately $100$ hours of data spoken by $100$ speakers has been collected using $5$ different sensors in $13$ Indian languages. The data is divided into development, training, and testing sets and has been made publicly available for further research. The goal of this challenge is to make the SV system robust to language and sensor variations between enrollment and testing. In the challenge, participants were asked to develop the SV system in two scenarios, viz. constrained and unconstrained. The best system in the constrained and unconstrained scenario achieved a performance of $2.12\%$ and $0.26\%$ in terms of Equal Error Rate (EER), respectively. △ Less

Submitted 25 February, 2023; originally announced February 2023.

arXiv:2302.05265 [pdf, other]

doi 10.1007/s00034-024-02743-w

Spoken language change detection inspired by speaker change detection

Authors: Jagabandhu Mishra, S. R. Mahadeva Prasanna

Abstract: Spoken language change detection (LCD) refers to identifying the language transitions in a code-switched utterance. Similarly, identifying the speaker transitions in a multispeaker utterance is known as speaker change detection (SCD). Since tasks-wise both are similar, the architecture/framework developed for the SCD task may be suitable for the LCD task. Hence, the aim of the present work is to d… ▽ More Spoken language change detection (LCD) refers to identifying the language transitions in a code-switched utterance. Similarly, identifying the speaker transitions in a multispeaker utterance is known as speaker change detection (SCD). Since tasks-wise both are similar, the architecture/framework developed for the SCD task may be suitable for the LCD task. Hence, the aim of the present work is to develop LCD systems inspired by SCD. Initially, both LCD and SCD are performed by humans. The study suggests humans require (a) a larger duration around the change point and (b) language-specific prior exposure, for performing LCD as compared to SCD. The larger duration requirement is incorporated by increasing the analysis window length of the unsupervised distance-based approach. This leads to a relative performance improvement of 29.1% and 2.4%, and a priori language knowledge provides a relative improvement of 31.63% and 14.27% on the synthetic and practical codeswitched datasets, respectively. The performance difference between the practical and synthetic datasets is mostly due to differences in the distribution of the monolingual segment duration. △ Less

Submitted 10 February, 2023; originally announced February 2023.

arXiv:2203.02680

Language vs Speaker Change: A Comparative Study

Authors: Jagabandhu Mishra, S. R. Mahadeva Prasanna

Abstract: Spoken language change detection (LCD) refers to detecting language switching points in a multilingual speech signal. Speaker change detection (SCD) refers to locating the speaker change points in a multispeaker speech signal. The objective of this work is to understand the challenges in LCD task by comparing it with SCD task. Human subjective study for change detection is performed for LCD and SC… ▽ More Spoken language change detection (LCD) refers to detecting language switching points in a multilingual speech signal. Speaker change detection (SCD) refers to locating the speaker change points in a multispeaker speech signal. The objective of this work is to understand the challenges in LCD task by comparing it with SCD task. Human subjective study for change detection is performed for LCD and SCD. This study demonstrates that LCD requires larger duration spectro-temporal information around the change point compared to SCD. Based on this, the work explores automatic distance based and model based LCD approaches. The model based ones include Gaussian mixture model and universal background model (GMM-UBM), attention, and Generative adversarial network (GAN) based approaches. Both the human and automatic LCD tasks infer that the performance of the LCD task improves by incorporating more and more spectro-temporal duration. △ Less

Submitted 6 October, 2023; v1 submitted 5 March, 2022; originally announced March 2022.

Comments: The work is substantially modified. The new version of the same will be submitted soon

arXiv:2101.04261 [pdf, other]

NxTF: An API and Compiler for Deep Spiking Neural Networks on Intel Loihi

Authors: Bodo Rueckauer, Connor Bybee, Ralf Goettsche, Yashwardhan Singh, Joyesh Mishra, Andreas Wild

Abstract: Spiking Neural Networks (SNNs) are a promising paradigm for efficient event-driven processing of spatio-temporally sparse data streams. SNNs have inspired the design and can take advantage of the emerging class of neuromorphic processors like Intel Loihi. These novel hardware architectures expose a variety of constraints that affect firmware, compiler and algorithm development alike. To enable rap… ▽ More Spiking Neural Networks (SNNs) are a promising paradigm for efficient event-driven processing of spatio-temporally sparse data streams. SNNs have inspired the design and can take advantage of the emerging class of neuromorphic processors like Intel Loihi. These novel hardware architectures expose a variety of constraints that affect firmware, compiler and algorithm development alike. To enable rapid and flexible development of SNN algorithms on Loihi, we developed NxTF: a programming interface derived from Keras and compiler optimized for map** deep convolutional SNNs to the multi-core Intel Loihi architecture. We evaluate NxTF on DNNs trained directly on spikes as well as models converted from traditional DNNs, processing both sparse event-based and dense frame-based data sets. Further, we assess the effectiveness of the compiler to distribute models across a large number of cores and to compress models by exploiting Loihi's weight sharing features. Finally, we evaluate model accuracy, energy and time to solution compared to other architectures. The compiler achieves near optimal resource utilization of 80% across 16 Loihi chips for a 28-layer, 4M parameter MobileNet model with input size 128x128. In addition, we report the lowest error rate of 8.52% for the CIFAR-10 dataset on neuromorphic hardware, using an off-the-shelf MobileNet. △ Less

Submitted 11 January, 2021; originally announced January 2021.

arXiv:2004.12691 [pdf, other]

Neuromorphic Nearest-Neighbor Search Using Intel's Pohoiki Springs

Authors: E. Paxon Frady, Garrick Orchard, David Florey, Nabil Imam, Ruokun Liu, Joyesh Mishra, Jonathan Tse, Andreas Wild, Friedrich T. Sommer, Mike Davies

Abstract: Neuromorphic computing applies insights from neuroscience to uncover innovations in computing technology. In the brain, billions of interconnected neurons perform rapid computations at extremely low energy levels by leveraging properties that are foreign to conventional computing systems, such as temporal spiking codes and finely parallelized processing units integrating both memory and computatio… ▽ More Neuromorphic computing applies insights from neuroscience to uncover innovations in computing technology. In the brain, billions of interconnected neurons perform rapid computations at extremely low energy levels by leveraging properties that are foreign to conventional computing systems, such as temporal spiking codes and finely parallelized processing units integrating both memory and computation. Here, we showcase the Pohoiki Springs neuromorphic system, a mesh of 768 interconnected Loihi chips that collectively implement 100 million spiking neurons in silicon. We demonstrate a scalable approximate k-nearest neighbor (k-NN) algorithm for searching large databases that exploits neuromorphic principles. Compared to state-of-the-art conventional CPU-based implementations, we achieve superior latency, index build time, and energy efficiency when evaluated on several standard datasets containing over 1 million high-dimensional patterns. Further, the system supports adding new data points to the indexed database online in O(1) time unlike all but brute force conventional k-NN implementations. △ Less

Submitted 27 April, 2020; originally announced April 2020.

Comments: 9 pages, 8 figures, 3 tables, submission to NICE 2020

Showing 1–7 of 7 results for author: Mishra, J