Search | arXiv e-print repository

Speakers Unembedded: Embedding-free Approach to Long-form Neural Diarization

Authors: Xiang Li, Vivek Govindan, Rohit Paturi, Sundararajan Srinivasan

Abstract: End-to-end neural diarization (EEND) models offer significant improvements over traditional embedding-based Speaker Diarization (SD) approaches but fall short on generalizing to long-form audio with a large number of speakers. The EEND-vector-clustering method mitigates this by combining local EEND with global clustering of speaker embeddings from local windows, but this requires an additional speaker embedding framework alongside the EEND module. In this paper, we propose a novel framework applying EEND both locally and globally for long-form audio without separate speaker embeddings. This approach achieves significant relative DER reductions of 13% and 10% over the conventional 1-pass EEND on the Callhome American English and RT03-CTS datasets respectively, and marginal improvements over EEND-vector-clustering without the need for additional speaker embeddings. Furthermore, we discuss the computational complexity of our proposed framework and explore strategies for reducing processing times.

Submitted 26 June, 2024; originally announced June 2024.

Comments: Accepted at INTERSPEECH 2024

arXiv:2406.17266 [pdf, other]

AG-LSEC: Audio Grounded Lexical Speaker Error Correction

Authors: Rohit Paturi, Xiang Li, Sundararajan Srinivasan

Abstract: Speaker Diarization (SD) systems are typically audio-based and operate independently of the ASR system in traditional speech transcription pipelines and can have speaker errors due to SD and/or ASR reconciliation, especially around speaker turns and regions of speech overlap. To reduce these errors, a Lexical Speaker Error Correction (LSEC), in which an external language model provides lexical information to correct the speaker errors, was recently proposed. Though the approach achieves good Word Diarization error rate (WDER) improvements, it does not use any additional acoustic information and is prone to miscorrections. In this paper, we propose to enhance and acoustically ground the LSEC system with speaker scores directly derived from the existing SD pipeline. This approach achieves significant relative WDER reductions in the range of 25-40% over the audio-based SD, ASR system and beats the LSEC system by 15-25% relative on RT03-CTS, Callhome American English and Fisher datasets.

Submitted 25 June, 2024; originally announced June 2024.

Comments: Accepted at INTERSPEECH 2024

arXiv:2405.08317 [pdf, other]

SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models

Authors: Raghuveer Peri, Sai Muralidhar Jayanthi, Srikanth Ronanki, Anshu Bhatia, Karel Mundnich, Saket Dingliwal, Nilaksh Das, Zejiang Hou, Goeric Huybrechts, Srikanth Vishnubhotla, Daniel Garcia-Romero, Sundararajan Srinivasan, Kyu J Han, Katrin Kirchhoff

Abstract: Integrated Speech and Large Language Models (SLMs) that can follow speech instructions and generate relevant text responses have gained popularity lately. However, the safety and robustness of these models remain largely unclear. In this work, we investigate the potential vulnerabilities of such instruction-following speech-language models to adversarial attacks and jailbreaking. Specifically, we design algorithms that can generate adversarial examples to jailbreak SLMs in both white-box and black-box attack settings without human involvement. Additionally, we propose countermeasures to thwart such jailbreaking attacks. Our models, trained on dialog data with speech instructions, achieve state-of-the-art performance on the spoken question-answering task, scoring over 80% on both safety and helpfulness metrics. Despite safety guardrails, experiments on jailbreaking demonstrate the vulnerability of SLMs to adversarial perturbations and transfer attacks, with average attack success rates of 90% and 10% respectively when evaluated on a dataset of carefully designed harmful questions spanning 12 different toxic categories. However, we demonstrate that our proposed countermeasures reduce the attack success significantly.

Submitted 14 May, 2024; originally announced May 2024.

Comments: 9+6 pages, Submitted to ACL 2024

arXiv:2405.08295 [pdf, other]

SpeechVerse: A Large-scale Generalizable Audio Language Model

Authors: Nilaksh Das, Saket Dingliwal, Srikanth Ronanki, Rohit Paturi, Zhaocheng Huang, Prashant Mathur, Jie Yuan, Dhanush Bekal, Xing Niu, Sai Muralidhar Jayanthi, Xilai Li, Karel Mundnich, Monica Sunkara, Sundararajan Srinivasan, Kyu J Han, Katrin Kirchhoff

Abstract: Large language models (LLMs) have shown incredible proficiency in performing tasks that require semantic understanding of natural language instructions. Recently, many works have further expanded this capability to perceive multimodal audio and text inputs, but their capabilities are often limited to specific fine-tuned tasks such as automatic speech recognition and translation. We therefore develop SpeechVerse, a robust multi-task training and curriculum learning framework that combines pre-trained speech and text foundation models via a small set of learnable parameters, while keeping the pre-trained models frozen during training. The models are instruction finetuned using continuous latent representations extracted from the speech foundation model to achieve optimal zero-shot performance on a diverse range of speech processing tasks using natural language instructions. We perform extensive benchmarking that includes comparing our model performance against traditional baselines across several datasets and tasks. Furthermore, we evaluate the model's capability for generalized instruction following by testing on out-of-domain datasets, novel prompts, and unseen tasks. Our empirical experiments reveal that our multi-task SpeechVerse model is even superior to conventional task-specific baselines on 9 out of the 11 tasks.

Submitted 31 May, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

Comments: Single Column, 13 pages

arXiv:2311.00697 [pdf, other]

End-to-End Single-Channel Speaker-Turn Aware Conversational Speech Translation

Authors: Juan Zuluaga-Gomez, Zhaocheng Huang, Xing Niu, Rohit Paturi, Sundararajan Srinivasan, Prashant Mathur, Brian Thompson, Marcello Federico

Abstract: Conventional speech-to-text translation (ST) systems are trained on single-speaker utterances, and they may not generalize to real-life scenarios where the audio contains conversations by multiple speakers. In this paper, we tackle single-channel multi-speaker conversational ST with an end-to-end and multi-task training model, named Speaker-Turn Aware Conversational Speech Translation, that combines automatic speech recognition, speech translation and speaker turn detection using special tokens in a serialized labeling format. We run experiments on the Fisher-CALLHOME corpus, which we adapted by merging the two single-speaker channels into one multi-speaker channel, thus representing the more realistic and challenging scenario with multi-speaker turns and cross-talk. Experimental results across single- and multi-speaker conditions and against conventional ST systems, show that our model outperforms the reference systems on the multi-speaker condition, while attaining comparable performance on the single-speaker condition. We release scripts for data processing and model training.

Submitted 1 November, 2023; originally announced November 2023.

Comments: Accepted at EMNLP 2023. Code: https://github.com/amazon-science/stac-speech-translation

arXiv:2307.09954 [pdf, other]

Priority-based DREAM Approach for Highly Manoeuvring Intruders in A Perimeter Defense Problem

Authors: Shridhar Velhal, Suresh Sundaram, Narasimhan Sundararajan

Abstract: In this paper, a Priority-based Dynamic REsource Allocation with decentralized Multi-task assignment (P-DREAM) approach is presented to protect a territory from highly manoeuvring intruders. In the first part, static optimization problems are formulated to compute the following parameters of the perimeter defense problem: the number of reserve stations, their locations, the priority region, the monitoring region, and the minimum number of defenders required for the monitoring purpose. The concept of a prioritized intruder is proposed here to identify and handle those critical intruders (computed based on the velocity ratio and location) to be tackled on a priority basis. The computed priority region helps to assign reserve defenders sufficiently early that they can neutralize the prioritized intruders. The monitoring region defines the minimum region to be monitored and is sufficient to handle the intruders. In the second part, the earlier developed DREAM approach is modified to incorporate the priority of an intruder. The proposed P-DREAM approach assigns the defenders to the prioritized intruders as the first task. A convex territory protection problem is simulated to illustrate the P-DREAM approach. It involves the computation of static parameters and solving the prioritized task assignments with dynamic resource allocation. Monte-Carlo simulations were conducted to verify the performance of P-DREAM, and the results clearly show that the P-DREAM approach can protect the territory with consistent performance against highly manoeuvring intruders.

Submitted 19 July, 2023; originally announced July 2023.

arXiv:2306.09313 [pdf, other]

Lexical Speaker Error Correction: Leveraging Language Models for Speaker Diarization Error Correction

Authors: Rohit Paturi, Sundararajan Srinivasan, Xiang Li

Abstract: Speaker diarization (SD) is typically used with an automatic speech recognition (ASR) system to ascribe speaker labels to recognized words. The conventional approach reconciles outputs from independently optimized ASR and SD systems, where the SD system typically uses only acoustic information to identify the speakers in the audio stream. This approach can lead to speaker errors especially around speaker turns and regions of speaker overlap. In this paper, we propose a novel second-pass speaker error correction system using lexical information, leveraging the power of modern language models (LMs). Our experiments across multiple telephony datasets show that our approach is both effective and robust. Training and tuning only on the Fisher dataset, this error correction approach leads to relative word-level diarization error rate (WDER) reductions of 15-30% on three telephony datasets: RT03-CTS, Callhome American English and held-out portions of Fisher.

Submitted 15 June, 2023; originally announced June 2023.

Comments: Accepted at INTERSPEECH 2023

arXiv:2212.11950 [pdf, other]

doi 10.1109/ITSC48978.2021.9564420

Peek into the Future Camera-based Occupant Sensing in Configurable Cabins for Autonomous Vehicles

Authors: Avinash Prabu, Renran Tian, Lingxi Li, Jialiang Le, Srinivasan Sundararajan, Saeed Barbat

Abstract: The development of fully autonomous vehicles (AVs) can potentially eliminate drivers and introduce unprecedented seating design. However, highly flexible seat configurations may lead to occupants' unconventional poses and actions. Understanding occupant behaviors and prioritizing safety features have become prominent topics in the AV research frontier. Visual sensors have the advantages of cost-efficiency and high-fidelity imaging and are becoming more widely applied for in-car sensing purposes. Occlusion is one big concern for this type of system in crowded car cabins. It is important, but largely unknown, what a visual-sensing framework will look like to support 2-D and 3-D human pose tracking for highly configurable seats. As one of the first studies to touch this topic, we peek into the future camera-based sensing framework via a simulation experiment. With representative car cabins, seat layouts, and occupant sizes constructed, camera coverage from different angles and positions is simulated and calculated. The comprehensive coverage data are synthesized through an optimization process to determine the camera layout and overall occupant coverage. The results show the needs and design of a different number of cameras to fully or partially cover all the occupants with changeable configurations of up to six seats.

Submitted 22 December, 2022; originally announced December 2022.

Comments: Conference: 2021 IEEE International Intelligent Transportation Systems Conference (ITSC) Link: https://ieeexplore.ieee.org/document/9564420

arXiv:2212.07084 [pdf, other]

Fully Complex-valued Fully Convolutional Multi-feature Fusion Network (FC2MFN) for Building Segmentation of InSAR images

Authors: Aniruddh Sikdar, Sumanth Udupa, Suresh Sundaram, Narasimhan Sundararajan

Abstract: Building segmentation in high-resolution InSAR images is a challenging task that can be useful for large-scale surveillance. Although complex-valued deep learning networks perform better than their real-valued counterparts for complex-valued SAR data, phase information is not retained throughout the network, which causes a loss of information. This paper proposes a Fully Complex-valued, Fully Convolutional Multi-feature Fusion Network (FC2MFN) for building semantic segmentation on InSAR images using a novel, fully complex-valued learning scheme. The network learns multi-scale features, performs multi-feature fusion, and has a complex-valued output. For the particularity of complex-valued InSAR data, a new complex-valued pooling layer is proposed that compares complex numbers considering their magnitude and phase. This helps the network retain the phase information even through the pooling layer. Experimental results on the simulated InSAR dataset show that FC2MFN achieves better results compared to other state-of-the-art methods in terms of segmentation performance and model complexity.

Submitted 14 December, 2022; originally announced December 2022.

Comments: Accepted for publication in IEEE Symposium Series On Computational Intelligence 2022, 8 pages, 6 figures

arXiv:2211.13280 [pdf, other]

Device Directedness with Contextual Cues for Spoken Dialog Systems

Authors: Dhanush Bekal, Sundararajan Srinivasan, Sravan Bodapati, Srikanth Ronanki, Katrin Kirchhoff

Abstract: In this work, we define barge-in verification as a supervised learning task where audio-only information is used to classify user spoken dialogue into true and false barge-ins. Following the success of pre-trained models, we use low-level speech representations from a self-supervised representation learning model for our downstream classification task. Further, we propose a novel technique to infuse lexical information directly into speech representations to improve the domain-specific language information implicitly learned during pre-training. Experiments conducted on spoken dialog data show that our proposed model trained to validate barge-in entirely from speech representations is faster by 38% relative and achieves 4.5% relative F1 score improvement over a baseline LSTM model that uses both audio and Automatic Speech Recognition (ASR) 1-best hypotheses. On top of this, our best proposed model with lexically infused representations along with contextual features provides a further relative improvement of 5.7% in the F1 score but is only 22% faster than the baseline.

Submitted 23 November, 2022; originally announced November 2022.

arXiv:2202.13870 [pdf, other]

Simulating Network Paths with Recurrent Buffering Units

Authors: Divyam Anshumaan, Sriram Balasubramanian, Shubham Tiwari, Nagarajan Natarajan, Sundararajan Sellamanickam, Venkata N. Padmanabhan

Abstract: Simulating physical network paths (e.g., Internet) is a cornerstone research problem in the emerging sub-field of AI-for-networking. We seek a model that generates end-to-end packet delay values in response to the time-varying load offered by a sender, which is typically a function of the previously output delays. The problem setting is unique, and renders the state-of-the-art text and time-series generative models inapplicable or ineffective. We formulate an ML problem at the intersection of dynamical systems, sequential decision making, and time-series modeling. We propose a novel grey-box approach to network simulation that embeds the semantics of a physical network path in a new RNN-style model called RBU, providing the interpretability of standard network simulator tools, the power of neural models, the efficiency of SGD-based techniques for learning, and yielding promising results on synthetic and real-world network traces.

Submitted 6 December, 2022; v1 submitted 23 February, 2022; originally announced February 2022.

Comments: Accepted in AAAI 2023, 19 pages, 14 figures

arXiv:2112.05863 [pdf, other]

Directed Speech Separation for Automatic Speech Recognition of Long Form Conversational Speech

Authors: Rohit Paturi, Sundararajan Srinivasan, Katrin Kirchhoff, Daniel Garcia-Romero

Abstract: Many of the recent advances in speech separation are primarily aimed at synthetic mixtures of short audio utterances with high degrees of overlap. Most of these approaches need an additional stitching step to stitch the separated speech chunks for long form audio. Since most of the approaches involve Permutation Invariant training (PIT), the order of separated speech chunks is nondeterministic and leads to difficulty in accurately stitching homogeneous speaker chunks for downstream tasks like Automatic Speech Recognition (ASR). Also, most of these models are trained with synthetic mixtures and do not generalize to real conversational data. In this paper, we propose a speaker conditioned separator trained on speaker embeddings extracted directly from the mixed signal using an over-clustering based approach. This model naturally regulates the order of the separated chunks without the need for an additional stitching step. We also introduce a data sampling strategy with real and synthetic mixtures which generalizes well to real conversational speech. With this model and data sampling technique, we show significant improvements in speaker-attributed word error rate (SA-WER) on Hub5 data.

Submitted 6 September, 2022; v1 submitted 10 December, 2021; originally announced December 2021.

Comments: Accepted for publication at Interspeech 2022

arXiv:2112.00158 [pdf]

Representation learning through cross-modal conditional teacher-student training for speech emotion recognition

Authors: Sundararajan Srinivasan, Zhaocheng Huang, Katrin Kirchhoff

Abstract: Generic pre-trained speech and text representations promise to reduce the need for large labeled datasets on specific speech and language tasks. However, it is not clear how to effectively adapt these representations for speech emotion recognition. Recent public benchmarks show the efficacy of several popular self-supervised speech representations for emotion classification. In this study, we show that the primary difference between the top-performing representations is in predicting valence while the differences in predicting activation and dominance dimensions are less pronounced. However, we show that even the best-performing HuBERT representation underperforms on valence prediction compared to a multimodal model that also incorporates text representation. We address this shortcoming by injecting lexical information into the speech representation using the multimodal model as a teacher. To improve the efficacy of our approach, we propose a novel estimate of the quality of the emotion predictions, to condition teacher-student training. We report new audio-only state-of-the-art concordance correlation coefficient (CCC) values of 0.757, 0.627, 0.671 for activation, valence and dominance predictions, respectively, on the MSP-Podcast corpus, and also state-of-the-art values of 0.667, 0.582, 0.545 on the IEMOCAP corpus.

Submitted 27 January, 2022; v1 submitted 30 November, 2021; originally announced December 2021.

Comments: Accepted for publication at IEEE ICASSP 2022
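Note on the metric: the concordance correlation coefficient (CCC) reported in the abstract above is a standard agreement measure, 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). The sketch below is a minimal illustration of that textbook formula, not the authors' evaluation code. The function name `concordance_cc` and the use of population (divide-by-N) variance are assumptions made here for illustration.

```python
from statistics import fmean


def concordance_cc(pred, gold):
    """Concordance correlation coefficient of two equal-length sequences.

    Illustrative helper (hypothetical name); uses population statistics,
    i.e. variances and covariance divided by N rather than N - 1.
    """
    if len(pred) != len(gold) or len(pred) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p, mean_g = fmean(pred), fmean(gold)
    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_g = fmean([(g - mean_g) ** 2 for g in gold])
    cov = fmean([(p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)])
    # The squared mean difference penalizes systematic bias, which plain
    # Pearson correlation ignores.
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined when both sequences are the same constant")
    return 2 * cov / denom
```

Unlike Pearson correlation, CCC drops below 1 when predictions are perfectly correlated with the labels but shifted or rescaled, which is why it is commonly used for dimensional emotion attributes such as activation, valence and dominance.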

arXiv:2107.10135 [pdf, other]

Global Outliers Detection in Wireless Sensor Networks: A Novel Approach Integrating Time-Series Analysis, Entropy, and Random Forest-based Classification

Authors: Mahmood Safaei, Maha Driss, Wadii Boulila, Elankovan A Sundararajan, Mitra Safaei

Abstract: Wireless Sensor Networks (WSNs) have recently attracted greater attention worldwide due to their practicality in monitoring, communicating, and reporting specific physical phenomena. The data collected by WSNs is often inaccurate as a result of unavoidable environmental factors, which may include noise, signal weakness, or intrusion attacks depending on the specific situation. Sending high-noise data has negative effects not just on data accuracy and network reliability, but also on the decision-making processes in the base station. Anomaly detection, or outlier detection, is the process of detecting noisy data amidst the contexts thus described. The literature contains relatively few noise detection techniques in the context of WSNs, particularly for outlier-detection algorithms applying time series analysis, which considers the effective neighbors to ensure a global-collaborative detection. Hence, the research presented in this paper is intended to design and implement a global outlier-detection approach, which allows us to find and select appropriate neighbors to ensure an adaptive collaborative detection based on time-series analysis and entropy techniques. The proposed approach applies a random forest algorithm for identifying the best results. To measure the effectiveness and efficiency of the proposed approach, a comprehensive and real scenario provided by the Intel Berkeley Research lab has been simulated. Noisy data have been injected into the collected data randomly. The results obtained from the experiments demonstrate that our approach can detect anomalies with up to 99% accuracy.

Submitted 21 July, 2021; originally announced July 2021.

arXiv:2106.11716 [pdf, ps, other]

doi 10.1016/j.engappai.2022.104717

Robust EMRAN-aided Coupled Controller for Autonomous Vehicles

Authors: Sauranil Debarshi, Suresh Sundaram, Narasimhan Sundararajan

Abstract: This paper presents a coupled, neural network-aided longitudinal cruise and lateral path-tracking controller for an autonomous vehicle with model uncertainties and experiencing unknown external disturbances. Using a feedback error learning mechanism, an inverse vehicle dynamics learning scheme utilizing an adaptive Radial Basis Function (RBF) neural network, referred to as the Extended Minimal Resource Allocating Network (EMRAN) is employed. EMRAN uses an extended Kalman filter for online learning and weight updates, and also incorporates a growing/pruning strategy for maintaining a compact network for easier real-time implementation. The online learning algorithm handles the parametric uncertainties and eliminates the effect of unknown disturbances on the road. Combined with a self-regulating learning scheme for improving generalization performance, the proposed EMRAN-aided control architecture aids a basic PID cruise and Stanley path-tracking controllers in a coupled form. Its performance and robustness to various disturbances and uncertainties are compared with the conventional PID and Stanley controllers, along with a comparison with a fuzzy-based PID controller and an active disturbance rejection control (ADRC) scheme. Simulation results are presented for both slow and high speed scenarios. The root mean square (RMS) and maximum tracking errors clearly indicate the effectiveness of the proposed control scheme in achieving better tracking performance in autonomous vehicles under unknown environments.

Submitted 8 January, 2022; v1 submitted 22 June, 2021; originally announced June 2021.

Report number: Engineering Applications of Artificial Intelligence, vol. 110, p. 104717

arXiv:2106.05792 [pdf, other]

Speaker-conversation factorial designs for diarization error analysis

Authors: Scott Seyfarth, Sundararajan Srinivasan, Katrin Kirchhoff

Abstract: Speaker diarization accuracy can be affected by both acoustics and conversation characteristics. Determining the cause of diarization errors is difficult because speaker voice acoustics and conversation structure co-vary, and the interactions between acoustics, conversational structure, and diarization accuracy are complex. This paper proposes a methodology that can distinguish independent marginal effects of acoustic and conversation characteristics on diarization accuracy by remixing conversations in a factorial design. As an illustration, this approach is used to investigate gender-related and language-related accuracy differences with three diarization systems: a baseline system using subsegment x-vector clustering, a variant of it with shorter subsegments, and a third system based on a Bayesian hidden Markov model. Our analysis shows large accuracy disparities for the baseline system primarily due to conversational structure, which are partially mitigated in the other two systems. The illustration thus demonstrates how the methodology can be used to identify and guide diarization model improvements.

Submitted 10 June, 2021; originally announced June 2021.

Comments: 5 pages, 2 figures, Interspeech 2021

arXiv:2105.11353 [pdf, other]

Change Point Detection in Nonstationary Sub-Hourly Wind Time Series

Authors: Sakitha Ariyarathne, Harsha Gangammanavar, Raanju R. Sundararajan

Abstract: In this paper, we present a change point detection method for detecting change points in multivariate nonstationary wind speed time series. The change point method identifies changes in the covariance structure and decomposes the nonstationary multivariate time series into stationary segments. We also present parametric and nonparametric simulation techniques to simulate new wind time series within each stationary segment. The proposed simulation methods retain statistical properties of the original time series and therefore, can be employed for simulation-based analysis of power systems planning and operations problems. We demonstrate the capabilities of the change point detection method through computational experiments conducted on wind speed time series at five-minute resolution. We also conduct experiments on the economic dispatch problem to illustrate the impact of nonstationarity in wind generation on conventional generation and location marginal prices.

Submitted 24 May, 2021; originally announced May 2021.

Comments: 18 pages, 3 figures, 3 tables, and 5 sections

arXiv:2103.05834 [pdf, other]

Best of Both Worlds: Robust Accented Speech Recognition with Adversarial Transfer Learning

Authors: Nilaksh Das, Sravan Bodapati, Monica Sunkara, Sundararajan Srinivasan, Duen Horng Chau

Abstract: Training deep neural networks for automatic speech recognition (ASR) requires large amounts of transcribed speech. This becomes a bottleneck for training robust models for accented speech which typically contains high variability in pronunciation and other semantics, since obtaining large amounts of annotated accented data is both tedious and costly. Often, we only have access to large amounts of unannotated speech from different accents. In this work, we leverage this unannotated data to provide semantic regularization to an ASR model that has been trained only on one accent, to improve its performance for multiple accents. We propose Accent Pre-Training (Acc-PT), a semi-supervised training strategy that combines transfer learning and adversarial training. Our approach improves the performance of a state-of-the-art ASR model by 33% on average over the baseline across multiple accents, training only on annotated samples from one standard accent, and as little as 105 minutes of unannotated speech from a target accent.

Submitted 9 March, 2021; originally announced March 2021.

arXiv:2102.07381 [pdf, other]

A Decentralized Multi-UAV Spatio-Temporal Multi-Task Allocation Approach for Perimeter Defense

Authors: Shridhar Velhal, Suresh Sundaram, Narasimhan Sundararajan

Abstract: This paper provides a new solution approach to a multi-player perimeter defense game, in which the intruders' team tries to enter the territory, and a team of defenders protects the territory by capturing intruders on the perimeter of the territory. The objective of the defenders is to detect and capture the intruders before the intruders enter the territory. Each defender independently senses the intruder and computes his trajectory to capture the assigned intruders in a cooperative fashion. The intruder is estimated to reach a specific location on the perimeter at a specific time. Each intruder is viewed as a spatio-temporal task, and the defenders are assigned to execute these spatio-temporal tasks. At any given time, the perimeter defense problem is converted into a Decentralized Multi-UAV Spatio-Temporal Multi-Task Allocation (DMUST-MTA) problem. The cost of executing a task for a trajectory is defined by a composite cost function of both the spatial and temporal components. In this paper, a decentralized consensus-based bundle algorithm has been modified to solve the spatio-temporal multi-task allocation problem, and the performance evaluation of the proposed approach is carried out based on Monte-Carlo simulations. The simulation results show the effectiveness of the proposed approach to solve the perimeter defense game under different scenarios. Performance comparison with a state-of-the-art centralized approach with full observability clearly indicates that DMUST-MTA achieves similar performance in a decentralized way with partial observability conditions, with less computational time and easy scaling up.

Submitted 15 February, 2021; originally announced February 2021.

arXiv:2012.06756 [pdf, ps, other]

Gap Reduced Minimum Error Robust Simultaneous Estimation For Unstable Nano Air Vehicle

Authors: **raj V Pushpangathan, Harikumar Kandath, Suresh Sundaram, Narasimhan Sundararajan

Abstract: This paper proposes a novel Gap Reduced Minimum Error Robust Simultaneous (GRMERS) estimator for resource-constrained Nano Aerial Vehicle (NAV) that enables a single estimator to provide simultaneous and robust estimation for a given N unstable and uncertain NAV plant models. The estimated full state feedback enables a stable flight for NAV. The GRMERS estimator is implemented utilizing a Minimum Error Robust Simultaneous (MERS) estimator and Gap Reducing (GR) compensators. The MERS estimator provides robust simultaneous estimation with minimal largest worst-case estimation error even in the presence of a bounded energy exogenous disturbance signal. The GR compensators reduce the gap between the graphs of N linear plant models to decrease the estimation error generated by the MERS estimator. A sufficient condition for the existence of a simultaneous estimator is established using LMIs and robust estimation theory. Further, MERS estimator and GR compensator design are formulated as non-convex tractable optimization problems and are solved using the population-based genetic algorithms. The performance of the GRMERS estimator consisting of MERS estimator and GR compensators from the population-based genetic algorithms is validated through simulation studies. The study results indicate that a single GRMERS estimator can produce state estimates with reduced errors for all flight conditions. The results indicate that the single GRMERS estimator is more robust than the individually designed H-infinity filters.

Submitted 12 December, 2020; originally announced December 2020.

arXiv:1910.05339 [pdf, other]

doi 10.1145/3377813.3381353

DeCaf: Diagnosing and Triaging Performance Issues in Large-Scale Cloud Services

Authors: Chetan Bansal, Sundararajan Renganathan, Ashima Asudani, Olivier Midy, Mathru Janakiraman

Abstract: Large scale cloud services use Key Performance Indicators (KPIs) for tracking and monitoring performance. They usually have Service Level Objectives (SLOs) baked into the customer agreements which are tied to these KPIs. Dependency failures, code bugs, infrastructure failures, and other problems can cause performance regressions. It is critical to minimize the time and manual effort in diagnosing and triaging such issues to reduce customer impact. Large volume of logs and mixed type of attributes (categorical, continuous) in the logs makes diagnosis of regressions non-trivial. In this paper, we present the design, implementation and experience from building and deploying DeCaf, a system for automated diagnosis and triaging of KPI issues using service logs. It uses machine learning along with pattern mining to help service owners automatically root cause and triage performance issues. We present the learnings and results from case studies on two large scale cloud services in Microsoft where DeCaf successfully diagnosed 10 known and 31 unknown issues. DeCaf also automatically triages the identified issues by leveraging historical data. Our key insights are that for any such diagnosis tool to be effective in practice, it should a) scale to large volumes of service logs and attributes, b) support different types of KPIs and ranking functions, c) be integrated into the DevOps processes.

Submitted 2 February, 2020; v1 submitted 11 October, 2019; originally announced October 2019.

Comments: To be published in the proceedings of ICSE-SEIP '20, Seoul, Republic of Korea

arXiv:1905.11883 [pdf, other]

A Case Study on the Effects of Partial Solar Eclipse on Distributed Photovoltaic Systems and Management Areas

Authors: Aditya Sundararajan, Temitayo O. Olowu, Longfei Wei, Shahinur Rahman, Arif I. Sarwat

Abstract: Photovoltaic (PV) systems depend on irradiance, ambient temperature and module temperature. A solar eclipse causes significant changes in these parameters, thereby impacting PV generation profile, performance, and power quality of larger grid where they connect to. This paper presents a case study to evaluate the impacts of the solar eclipse of August 21, 2017 on two real-world grid-tied PV systems (1.4MW and 355kW) in Miami and Daytona, Florida, the feeders they are connected to, and the management areas they belong to. Four types of analyses are conducted to obtain a comprehensive picture of the impacts using 1-minute PV generation data, hourly weather data, real feeder parameters, and daily reliability data. These analyses include: individual PV system performance measurement using power performance index; power quality analysis at the point of interconnection; a study on the operation of voltage regulating devices on the feeders during eclipse peak using an IEEE 8500 test case distribution feeder; and reliability study involving a multilayer perceptron framework for forecasting system reliability of the management areas. Results from this study provide a unique insight into how solar eclipses impact the behavior of PV systems and the grid, which would be of concern to electric utilities in future high penetration scenarios.

Submitted 24 May, 2019; originally announced May 2019.

Comments: Accepted by IET Smart Grid journal

arXiv:1809.08709 [pdf, ps, other]

A Canonical Form for First-Order Distributed Optimization Algorithms

Authors: Akhil Sundararajan, Bryan Van Scoy, Laurent Lessard

Abstract: We consider the distributed optimization problem in which a network of agents aims to minimize the average of local functions. To solve this problem, several algorithms have recently been proposed where agents perform various combinations of communication with neighbors, local gradient computations, and updates to local state variables. In this paper, we present a canonical form that characterizes any first-order distributed algorithm that can be implemented using a single round of communication and gradient computation per iteration, and where each agent stores up to two state variables. The canonical form features a minimal set of parameters that are both unique and expressive enough to capture any distributed algorithm in this class. The generic nature of our canonical form enables the systematic analysis and design of distributed optimization algorithms.

Submitted 15 July, 2019; v1 submitted 23 September, 2018; originally announced September 2018.

Journal ref: American Control Conference, pp. 4075-4080, Jul 2019

Showing 1–23 of 23 results for author: Sundararajan