Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection

Authors: Shruti Palaskar, Oggi Rudovic, Sameer Dharur, Florian Pesce, Gautam Krishna, Aswin Sivaraman, Jack Berkowitz, Ahmed Hussen Abdelaziz, Saurabh Adya, Ahmed Tewfik

Abstract: Although Large Language Models (LLMs) have shown promise for human-like conversations, they are primarily pre-trained on text data. Incorporating audio or video improves performance, but collecting large-scale multimodal data and pre-training multimodal LLMs is challenging. To this end, we propose a Fusion Low Rank Adaptation (FLoRA) technique that efficiently adapts a pre-trained unimodal LLM to consume new, previously unseen modalities via low rank adaptation. For device-directed speech detection, using FLoRA, the multimodal LLM achieves 22% relative reduction in equal error rate (EER) over the text-only approach and attains performance parity with its full fine-tuning (FFT) counterpart while needing to tune only a fraction of its parameters. Furthermore, with the newly introduced adapter dropout, FLoRA is robust to missing data, improving over FFT by 20% lower EER and 56% lower false accept rate. The proposed approach scales well for model sizes from 16M to 3B parameters.

Submitted 13 June, 2024; originally announced June 2024.

Comments: Accepted at Interspeech 2024
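
The abstract above relies on low rank adaptation, in which the pre-trained weights stay frozen and only a small low-rank update is tuned. The following is a minimal sketch of a generic LoRA-style layer, not the paper's FLoRA fusion method, which the abstract does not specify. The names `lora_forward` and `lora_param_count`, the `alpha / r` scaling, and the toy matrices are illustrative assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha=1.0):
    """Compute y = W x + (alpha / r) * B (A x).

    W (d_out x d_in) is the frozen pre-trained weight.
    A (r x d_in) and B (d_out x r) are the small trainable factors.
    The alpha / r scaling is a common convention, assumed here.
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_count(d_in, d_out, r):
    """Number of trainable parameters in the low-rank factors."""
    return r * (d_in + d_out)


# Toy example: 3x3 identity weight with a rank-1 adapter.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[1.0, 1.0, 1.0]]
B = [[1.0], [0.0], [0.0]]
print(lora_forward(W, A, B, [1.0, 2.0, 3.0]))
print(lora_param_count(1024, 1024, 8), "trainable vs", 1024 * 1024, "frozen")
```

With `B` initialised to zeros the layer reproduces the frozen model exactly, which is why adapters of this kind can be attached without disturbing the pre-trained behaviour at the start of tuning.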

arXiv:2211.07493 [pdf, ps, other]

The Potential of Neural Speech Synthesis-based Data Augmentation for Personalized Speech Enhancement

Authors: Anastasia Kuznetsova, Aswin Sivaraman, Minje Kim

Abstract: With the advances in deep learning, speech enhancement systems benefited from large neural network architectures and achieved state-of-the-art quality. However, speaker-agnostic methods are not always desirable, both in terms of quality and their complexity, when they are to be used in a resource-constrained environment. One promising way is personalized speech enhancement (PSE), which is a smaller and easier speech enhancement problem for small models to solve, because it focuses on a particular test-time user. To achieve the personalization goal, while dealing with the typical lack of personal data, we investigate the effect of data augmentation based on neural speech synthesis (NSS). In the proposed method, we show that the quality of the NSS system's synthetic data matters, and if they are good enough the augmented dataset can be used to improve the PSE system that outperforms the speaker-agnostic baseline. The proposed PSE systems show significant complexity reduction while preserving the enhancement quality.

Submitted 14 November, 2022; originally announced November 2022.

arXiv:2112.14845 [pdf, other]

Collective Autoscaling for Cloud Microservices

Authors: Vighnesh Sachidananda, Anirudh Sivaraman

Abstract: As cloud applications shift from monoliths to loosely coupled microservices, application developers must decide how many compute resources (e.g., number of replicated containers) to assign to each microservice within an application. This decision affects both (1) the dollar cost to the application developer and (2) the end-to-end latency perceived by the application user. Today, individual microservices are autoscaled independently by adding VMs whenever per-microservice CPU or memory utilization crosses a configurable threshold. However, an application user's end-to-end latency consists of time spent on multiple microservices and each microservice might need a different number of VMs to achieve an overall end-to-end latency. We present COLA, an autoscaler for microservice-based applications, which collectively allocates VMs to microservices with a global goal of minimizing dollar cost while keeping end-to-end application latency under a given target. Using 5 open-source applications, we compared COLA to several utilization and machine learning based autoscalers. We evaluate COLA across different compute settings on Google Kubernetes Engine (GKE) in which users manage compute resources, GKE standard, and a new mode of operation in which the cloud provider manages compute infrastructure, GKE Autopilot. COLA meets a desired median or tail latency target on 53 of 63 workloads where it provides a cost reduction of 19.3%, on average, over the next cheapest autoscaler. COLA is the most cost effective autoscaling policy for 48 of these 53 workloads. The cost savings from managing a cluster with COLA result in COLA paying for its training cost in a few days. On smaller applications, for which we can exhaustively search microservice configurations, we find that COLA is optimal for 90% of cases and near optimal otherwise.

Submitted 7 August, 2022; v1 submitted 1 December, 2021; originally announced December 2021.

arXiv:2110.10739 [pdf, other]

Adapting Speech Separation to Real-World Meetings Using Mixture Invariant Training

Authors: Aswin Sivaraman, Scott Wisdom, Hakan Erdogan, John R. Hershey

Abstract: The recently-proposed mixture invariant training (MixIT) is an unsupervised method for training single-channel sound separation models in the sense that it does not require ground-truth isolated reference sources. In this paper, we investigate using MixIT to adapt a separation model on real far-field overlapping reverberant and noisy speech data from the AMI Corpus. The models are tested on real AMI recordings containing overlapping speech, and are evaluated subjectively by human listeners. To objectively evaluate our models, we also devise a synthetic AMI test set. For human evaluations on real recordings, we also propose a modification of the standard MUSHRA protocol to handle imperfect reference signals, which we call MUSHIRA. Holding network architectures constant, we find that a fine-tuned semi-supervised model yields the largest SI-SNR improvement, PESQ scores, and human listening ratings across synthetic and real datasets, outperforming unadapted generalist models trained on orders of magnitude more data. Our results show that unsupervised learning through MixIT enables model adaptation on real-world unlabeled spontaneous speech recordings.

Submitted 20 October, 2021; originally announced October 2021.
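
The abstract above describes MixIT as needing no isolated reference sources. A rough sketch of the underlying idea follows: the model separates a sum of two mixtures, and the loss takes the best assignment of each estimated source to one of the two original mixtures. Mean-squared error is swapped in here as a stand-in for the signal-level loss, and the function names and toy signals are illustrative assumptions rather than the authors' implementation.

```python
from itertools import product


def mse(a, b):
    """Mean-squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mixit_loss(mix1, mix2, est_sources):
    """Minimum remix loss over all binary source-to-mixture assignments.

    Each estimated source is assigned to exactly one of the two
    reference mixtures; the assigned sources are summed and compared
    with that mixture. Only mixtures are needed as references.
    """
    n = len(mix1)
    best = float("inf")
    for assignment in product((0, 1), repeat=len(est_sources)):
        remix = [[0.0] * n, [0.0] * n]
        for source, k in zip(est_sources, assignment):
            for t, value in enumerate(source):
                remix[k][t] += value
        loss = mse(remix[0], mix1) + mse(remix[1], mix2)
        best = min(best, loss)
    return best


# Toy example: three estimated sources that remix perfectly.
mix1 = [1.0, 2.0, 3.0]
mix2 = [0.0, -1.0, 1.0]
estimates = [[1.0, 0.0, 0.0], [0.0, 2.0, 3.0], [0.0, -1.0, 1.0]]
print(mixit_loss(mix1, mix2, estimates))
```

The exhaustive search grows as 2 to the power of the number of output sources, which is tolerable for the handful of outputs a separation model typically produces.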

arXiv:2105.03542 [pdf, other]

Zero-Shot Personalized Speech Enhancement through Speaker-Informed Model Selection

Authors: Aswin Sivaraman, Minje Kim

Abstract: This paper presents a novel zero-shot learning approach towards personalized speech enhancement through the use of a sparsely active ensemble model. Optimizing speech denoising systems towards a particular test-time speaker can improve performance and reduce run-time complexity. However, test-time model adaptation may be challenging if collecting data from the test-time speaker is not possible. To this end, we propose using an ensemble model wherein each specialist module denoises noisy utterances from a distinct partition of training set speakers. The gating module inexpensively estimates test-time speaker characteristics in the form of an embedding vector and selects the most appropriate specialist module for denoising the test signal. Grouping the training set speakers into non-overlapping semantically similar groups is non-trivial and ill-defined. To do this, we first train a Siamese network using noisy speech pairs to maximize or minimize the similarity of its output vectors depending on whether the utterances derive from the same speaker or not. Next, we perform k-means clustering on the latent space formed by the averaged embedding vectors per training set speaker. In this way, we designate speaker groups and train specialist modules optimized around partitions of the complete training set. Our experiments show that ensemble models made up of low-capacity specialists can outperform high-capacity generalist models with greater efficiency and improved adaptation towards unseen test-time speakers.

Submitted 7 May, 2021; originally announced May 2021.

Comments: 5 pages, 3 figures, submitted to 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
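
The abstract above describes averaging embedding vectors per training speaker, clustering them, and letting a gating module pick a specialist for a test-time embedding. One way to picture the selection step is a nearest-centroid lookup, sketched below. The abstract does not state how the gating module decides, so the Euclidean distance, the function names, and the toy centroids are all assumptions.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length embedding vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def select_specialist(embedding, centroids):
    """Return the index of the cluster centroid nearest to the embedding.

    Each centroid stands for one speaker group, and therefore for the
    specialist module trained on that group (hypothetical mapping).
    """
    distances = [math.dist(embedding, c) for c in centroids]
    return distances.index(min(distances))


# Toy example: two speaker groups in a 2-D embedding space.
centroids = [[0.0, 0.0], [10.0, 10.0]]
speaker_embedding = mean_vector([[8.0, 9.0], [10.0, 7.0]])
print(select_specialist(speaker_embedding, centroids))
```

Because only one specialist runs per utterance, the ensemble stays sparsely active: inference cost is one gating pass plus one small denoiser.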

arXiv:2104.02018 [pdf, other]

Personalized Speech Enhancement through Self-Supervised Data Augmentation and Purification

Authors: Aswin Sivaraman, Sunwoo Kim, Minje Kim

Abstract: Training personalized speech enhancement models is innately a no-shot learning problem due to privacy constraints and limited access to noise-free speech from the target user. If there is an abundance of unlabeled noisy speech from the test-time user, a personalized speech enhancement model can be trained using self-supervised learning. One straightforward approach to model personalization is to use the target speaker's noisy recordings as pseudo-sources. Then, a pseudo denoising model learns to remove injected training noises and recover the pseudo-sources. However, this approach is volatile as it depends on the quality of the pseudo-sources, which may be too noisy. As a remedy, we propose an improvement to the self-supervised approach through data purification. We first train an SNR predictor model to estimate the frame-by-frame SNR of the pseudo-sources. Then, the predictor's estimates are converted into weights which adjust the frame-by-frame contribution of the pseudo-sources towards training the personalized model. We empirically show that the proposed data purification step improves the usability of the speaker-specific noisy data in the context of personalized speech enhancement. Without relying on any clean speech recordings or speaker embeddings, our approach may be seen as privacy-preserving.

Submitted 5 April, 2021; originally announced April 2021.

Comments: 5 pages, 3 figures, under review
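
The data purification step in the abstract above converts frame-by-frame SNR estimates into weights on each frame's training contribution. The abstract does not give the conversion, so the sketch below assumes a logistic mapping from SNR in dB to a weight in (0, 1); `snr_to_weight`, its `midpoint_db` and `slope` parameters, and the normalisation are hypothetical choices.

```python
import math


def snr_to_weight(snr_db, midpoint_db=0.0, slope=0.5):
    """Map a predicted frame SNR (dB) to a weight in (0, 1).

    The logistic shape is an assumption: cleaner frames get weights
    near 1, noisier frames get weights near 0.
    """
    return 1.0 / (1.0 + math.exp(-slope * (snr_db - midpoint_db)))


def weighted_frame_loss(frame_losses, frame_snrs_db):
    """Average per-frame losses, weighted by predicted frame quality."""
    weights = [snr_to_weight(s) for s in frame_snrs_db]
    total = sum(weights)
    return sum(w * l for w, l in zip(weights, frame_losses)) / total


# Toy example: the clean frame (20 dB) dominates the noisy one (-20 dB).
print(weighted_frame_loss([1.0, 3.0], [20.0, -20.0]))
```

With equal SNR estimates the weighting reduces to a plain mean, so the scheme only changes training where the predictor sees a quality difference between frames.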

arXiv:2104.02017 [pdf, other]

doi 10.1109/JSTSP.2022.3181782

Efficient Personalized Speech Enhancement through Self-Supervised Learning

Authors: Aswin Sivaraman, Minje Kim

Abstract: This work presents self-supervised learning methods for developing monaural speaker-specific (i.e., personalized) speech enhancement models. While generalist models must broadly address many speakers, specialist models can adapt their enhancement function towards a particular speaker's voice, expecting to solve a narrower problem. Hence, specialists are capable of achieving more optimal performance in addition to reducing computational complexity. However, naive personalization methods can require clean speech from the target user, which is inconvenient to acquire, e.g., due to subpar recording conditions. To this end, we pose personalization as either a zero-shot task, in which no additional clean speech of the target speaker is used for training, or a few-shot learning task, in which the goal is to minimize the duration of the clean speech used for transfer learning. With this paper, we propose self-supervised learning methods as a solution to both zero- and few-shot personalization tasks. The proposed methods are designed to learn the personalized speech features from unlabeled data (i.e., in-the-wild noisy recordings from the target user) without knowing the corresponding clean sources. Our experiments investigate three different self-supervised learning mechanisms. The results show that self-supervised models achieve zero-shot and few-shot personalization using fewer model parameters and less clean data from the target user, achieving the data efficiency and model compression goals.

Submitted 27 July, 2022; v1 submitted 5 April, 2021; originally announced April 2021.

Comments: 15 pages, 9 figures, published in IEEE JSTSP 2022

arXiv:2102.04911 [pdf, other]

The case for model-driven interpretability of delay-based congestion control protocols

Authors: Muhammad Khan, Yasir Zaki, Shiva Iyer, Talal Ahamd, Thomas Pötsch, Jay Chen, Anirudh Sivaraman, Lakshmi Subramanian

Abstract: Analyzing and interpreting the exact behavior of new delay-based congestion control protocols with complex non-linear control loops is exceptionally difficult in highly variable networks such as cellular networks. This paper proposes a Model-Driven Interpretability (MDI) congestion control framework, which derives a model version of a delay-based protocol by simplifying a congestion control protocol's response into a guided random walk over a two-dimensional Markov model. We demonstrate the case for the MDI framework by using MDI to analyze and interpret the behavior of two delay-based protocols over cellular channels: Verus and Copa. Our results show a successful approximation of throughput and delay characteristics of the protocols' model versions across variable network conditions. The learned model of a protocol provides key insights into an algorithm's convergence properties.

Submitted 9 February, 2021; originally announced February 2021.

arXiv:2011.03426

Self-Supervised Learning from Contrastive Mixtures for Personalized Speech Enhancement

Authors: Aswin Sivaraman, Minje Kim

Abstract: This work explores how self-supervised learning can be universally used to discover speaker-specific features towards enabling personalized speech enhancement models. We specifically address the few-shot learning scenario where access to clean recordings of a test-time speaker is limited to a few seconds, but noisy recordings of the speaker are abundant. We develop a simple contrastive learning procedure which treats the abundant noisy data as makeshift training targets through pairwise noise injection: the model is pretrained to maximize agreement between pairs of differently deformed identical utterances and to minimize agreement between pairs of similarly deformed nonidentical utterances. Our experiments compare the proposed pretraining approach with two baseline alternatives: speaker-agnostic fully-supervised pretraining, and speaker-specific self-supervised pretraining without contrastive loss terms. Of all three approaches, the proposed method using contrastive mixtures is found to be most robust to model compression (using 85% fewer parameters) and reduced clean speech (requiring only 3 seconds).

Submitted 9 August, 2022; v1 submitted 6 November, 2020; originally announced November 2020.

Comments: This work has been superseded by article 2104.02017

arXiv:2005.08128 [pdf, other]

Sparse Mixture of Local Experts for Efficient Speech Enhancement

Authors: Aswin Sivaraman, Minje Kim

Abstract: In this paper, we investigate a deep learning approach for speech denoising through an efficient ensemble of specialist neural networks. By splitting up the speech denoising task into non-overlapping subproblems and introducing a classifier, we are able to improve denoising performance while also reducing computational complexity. More specifically, the proposed model incorporates a gating network which assigns noisy speech signals to an appropriate specialist network based on either speech degradation level or speaker gender. In our experiments, a baseline recurrent network is compared against an ensemble of similarly-designed smaller recurrent networks regulated by the auxiliary gating network. Using stochastically generated batches from a large noisy speech corpus, the proposed model learns to estimate a time-frequency masking matrix based on the magnitude spectrogram of an input mixture signal. Both baseline and specialist networks are trained to estimate the ideal ratio mask, while the gating network is trained to perform subproblem classification. Our findings demonstrate that a fine-tuned ensemble network is able to exceed the speech denoising capabilities of a generalist network, doing so with fewer model parameters.

Submitted 16 May, 2020; originally announced May 2020.

Comments: 5 pages, 5 figures

Journal ref: Published in Interspeech 2020
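
The networks in the abstract above are trained to estimate the ideal ratio mask over a magnitude spectrogram. As a small illustration, the sketch below uses one common magnitude-domain definition, speech / (speech + noise) per time-frequency bin; other definitions use power spectra or a square root, and the abstract does not say which variant the paper adopts. The function names and the `eps` guard are assumptions.

```python
def ideal_ratio_mask(speech_mag, noise_mag, eps=1e-8):
    """Per-bin mask S / (S + N) from magnitude spectrograms.

    Inputs are lists of frames, each a list of non-negative bin
    magnitudes. eps avoids division by zero in silent bins.
    """
    return [
        [s / (s + n + eps) for s, n in zip(s_frame, n_frame)]
        for s_frame, n_frame in zip(speech_mag, noise_mag)
    ]


def apply_mask(mask, mixture_mag):
    """Scale each mixture bin by its mask value to estimate speech."""
    return [
        [m * x for m, x in zip(m_frame, x_frame)]
        for m_frame, x_frame in zip(mask, mixture_mag)
    ]


# Toy example: one frame with two frequency bins.
speech = [[3.0, 0.0]]
noise = [[1.0, 2.0]]
mask = ideal_ratio_mask(speech, noise)
print(mask)
print(apply_mask(mask, [[4.0, 2.0]]))
```

Mask values lie between 0 and 1, so a network can predict them with a sigmoid output and the target is available whenever clean speech and noise are known separately at training time.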

arXiv:1902.00956 [pdf, ps, other]

Deep Autotuner: A Data-Driven Approach to Natural-Sounding Pitch Correction for Singing Voice in Karaoke Performances

Authors: Sanna Wager, George Tzanetakis, Cheng-i Wang, Lijiang Guo, Aswin Sivaraman, Minje Kim

Abstract: We describe a machine-learning approach to pitch correcting a solo singing performance in a karaoke setting, where the solo voice and accompaniment are on separate tracks. The proposed approach addresses the situation where no musical score of the vocals nor the accompaniment exists: It predicts the amount of correction from the relationship between the spectral contents of the vocal and accompaniment tracks. Hence, the pitch shift in cents suggested by the model can be used to make the voice sound in tune with the accompaniment. This approach differs from commercially used automatic pitch correction systems, where notes in the vocal tracks are shifted to be centered around notes in a user-defined score or mapped to the closest pitch among the twelve equal-tempered scale degrees. We train the model using a dataset of 4,702 amateur karaoke performances selected for good intonation. We present a Convolutional Gated Recurrent Unit (CGRU) model to accomplish this task. This method can be extended into unsupervised pitch correction of a vocal performance, popularly referred to as autotuning.

Submitted 3 February, 2019; originally announced February 2019.

arXiv:1805.02603 [pdf, ps, other]

A Data-Driven Approach to Smooth Pitch Correction for Singing Voice in Pop Music

Authors: Sanna Wager, Lijiang Guo, Aswin Sivaraman, Minje Kim

Abstract: In this paper, we present a machine-learning approach to pitch correction for voice in a karaoke setting, where the vocals and accompaniment are on separate tracks and time-aligned. The network takes as input the time-frequency representation of the two tracks and predicts the amount of pitch-shifting in cents required to make the voice sound in-tune with the accompaniment. It is trained on examples of semi-professional singing. The proposed approach differs from existing real-time pitch correction methods by replacing pitch tracking and mapping to a discrete set of notes---for example, the twelve classes of the equal-tempered scale---with learning a correction that is continuous both in frequency and in time directly from the harmonics of the vocal and accompaniment tracks. A Recurrent Neural Network (RNN) model provides a correction that takes context into account, preserving expressive pitch bending and vibrato. This method can be extended into unsupervised pitch correction of a vocal performance---popularly referred to as autotuning.

Submitted 7 May, 2018; originally announced May 2018.
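
The model in the abstract above predicts a pitch shift in cents. For readers unfamiliar with the unit, a cent is one hundredth of an equal-tempered semitone, so 1200 cents make an octave. The helper names below are illustrative and not taken from the paper; the sketch only shows the standard conversion between cents and frequency ratios.

```python
import math


def cents_between(f_ref_hz, f_hz):
    """Interval from f_ref_hz to f_hz in cents (1200 cents per octave)."""
    return 1200.0 * math.log2(f_hz / f_ref_hz)


def shift_by_cents(f_hz, cents):
    """Frequency obtained by shifting f_hz by the given number of cents."""
    return f_hz * 2.0 ** (cents / 1200.0)


# Toy example: a voice singing 435 Hz against an A4 accompaniment at 440 Hz.
correction = cents_between(435.0, 440.0)
print(round(correction, 2), "cents sharp needed")
print(shift_by_cents(435.0, correction))
```

Because the correction is a real number of cents rather than a note label, it can vary smoothly over time, which is what lets a continuous method preserve vibrato and pitch bends.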

arXiv:1801.09774 [pdf, other]

On Psychoacoustically Weighted Cost Functions Towards Resource-Efficient Deep Neural Networks for Speech Denoising

Authors: Kai Zhen, Aswin Sivaraman, Jongmo Sung, Minje Kim

Abstract: We present a psychoacoustically enhanced cost function to balance network complexity and perceptual performance of deep neural networks for speech denoising. While training the network, we utilize perceptual weights added to the ordinary mean-squared error to emphasize contribution from frequency bins which are most audible while ignoring error from inaudible bins. To generate the weights, we employ psychoacoustic models to compute the global masking threshold from the clean speech spectra. We then evaluate the speech denoising performance of our perceptually guided neural network by using both objective and perceptual sound quality metrics, testing on various network structures ranging from shallow and narrow ones to deep and wide ones. The experimental results showcase our method as a valid approach for infusing perceptual significance to deep neural network operations. In particular, the more perceptually sensible enhancement in performance seen by simple neural network topologies proves that the proposed method can lead to resource-efficient speech denoising implementations in small devices without degrading the perceived signal fidelity.

Submitted 29 January, 2018; originally announced January 2018.

Comments: 5 pages, 4 figures

Showing 1–13 of 13 results for author: Sivaraman, A