Showing 1–13 of 13 results for author: Sivaraman, A

  1. arXiv:2406.09617  [pdf, other]

    cs.CL cs.HC eess.AS

    Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection

    Authors: Shruti Palaskar, Oggi Rudovic, Sameer Dharur, Florian Pesce, Gautam Krishna, Aswin Sivaraman, Jack Berkowitz, Ahmed Hussen Abdelaziz, Saurabh Adya, Ahmed Tewfik

    Abstract: Although Large Language Models (LLMs) have shown promise for human-like conversations, they are primarily pre-trained on text data. Incorporating audio or video improves performance, but collecting large-scale multimodal data and pre-training multimodal LLMs is challenging. To this end, we propose a Fusion Low Rank Adaptation (FLoRA) technique that efficiently adapts a pre-trained unimodal LLM to…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech 2024

  2. arXiv:2211.07493  [pdf, ps, other]

    eess.AS cs.SD

    The Potential of Neural Speech Synthesis-based Data Augmentation for Personalized Speech Enhancement

    Authors: Anastasia Kuznetsova, Aswin Sivaraman, Minje Kim

    Abstract: With the advances in deep learning, speech enhancement systems benefited from large neural network architectures and achieved state-of-the-art quality. However, speaker-agnostic methods are not always desirable, both in terms of quality and their complexity, when they are to be used in a resource-constrained environment. One promising way is personalized speech enhancement (PSE), which is a smalle…

    Submitted 14 November, 2022; originally announced November 2022.

  3. arXiv:2112.14845  [pdf, other]

    cs.DC eess.SY

    Collective Autoscaling for Cloud Microservices

    Authors: Vighnesh Sachidananda, Anirudh Sivaraman

    Abstract: As cloud applications shift from monoliths to loosely coupled microservices, application developers must decide how many compute resources (e.g., number of replicated containers) to assign to each microservice within an application. This decision affects both (1) the dollar cost to the application developer and (2) the end-to-end latency perceived by the application user. Today, individual microse…

    Submitted 7 August, 2022; v1 submitted 1 December, 2021; originally announced December 2021.

  4. arXiv:2110.10739  [pdf, other]

    cs.SD eess.AS

    Adapting Speech Separation to Real-World Meetings Using Mixture Invariant Training

    Authors: Aswin Sivaraman, Scott Wisdom, Hakan Erdogan, John R. Hershey

    Abstract: The recently-proposed mixture invariant training (MixIT) is an unsupervised method for training single-channel sound separation models in the sense that it does not require ground-truth isolated reference sources. In this paper, we investigate using MixIT to adapt a separation model on real far-field overlapping reverberant and noisy speech data from the AMI Corpus. The models are tested on real A…

    Submitted 20 October, 2021; originally announced October 2021.

  5. arXiv:2105.03542  [pdf, other]

    eess.AS cs.LG cs.SD

    Zero-Shot Personalized Speech Enhancement through Speaker-Informed Model Selection

    Authors: Aswin Sivaraman, Minje Kim

    Abstract: This paper presents a novel zero-shot learning approach towards personalized speech enhancement through the use of a sparsely active ensemble model. Optimizing speech denoising systems towards a particular test-time speaker can improve performance and reduce run-time complexity. However, test-time model adaptation may be challenging if collecting data from the test-time speaker is not possible. To…

    Submitted 7 May, 2021; originally announced May 2021.

    Comments: 5 pages, 3 figures, submitted to 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)

  6. arXiv:2104.02018  [pdf, other]

    eess.AS cs.LG cs.SD

    Personalized Speech Enhancement through Self-Supervised Data Augmentation and Purification

    Authors: Aswin Sivaraman, Sunwoo Kim, Minje Kim

    Abstract: Training personalized speech enhancement models is innately a no-shot learning problem due to privacy constraints and limited access to noise-free speech from the target user. If there is an abundance of unlabeled noisy speech from the test-time user, a personalized speech enhancement model can be trained using self-supervised learning. One straightforward approach to model personalization is to u…

    Submitted 5 April, 2021; originally announced April 2021.

    Comments: 5 pages, 3 figures, under review

  7. arXiv:2104.02017  [pdf, other]

    eess.AS cs.LG cs.SD

    Efficient Personalized Speech Enhancement through Self-Supervised Learning

    Authors: Aswin Sivaraman, Minje Kim

    Abstract: This work presents self-supervised learning methods for developing monaural speaker-specific (i.e., personalized) speech enhancement models. While generalist models must broadly address many speakers, specialist models can adapt their enhancement function towards a particular speaker's voice, expecting to solve a narrower problem. Hence, specialists are capable of achieving more optimal performanc…

    Submitted 27 July, 2022; v1 submitted 5 April, 2021; originally announced April 2021.

    Comments: 15 pages, 9 figures, published in IEEE JSTSP 2022

  8. arXiv:2102.04911  [pdf, other]

    cs.NI eess.SY

    The case for model-driven interpretability of delay-based congestion control protocols

    Authors: Muhammad Khan, Yasir Zaki, Shiva Iyer, Talal Ahamd, Thomas Pötsch, Jay Chen, Anirudh Sivaraman, Lakshmi Subramanian

    Abstract: Analyzing and interpreting the exact behavior of new delay-based congestion control protocols with complex non-linear control loops is exceptionally difficult in highly variable networks such as cellular networks. This paper proposes a Model-Driven Interpretability (MDI) congestion control framework, which derives a model version of a delay-based protocol by simplifying a congestion control protoc…

    Submitted 9 February, 2021; originally announced February 2021.

  9. arXiv:2011.03426   

    eess.AS cs.LG cs.SD

    Self-Supervised Learning from Contrastive Mixtures for Personalized Speech Enhancement

    Authors: Aswin Sivaraman, Minje Kim

    Abstract: This work explores how self-supervised learning can be universally used to discover speaker-specific features towards enabling personalized speech enhancement models. We specifically address the few-shot learning scenario where access to clean recordings of a test-time speaker is limited to a few seconds, but noisy recordings of the speaker are abundant. We develop a simple contrastive learning…

    Submitted 9 August, 2022; v1 submitted 6 November, 2020; originally announced November 2020.

    Comments: This work has been superseded by article 2104.02017

  10. arXiv:2005.08128  [pdf, other]

    eess.AS cs.LG cs.SD

    Sparse Mixture of Local Experts for Efficient Speech Enhancement

    Authors: Aswin Sivaraman, Minje Kim

    Abstract: In this paper, we investigate a deep learning approach for speech denoising through an efficient ensemble of specialist neural networks. By splitting up the speech denoising task into non-overlapping subproblems and introducing a classifier, we are able to improve denoising performance while also reducing computational complexity. More specifically, the proposed model incorporates a gating network…

    Submitted 16 May, 2020; originally announced May 2020.

    Comments: 5 pages, 5 figures

    Journal ref: Published in Interspeech 2020

  11. arXiv:1902.00956  [pdf, ps, other]

    cs.SD cs.LG eess.AS stat.ML

    Deep Autotuner: A Data-Driven Approach to Natural-Sounding Pitch Correction for Singing Voice in Karaoke Performances

    Authors: Sanna Wager, George Tzanetakis, Cheng-i Wang, Lijiang Guo, Aswin Sivaraman, Minje Kim

    Abstract: We describe a machine-learning approach to pitch correcting a solo singing performance in a karaoke setting, where the solo voice and accompaniment are on separate tracks. The proposed approach addresses the situation where no musical score of the vocals nor the accompaniment exists: It predicts the amount of correction from the relationship between the spectral contents of the vocal and accompani…

    Submitted 3 February, 2019; originally announced February 2019.

  12. arXiv:1805.02603  [pdf, ps, other]

    cs.SD eess.AS

    A Data-Driven Approach to Smooth Pitch Correction for Singing Voice in Pop Music

    Authors: Sanna Wager, Lijiang Guo, Aswin Sivaraman, Minje Kim

    Abstract: In this paper, we present a machine-learning approach to pitch correction for voice in a karaoke setting, where the vocals and accompaniment are on separate tracks and time-aligned. The network takes as input the time-frequency representation of the two tracks and predicts the amount of pitch-shifting in cents required to make the voice sound in-tune with the accompaniment. It is trained on exampl…

    Submitted 7 May, 2018; originally announced May 2018.

  13. arXiv:1801.09774  [pdf, other]

    cs.SD eess.AS

    On Psychoacoustically Weighted Cost Functions Towards Resource-Efficient Deep Neural Networks for Speech Denoising

    Authors: Kai Zhen, Aswin Sivaraman, Jongmo Sung, Minje Kim

    Abstract: We present a psychoacoustically enhanced cost function to balance network complexity and perceptual performance of deep neural networks for speech denoising. While training the network, we utilize perceptual weights added to the ordinary mean-squared error to emphasize contribution from frequency bins which are most audible while ignoring error from inaudible bins. To generate the weights, we empl…

    Submitted 29 January, 2018; originally announced January 2018.

    Comments: 5 pages, 4 figures