Showing 1–18 of 18 results for author: Ithapu, V K

  1. arXiv:2303.16024  [pdf, other]

    cs.CV cs.SD eess.AS

    Egocentric Auditory Attention Localization in Conversations

    Authors: Fiona Ryan, Hao Jiang, Abhinav Shukla, James M. Rehg, Vamsi Krishna Ithapu

    Abstract: In a noisy conversation environment such as a dinner party, people often exhibit selective auditory attention, or the ability to focus on a particular speaker while tuning out others. Recognizing who somebody is listening to in a conversation is essential for developing technologies that can understand social behavior and devices that can augment human hearing by amplifying particular sound source…

    Submitted 28 March, 2023; originally announced March 2023.

  2. arXiv:2301.08730  [pdf, other]

    cs.CV cs.SD eess.AS

    Novel-View Acoustic Synthesis

    Authors: Changan Chen, Alexander Richard, Roman Shapovalov, Vamsi Krishna Ithapu, Natalia Neverova, Kristen Grauman, Andrea Vedaldi

    Abstract: We introduce the novel-view acoustic synthesis (NVAS) task: given the sight and sound observed at a source viewpoint, can we synthesize the sound of that scene from an unseen target viewpoint? We propose a neural rendering approach: Visually-Guided Acoustic Synthesis (ViGAS) network that learns to synthesize the sound of an arbitrary point in space by analyzing the input audio-visual cues. To benc…

    Submitted 24 October, 2023; v1 submitted 20 January, 2023; originally announced January 2023.

    Comments: Accepted at CVPR 2023. Project page: https://vision.cs.utexas.edu/projects/nvas

  3. arXiv:2301.02184  [pdf, other]

    cs.CV cs.LG cs.SD eess.AS

    Chat2Map: Efficient Scene Mapping from Multi-Ego Conversations

    Authors: Sagnik Majumder, Hao Jiang, Pierre Moulon, Ethan Henderson, Paul Calamia, Kristen Grauman, Vamsi Krishna Ithapu

    Abstract: Can conversational videos captured from multiple egocentric viewpoints reveal the map of a scene in a cost-efficient way? We seek to answer this question by proposing a new problem: efficiently building the map of a previously unseen 3D environment by exploiting shared information in the egocentric audio-visual observations of participants in a natural conversation. Our hypothesis is that as multi…

    Submitted 20 April, 2023; v1 submitted 4 January, 2023; originally announced January 2023.

    Comments: Accepted to CVPR 2023

  4. arXiv:2211.10999  [pdf, other]

    cs.SD cs.CV cs.LG eess.AS

    LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders

    Authors: Rodrigo Mira, Buye Xu, Jacob Donley, Anurag Kumar, Stavros Petridis, Vamsi Krishna Ithapu, Maja Pantic

    Abstract: Audio-visual speech enhancement aims to extract clean speech from a noisy environment by leveraging not only the audio itself but also the target speaker's lip movements. This approach has been shown to yield improvements over audio-only speech enhancement, particularly for the removal of interfering speech. Despite recent advances in speech synthesis, most audio-visual approaches continue to use…

    Submitted 13 March, 2023; v1 submitted 20 November, 2022; originally announced November 2022.

    Comments: accepted to ICASSP 2023

  5. arXiv:2211.08624  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    Leveraging Heteroscedastic Uncertainty in Learning Complex Spectral Mapping for Single-channel Speech Enhancement

    Authors: Kuan-Lin Chen, Daniel D. E. Wong, Ke Tan, Buye Xu, Anurag Kumar, Vamsi Krishna Ithapu

    Abstract: Most speech enhancement (SE) models learn a point estimate and do not make use of uncertainty estimation in the learning process. In this paper, we show that modeling heteroscedastic uncertainty by minimizing a multivariate Gaussian negative log-likelihood (NLL) improves SE performance at no extra cost. During training, our approach augments a model learning complex spectral mapping with a tempora…

    Submitted 8 March, 2023; v1 submitted 15 November, 2022; originally announced November 2022.

    Comments: 5 pages. Accepted at ICASSP 2023

  6. arXiv:2211.04473  [pdf, other]

    cs.SD cs.AI eess.AS

    Towards Improved Room Impulse Response Estimation for Speech Recognition

    Authors: Anton Ratnarajah, Ishwarya Ananthabhotla, Vamsi Krishna Ithapu, Pablo Hoffmann, Dinesh Manocha, Paul Calamia

    Abstract: We propose a novel approach for blind room impulse response (RIR) estimation systems in the context of a downstream application scenario, far-field automatic speech recognition (ASR). We first draw the connection between improved RIR estimation and improved ASR performance, as a means of evaluating neural RIR estimators. We then propose a generative adversarial network (GAN) based architecture tha…

    Submitted 19 March, 2023; v1 submitted 7 November, 2022; originally announced November 2022.

    Comments: Accepted at ICASSP 2023. More results are available at https://anton-jeran.github.io/S2IR/

  7. arXiv:2206.12297  [pdf, other]

    eess.AS cs.SD

    SAQAM: Spatial Audio Quality Assessment Metric

    Authors: Pranay Manocha, Anurag Kumar, Buye Xu, Anjali Menon, Israel D. Gebru, Vamsi K. Ithapu, Paul Calamia

    Abstract: Audio quality assessment is critical for assessing the perceptual realism of sounds. However, the time and expense of obtaining "gold standard" human judgments limit the availability of such data. For AR&VR, good perceived sound quality and localizability of sources are among the key elements to ensure complete immersion of the user. Our work introduces SAQAM which uses a multi-task learning fra…

    Submitted 24 June, 2022; originally announced June 2022.

    Comments: To Appear, Interspeech 2022

  8. arXiv:2202.08862  [pdf, other]

    cs.SD cs.LG eess.AS

    RemixIT: Continual self-training of speech enhancement models via bootstrapped remixing

    Authors: Efthymios Tzinis, Yossi Adi, Vamsi Krishna Ithapu, Buye Xu, Paris Smaragdis, Anurag Kumar

    Abstract: We present RemixIT, a simple yet effective self-supervised method for training speech enhancement without the need of a single isolated in-domain speech nor a noise waveform. Our approach overcomes limitations of previous methods which make them dependent on clean in-domain target signals and thus, sensitive to any domain mismatch between train and test samples. RemixIT is based on a continuous se…

    Submitted 3 August, 2022; v1 submitted 17 February, 2022; originally announced February 2022.

    Comments: To appear in IEEE Journal of Selected Topics in Signal Processing

    Journal ref: J-STSP-SLSAP-00040-2022

  9. arXiv:2202.03416  [pdf, other]

    cs.SD cs.LG eess.AS

    Deep Impulse Responses: Estimating and Parameterizing Filters with Deep Networks

    Authors: Alexander Richard, Peter Dodds, Vamsi Krishna Ithapu

    Abstract: Impulse response estimation in high noise and in-the-wild settings, with minimal control of the underlying data distributions, is a challenging problem. We propose a novel framework for parameterizing and estimating impulse responses based on recent advances in neural representation learning. Our framework is driven by a carefully designed neural network that jointly estimates the impulse response…

    Submitted 7 February, 2022; originally announced February 2022.

  10. arXiv:2201.01928  [pdf, other]

    cs.CV cs.SD eess.AS

    Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization

    Authors: Hao Jiang, Calvin Murdock, Vamsi Krishna Ithapu

    Abstract: Augmented reality devices have the potential to enhance human perception and enable other assistive functionalities in complex conversational environments. Effectively capturing the audio-visual context necessary for understanding these social interactions first requires detecting and localizing the voice activities of the device wearer and the surrounding people. These tasks are challenging due t…

    Submitted 6 January, 2022; originally announced January 2022.

  11. Continual self-training with bootstrapped remixing for speech enhancement

    Authors: Efthymios Tzinis, Yossi Adi, Vamsi K. Ithapu, Buye Xu, Anurag Kumar

    Abstract: We propose RemixIT, a simple and novel self-supervised training method for speech enhancement. The proposed method is based on a continuously self-training scheme that overcomes limitations from previous studies including assumptions for the in-domain noise distribution and having access to clean target signals. Specifically, a separation teacher model is pre-trained on an out-of-domain dataset an…

    Submitted 29 January, 2022; v1 submitted 19 October, 2021; originally announced October 2021.

    Comments: To appear in Proc. ICASSP 2022, May 22-27, 2022, Singapore

    Journal ref: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  12. arXiv:2107.07503  [pdf, other]

    eess.AS cs.SD

    Filtered Noise Shaping for Time Domain Room Impulse Response Estimation From Reverberant Speech

    Authors: Christian J. Steinmetz, Vamsi Krishna Ithapu, Paul Calamia

    Abstract: Deep learning approaches have emerged that aim to transform an audio signal so that it sounds as if it was recorded in the same room as a reference recording, with applications both in audio post-production and augmented reality. In this work, we propose FiNS, a Filtered Noise Shaping network that directly estimates the time domain room impulse response (RIR) from reverberant speech. Our domain-in…

    Submitted 15 July, 2021; originally announced July 2021.

    Comments: Accepted to WASPAA 2021. See details at https://facebookresearch.github.io/FiNS/

  13. arXiv:2107.04174  [pdf, other]

    cs.SD cs.CV cs.LG eess.AS eess.SP

    EasyCom: An Augmented Reality Dataset to Support Algorithms for Easy Communication in Noisy Environments

    Authors: Jacob Donley, Vladimir Tourbabin, Jung-Suk Lee, Mark Broyles, Hao Jiang, Jie Shen, Maja Pantic, Vamsi Krishna Ithapu, Ravish Mehra

    Abstract: Augmented Reality (AR) as a platform has the potential to facilitate the reduction of the cocktail party effect. Future AR headsets could potentially leverage information from an array of sensors spanning many different modalities. Training and testing signal processing and machine learning algorithms on tasks such as beam-forming and speech enhancement require high quality representative data. To…

    Submitted 18 October, 2021; v1 submitted 8 July, 2021; originally announced July 2021.

    Comments: Dataset is available at: https://github.com/facebookresearch/EasyComDataset

  14. arXiv:2106.11335  [pdf, other]

    cs.SD cs.AI eess.AS

    Do sound event representations generalize to other audio tasks? A case study in audio transfer learning

    Authors: Anurag Kumar, Yun Wang, Vamsi Krishna Ithapu, Christian Fuegen

    Abstract: Transfer learning is critical for efficient information transfer across multiple related learning problems. A simple, yet effective transfer learning approach utilizes deep neural networks trained on a large-scale task for feature extraction. Such representations are then used to learn related downstream tasks. In this paper, we investigate transfer learning capacity of audio representations obtai…

    Submitted 21 June, 2021; originally announced June 2021.

    Comments: Accepted Interspeech 2021

  15. DPLM: A Deep Perceptual Spatial-Audio Localization Metric

    Authors: Pranay Manocha, Anurag Kumar, Buye Xu, Anjali Menon, Israel D. Gebru, Vamsi K. Ithapu, Paul Calamia

    Abstract: Subjective evaluations are critical for assessing the perceptual realism of sounds in audio-synthesis driven technologies like augmented and virtual reality. However, they are challenging to set up, fatiguing for users, and expensive. In this work, we tackle the problem of capturing the perceptual characteristics of localizing sounds. Specifically, we propose a framework for building a general pur…

    Submitted 28 May, 2021; originally announced May 2021.

  16. arXiv:2007.00144  [pdf, other]

    cs.SD cs.LG eess.AS

    A Sequential Self Teaching Approach for Improving Generalization in Sound Event Recognition

    Authors: Anurag Kumar, Vamsi Krishna Ithapu

    Abstract: An important problem in machine auditory perception is to recognize and detect sound events. In this paper, we propose a sequential self-teaching approach to learning sounds. Our main proposition is that it is harder to learn sounds in adverse situations such as from weakly labeled and/or noisy labeled data, and in these situations a single stage of learning is not sufficient. Our proposal is a se…

    Submitted 30 June, 2020; originally announced July 2020.

    Comments: Accepted at International Conference on Machine Learning (ICML) 2020. 14 pages

  17. arXiv:1912.11474  [pdf, other]

    cs.CV cs.HC cs.SD eess.AS

    SoundSpaces: Audio-Visual Navigation in 3D Environments

    Authors: Changan Chen, Unnat Jain, Carl Schissler, Sebastia Vicenc Amengual Gari, Ziad Al-Halah, Vamsi Krishna Ithapu, Philip Robinson, Kristen Grauman

    Abstract: Moving around in the world is naturally a multisensory experience, but today's embodied agents are deaf, restricted to solely their visual perception of the environment. We introduce audio-visual navigation for complex, acoustically and visually realistic 3D environments. By both seeing and hearing, the agent must learn to navigate to a sounding object. We propose a multi-modal deep reinforcement…

    Submitted 21 August, 2020; v1 submitted 24 December, 2019; originally announced December 2019.

    Comments: Accepted to ECCV 2020 (Spotlight). Project page: http://vision.cs.utexas.edu/projects/audio_visual_navigation/

  18. Secost: Sequential co-supervision for large scale weakly labeled audio event detection

    Authors: Anurag Kumar, Vamsi Krishna Ithapu

    Abstract: Weakly supervised learning algorithms are critical for scaling audio event detection to several hundreds of sound categories. Such learning models should not only disambiguate sound events efficiently with minimal class-specific annotation but also be robust to label noise, which is more apparent with weak labels instead of strong annotations. In this work, we propose a new framework for designing…

    Submitted 4 May, 2020; v1 submitted 25 October, 2019; originally announced October 2019.

    Comments: Accepted IEEE ICASSP 2020