
Showing 1–50 of 95 results for author: Anurag

Searching in archive eess.
  1. arXiv:2406.20005  [pdf, other]

    eess.IV cs.CV

    Malaria Cell Detection Using Deep Neural Networks

    Authors: Saurabh Sawant, Anurag Singh

    Abstract: Malaria remains one of the most pressing public health concerns globally, causing significant morbidity and mortality, especially in sub-Saharan Africa. Rapid and accurate diagnosis is crucial for effective treatment and disease management. Traditional diagnostic methods, such as microscopic examination of blood smears, are labor-intensive and require significant expertise, which may not be readil…

    Submitted 28 June, 2024; originally announced June 2024.

  2. arXiv:2406.17124  [pdf, other]

    cs.SD cs.LG eess.AS

    Investigating Confidence Estimation Measures for Speaker Diarization

    Authors: Anurag Chowdhury, Abhinav Misra, Mark C. Fuhs, Monika Woszczyna

    Abstract: Speaker diarization systems segment a conversation recording based on the speakers' identity. Such systems can misclassify the speaker of a portion of audio due to a variety of factors, such as speech pattern variation, background noise, and overlapping speech. These errors propagate to, and can adversely affect, downstream systems that rely on the speaker's identity, such as speaker-adapted speec…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: Accepted in INTERSPEECH 2024

  3. arXiv:2406.15117  [pdf, other]

    eess.IV cs.AI cs.CV

    FA-Net: A Fuzzy Attention-aided Deep Neural Network for Pneumonia Detection in Chest X-Rays

    Authors: Ayush Roy, Anurag Bhattacharjee, Diego Oliva, Oscar Ramos-Soto, Francisco J. Alvarez-Padilla, Ram Sarkar

    Abstract: Pneumonia is a respiratory infection caused by bacteria, fungi, or viruses. It affects many people, particularly those in developing or underdeveloped nations with high pollution levels, unhygienic living conditions, overcrowding, and insufficient medical infrastructure. Pneumonia can cause pleural effusion, where fluids fill the lungs, leading to respiratory difficulty. Early diagnosis is crucial…

    Submitted 21 June, 2024; originally announced June 2024.

  4. arXiv:2406.14861  [pdf, other]

    eess.SY cs.ET

    Resilience of the Electric Grid through Trustable IoT-Coordinated Assets

    Authors: Vineet J. Nair, Venkatesh Venkataramanan, Priyank Srivastava, Partha S. Sarker, Anurag Srivastava, Laurentiu D. Marinovici, Jun Zha, Christopher Irwin, Prateek Mittal, John Williams, H. Vincent Poor, Anuradha M. Annaswamy

    Abstract: The electricity grid has evolved from a physical system to a cyber-physical system with digital devices that perform measurement, control, communication, computation, and actuation. The increased penetration of distributed energy resources (DERs) that include renewable generation, flexible loads, and storage provides extraordinary opportunities for improvements in efficiency and sustainability. Ho…

    Submitted 21 June, 2024; originally announced June 2024.

    Comments: Submitted to the Proceedings of the National Academy of Sciences (PNAS), under review

  5. arXiv:2406.11619  [pdf, other]

    eess.AS cs.LG

    AV-CrossNet: an Audiovisual Complex Spectral Mapping Network for Speech Separation By Leveraging Narrow- and Cross-Band Modeling

    Authors: Vahid Ahmadi Kalkhorani, Cheng Yu, Anurag Kumar, Ke Tan, Buye Xu, DeLiang Wang

    Abstract: Adding visual cues to audio-based speech separation can improve separation performance. This paper introduces AV-CrossNet, an audiovisual (AV) system for speech enhancement, target speaker extraction, and multi-talker speaker separation. AV-CrossNet is extended from the CrossNet architecture, which is a recently proposed network that performs complex spectral mapping for speech separation by lever…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: 10 pages, 4 Figures, and 4 Tables

  6. arXiv:2406.08914  [pdf, other]

    cs.SD cs.LG eess.AS

    Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition

    Authors: William Ravenscroft, George Close, Stefan Goetze, Thomas Hain, Mohammad Soleymanpour, Anurag Chowdhury, Mark C. Fuhs

    Abstract: One solution to automatic speech recognition (ASR) of overlapping speakers is to separate speech and then perform ASR on the separated signals. Commonly, the separator produces artefacts which often degrade ASR performance. Addressing this issue typically requires reference transcriptions to jointly train the separation and ASR networks. This is often not viable for training on real-world in-domai…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: 5 pages, 3 Figures, 3 Tables, Accepted for Interspeech 2024

  7. arXiv:2406.04660  [pdf, other]

    eess.AS cs.SD

    URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement

    Authors: Wangyou Zhang, Robin Scheibler, Kohei Saijo, Samuele Cornell, Chenda Li, Zhaoheng Ni, Anurag Kumar, Jan Pirklbauer, Marvin Sach, Shinji Watanabe, Tim Fingscheidt, Yanmin Qian

    Abstract: The last decade has witnessed significant advancements in deep learning-based speech enhancement (SE). However, most existing SE research has limitations on the coverage of SE sub-tasks, data diversity and amount, and evaluation metrics. To fill this gap and promote research toward universal SE, we establish a new SE challenge, named URGENT, to focus on the universality, robustness, and generaliza…

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: 6 pages, 3 figures, 3 tables. Accepted by Interspeech 2024. An extended version of the accepted manuscript with appendix

  8. arXiv:2405.20402  [pdf, other]

    eess.AS cs.SD eess.SP

    Cross-Talk Reduction

    Authors: Zhong-Qiu Wang, Anurag Kumar, Shinji Watanabe

    Abstract: While far-field multi-talker mixtures are recorded, each speaker can wear a close-talk microphone so that close-talk mixtures can be recorded at the same time. Although each close-talk mixture has a high signal-to-noise ratio (SNR) of the wearer, it has a very limited range of applications, as it also contains significant cross-talk speech by other speakers and is not clean enough. In this context…

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: in International Joint Conference on Artificial Intelligence (IJCAI), 2024

  9. arXiv:2405.01040  [pdf, other]

    cs.CV cs.CL eess.IV

    Few Shot Class Incremental Learning using Vision-Language models

    Authors: Anurag Kumar, Chinmay Bharti, Saikat Dutta, Srikrishna Karanam, Biplab Banerjee

    Abstract: Recent advancements in deep learning have demonstrated remarkable performance comparable to human capabilities across various supervised computer vision tasks. However, the prevalent assumption of having an extensive pool of training data encompassing all classes prior to model training often diverges from real-world scenarios, where limited data availability for novel classes is the norm. The cha…

    Submitted 2 May, 2024; originally announced May 2024.

    Comments: under review at Pattern Recognition Letters

  10. arXiv:2404.15009  [pdf, other]

    cs.CV eess.IV

    The Brain Tumor Segmentation in Pediatrics (BraTS-PEDs) Challenge: Focus on Pediatrics (CBTN-CONNECT-DIPGR-ASNR-MICCAI BraTS-PEDs)

    Authors: Anahita Fathi Kazerooni, Nastaran Khalili, Deep Gandhi, Xinyang Liu, Zhifan Jiang, Syed Muhammed Anwar, Jake Albrecht, Maruf Adewole, Udunna Anazodo, Hannah Anderson, Sina Bagheri, Ujjwal Baid, Timothy Bergquist, Austin J. Borja, Evan Calabrese, Verena Chung, Gian-Marco Conte, Farouk Dako, James Eddy, Ivan Ezhov, Ariana Familiar, Keyvan Farahani, Anurag Gottipati, Debanjan Haldar, Shuvanjan Haldar, et al. (51 additional authors not shown)

    Abstract: Pediatric tumors of the central nervous system are the most common cause of cancer-related death in children. The five-year survival rate for high-grade gliomas in children is less than 20%. Due to their rarity, the diagnosis of these entities is often delayed, their treatment is mainly based on historic treatment concepts, and clinical trials require multi-institutional collaborations. Here we pr…

    Submitted 29 April, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

    Comments: arXiv admin note: substantial text overlap with arXiv:2305.17033

  11. arXiv:2404.14729  [pdf, other]

    eess.SY

    Emergent Cooperation for Energy-efficient Connectivity via Wireless Power Transfer

    Authors: Winston Hurst, Anurag Pallaprolu, Yasamin Mostofi

    Abstract: This paper addresses the challenge of incentivizing energy-constrained, non-cooperative user equipment (UE) to serve as cooperative relays. We consider a source UE with a non-line-of-sight channel to an access point (AP), where direct communication may be infeasible or may necessitate a substantial transmit power. Other UEs in the vicinity are viewed as relay candidates, and our aim is to enable e…

    Submitted 23 April, 2024; originally announced April 2024.

  12. arXiv:2403.18821  [pdf, other]

    cs.SD cs.CV cs.MM eess.AS

    Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark

    Authors: Ziyang Chen, Israel D. Gebru, Christian Richardt, Anurag Kumar, William Laney, Andrew Owens, Alexander Richard

    Abstract: We present a new dataset called Real Acoustic Fields (RAF) that captures real acoustic room data from multiple modalities. The dataset includes high-quality and densely captured room impulse response data paired with multi-view images, and precise 6DoF pose tracking data for sound emitters and listeners in the rooms. We used this dataset to evaluate existing methods for novel-view acoustic synthes…

    Submitted 27 March, 2024; originally announced March 2024.

    Comments: Accepted to CVPR 2024. Project site: https://facebookresearch.github.io/real-acoustic-fields/

  13. arXiv:2403.01369  [pdf, other]

    eess.AS cs.AI cs.LG

    A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech Enhancement

    Authors: Ravi Shankar, Ke Tan, Buye Xu, Anurag Kumar

    Abstract: Self-supervised learned models have been found to be very effective for certain speech tasks such as automatic speech recognition, speaker identification, keyword spotting and others. While the features are undeniably useful in speech recognition and associated tasks, their utility in speech enhancement systems is yet to be firmly established, and perhaps not properly understood. In this paper, we…

    Submitted 2 March, 2024; originally announced March 2024.

    Comments: 8 pages; Shorter form accepted in ICASSP 2024

  14. arXiv:2402.18968  [pdf, other]

    eess.AS cs.SD

    Ambisonics Networks -- The Effect Of Radial Functions Regularization

    Authors: Bar Shaybet, Anurag Kumar, Vladimir Tourbabin, Boaz Rafaely

    Abstract: Ambisonics, a popular format of spatial audio, is the spherical harmonic (SH) representation of the plane wave density function of a sound field. Many algorithms operate in the SH domain and utilize the Ambisonics as their input signal. The process of encoding Ambisonics from a spherical microphone array involves dividing by the radial functions, which may amplify noise at low frequencies. This ca…

    Submitted 29 February, 2024; originally announced February 2024.

    Comments: to be published in ICASSP 2024

  15. arXiv:2401.06148  [pdf, other]

    eess.IV cs.AI cs.CV q-bio.QM

    Artificial Intelligence for Digital and Computational Pathology

    Authors: Andrew H. Song, Guillaume Jaume, Drew F. K. Williamson, Ming Y. Lu, Anurag Vaidya, Tiffany R. Miller, Faisal Mahmood

    Abstract: Advances in digitizing tissue slides and the fast-paced progress in artificial intelligence, including deep learning, have boosted the field of computational pathology. This field holds tremendous potential to automate clinical diagnosis, predict patient prognosis and response to therapy, and discover new morphological biomarkers from tissue images. Some of these artificial intelligence-based syst…

    Submitted 12 December, 2023; originally announced January 2024.

    Journal ref: Nature Reviews Bioengineering 2023

  16. arXiv:2311.18168  [pdf, other]

    cs.CV cs.LG eess.AS

    Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications

    Authors: Karren D. Yang, Anurag Ranjan, Jen-Hao Rick Chang, Raviteja Vemulapalli, Oncel Tuzel

    Abstract: We consider the task of animating 3D facial geometry from speech signal. Existing works are primarily deterministic, focusing on learning a one-to-one mapping from speech signal to 3D face meshes on small datasets with limited speakers. While these models can achieve high-quality lip articulation for speakers in the training set, they are unable to capture the full and diverse distribution of 3D f…

    Submitted 29 November, 2023; originally announced November 2023.

  17. arXiv:2311.12585  [pdf]

    eess.SY

    An IoT-based Smart Parking System

    Authors: Ridhi Choudhary, Arnav Sanjay Sinha, Krishna Jaiswal, Anurag Chandra

    Abstract: The number of vehicles on the road is growing every day, thus there's a growing need to develop effective and hassle-free parking systems. Finding a parking space may be a big challenge, especially in crowded cities or areas with scheduled sporting or cultural events. The project suggests an automated parking system that makes use of technology like sensor systems and microcontrollers. In order to…

    Submitted 21 November, 2023; originally announced November 2023.

    Comments: 3 pages

  18. arXiv:2310.18820  [pdf, other]

    cs.CR cs.IT eess.SY

    Demand-Side Threats to Power Grid Operations from IoT-Enabled Edge

    Authors: Subhash Lakshminarayana, Carsten Maple, Andrew Larkins, Daryl Flack, Christopher Few, Anurag K. Srivastava

    Abstract: The growing adoption of Internet-of-Things (IoT)-enabled energy smart appliances (ESAs) at the consumer end, such as smart heat pumps, electric vehicle chargers, etc., is seen as key to enabling demand-side response (DSR) services. However, these smart appliances are often poorly engineered from a security point of view and present a new threat to power grid operations. They may become convenient…

    Submitted 28 October, 2023; originally announced October 2023.

  19. arXiv:2310.17864  [pdf, other]

    eess.AS cs.SD

    TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch

    Authors: Jeff Hwang, Moto Hira, Caroline Chen, Xiaohui Zhang, Zhaoheng Ni, Guangzhi Sun, Pingchuan Ma, Ruizhe Huang, Vineel Pratap, Yuekai Zhang, Anurag Kumar, Chin-Yun Yu, Chuang Zhu, Chunxi Liu, Jacob Kahn, Mirco Ravanelli, Peng Sun, Shinji Watanabe, Yangyang Shi, Yumeng Tao, Robin Scheibler, Samuele Cornell, Sean Kim, Stavros Petridis

    Abstract: TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims to accelerate the research and development of audio and speech technologies by providing well-designed, easy-to-use, and performant PyTorch components. Its contributors routinely engage with users to understand their needs and fulfill them by developing impactful features. Here, we survey TorchAudio's devel…

    Submitted 26 October, 2023; originally announced October 2023.

  20. arXiv:2310.15130  [pdf, other]

    cs.SD cs.CV eess.AS

    Novel-View Acoustic Synthesis from 3D Reconstructed Rooms

    Authors: Byeongjoo Ahn, Karren Yang, Brian Hamilton, Jonathan Sheaffer, Anurag Ranjan, Miguel Sarabia, Oncel Tuzel, Jen-Hao Rick Chang

    Abstract: We investigate the benefit of combining blind audio recordings with 3D scene information for novel-view acoustic synthesis. Given audio recordings from 2-4 microphones and the 3D geometry and material of a scene containing multiple unknown sound sources, we estimate the sound anywhere in the scene. We identify the main challenges of novel-view acoustic synthesis as sound source localization, separ…

    Submitted 23 October, 2023; originally announced October 2023.

  21. arXiv:2310.08718  [pdf, other]

    eess.SP

    A Framework for Developing and Evaluating Algorithms for Estimating Multipath Propagation Parameters from Channel Sounder Measurements

    Authors: Akbar Sayeed, Damla Guven, Michael Doebereiner, Sebastian Semper, Camillo Gentile, Anuraag Bodi, Zihang Cheng

    Abstract: A framework is proposed for developing and evaluating algorithms for extracting multipath propagation components (MPCs) from measurements collected by channel sounders at millimeter-wave frequencies. Sounders equipped with an omnidirectional transmitter and a receiver with a uniform planar array (UPA) are considered. An accurate mathematical model is developed for the spatial frequency response of…

    Submitted 12 October, 2023; originally announced October 2023.

    Comments: 17 pages

  22. arXiv:2310.04467  [pdf, other]

    cs.LG cs.AI eess.SY

    Design Principles for Lifelong Learning AI Accelerators

    Authors: Dhireesha Kudithipudi, Anurag Daram, Abdullah M. Zyarah, Fatima Tuz Zohora, James B. Aimone, Angel Yanguas-Gil, Nicholas Soures, Emre Neftci, Matthew Mattina, Vincenzo Lomonaco, Clare D. Thiem, Benjamin Epstein

    Abstract: Lifelong learning - an agent's ability to learn throughout its lifetime - is a hallmark of biological learning systems and a central challenge for artificial intelligence (AI). The development of lifelong learning algorithms could lead to a range of novel AI applications, but this will also require the development of appropriate hardware accelerators, particularly if the models are to be deployed…

    Submitted 5 October, 2023; originally announced October 2023.

  23. arXiv:2309.15977  [pdf, other]

    cs.SD cs.CV eess.AS

    Neural Acoustic Context Field: Rendering Realistic Room Impulse Response With Neural Fields

    Authors: Susan Liang, Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu

    Abstract: Room impulse response (RIR), which measures the sound propagation within an environment, is critical for synthesizing high-fidelity audio for a given environment. Some prior work has proposed representing RIR as a neural field function of the sound emitter and receiver positions. However, these methods do not sufficiently consider the acoustic properties of an audio scene, leading to unsatisfactor…

    Submitted 27 September, 2023; originally announced September 2023.

  24. arXiv:2309.10788  [pdf, other]

    eess.SY

    Physics-Informed Machine Learning for Data Anomaly Detection, Classification, Localization, and Mitigation: A Review, Challenges, and Path Forward

    Authors: Mehdi Jabbari Zideh, Paroma Chatterjee, Anurag K. Srivastava

    Abstract: Advancements in digital automation for smart grids have led to the installation of measurement devices like phasor measurement units (PMUs), micro-PMUs ($μ$-PMUs), and smart meters. However, a large amount of data collected by these devices brings several challenges as control room operators need to use this data with models to make confident decisions for reliable and resilient operation of the c…

    Submitted 19 September, 2023; originally announced September 2023.

  25. arXiv:2309.02404  [pdf, other]

    cs.SD cs.CV eess.AS

    Voice Morphing: Two Identities in One Voice

    Authors: Sushanta K. Pani, Anurag Chowdhury, Morgan Sandler, Arun Ross

    Abstract: In a biometric system, each biometric sample or template is typically associated with a single identity. However, recent research has demonstrated the possibility of generating "morph" biometric samples that can successfully match more than a single identity. Morph attacks are now recognized as a potential security threat to biometric systems. However, most morph attacks have been studied on biome…

    Submitted 5 September, 2023; originally announced September 2023.

    Comments: Accepted oral paper at BIOSIG 2023

  26. arXiv:2308.00122  [pdf, other]

    cs.CV cs.SD eess.AS

    DAVIS: High-Quality Audio-Visual Separation with Generative Diffusion Models

    Authors: Chao Huang, Susan Liang, Yapeng Tian, Anurag Kumar, Chenliang Xu

    Abstract: We propose DAVIS, a Diffusion model-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through a generative manner. While existing discriminative methods that perform mask regression have made remarkable progress in this field, they face limitations in capturing the complex data distribution required for high-quality separation of sounds from diverse…

    Submitted 31 July, 2023; originally announced August 2023.

  27. Multi-agent Deep Reinforcement Learning for Distributed Load Restoration

    Authors: Linh Vu, Tuyen Vu, Thanh-Long Vu, Anurag Srivastava

    Abstract: This paper addresses the load restoration problem after power outage events. Our primary proposed methodology is using multi-agent deep reinforcement learning to optimize the load restoration process in distribution systems, modeled as networked microgrids, via determining the optimal operational sequence of circuit breakers (switches). An innovative invalid action masking technique is incorporate…

    Submitted 24 June, 2023; originally announced June 2023.

    Comments: 12 pages, 19 figures, journal under review

  28. arXiv:2305.05479  [pdf, other]

    cs.CR cs.DC eess.SP eess.SY

    Multiple-stopping time Sequential Detection for Energy Efficient Mining in Blockchain-Enabled IoT

    Authors: Anurag Gupta, Vikram Krishnamurthy

    Abstract: What are the optimal times for an Internet of Things (IoT) device to act as a blockchain miner? The aim is to minimize the energy consumed by low-power IoT devices that log their data into a secure (tamper-proof) distributed ledger. We formulate a multiple stopping time Bayesian sequential detection problem to address energy-efficient blockchain mining for IoT devices. The objective is to identify…

    Submitted 17 August, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

  29. arXiv:2304.01448  [pdf, other]

    eess.AS

    TorchAudio-Squim: Reference-less Speech Quality and Intelligibility measures in TorchAudio

    Authors: Anurag Kumar, Ke Tan, Zhaoheng Ni, Pranay Manocha, Xiaohui Zhang, Ethan Henderson, Buye Xu

    Abstract: Measuring quality and intelligibility of a speech signal is usually a critical step in development of speech processing systems. To enable this, a variety of metrics to measure quality and intelligibility under different assumptions have been developed. Through this paper, we introduce tools and a set of models to estimate such known metrics using deep neural networks. These models are made availa…

    Submitted 3 April, 2023; originally announced April 2023.

    Comments: ICASSP 2023

  30. arXiv:2303.13471  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    Egocentric Audio-Visual Object Localization

    Authors: Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu

    Abstract: Humans naturally perceive surrounding scenes by unifying sound and sight in a first-person view. Likewise, machines are advanced to approach human intelligence by learning with multisensory inputs from an egocentric perspective. In this paper, we explore the challenging egocentric audio-visual object localization task and observe that 1) egomotion commonly exists in first-person recordings, even w…

    Submitted 23 March, 2023; originally announced March 2023.

    Comments: Accepted by CVPR 2023

  31. arXiv:2302.11753  [pdf]

    eess.SY

    Securely implementing and managing neighborhood solar with storage and peer to peer transactive energy

    Authors: Steven Knudsen, Subir Majumder, Anurag K. Srivastava

    Abstract: In this paper, we aim to leverage peer to peer (P2P) transactive energy framework for optimal control of rooftop or neighborhood solar power with battery electric storage systems (BESS). Here we propose that the multiple neighboring customers would interconnect to form a community DC grid while still being connected to the main utility grid. The proposed infrastructure would (i) increase the resil…

    Submitted 22 February, 2023; originally announced February 2023.

    Comments: in Cigre Annual Meeting, Paris, Aug. 2022

  32. arXiv:2302.08095  [pdf, other]

    cs.SD cs.CL eess.AS

    PAAPLoss: A Phonetic-Aligned Acoustic Parameter Loss for Speech Enhancement

    Authors: Muqiao Yang, Joseph Konan, David Bick, Yunyang Zeng, Shuo Han, Anurag Kumar, Shinji Watanabe, Bhiksha Raj

    Abstract: Despite rapid advancement in recent years, current speech enhancement models often produce speech that differs in perceptual quality from real clean speech. We propose a learning objective that formalizes differences in perceptual quality, by using domain knowledge of acoustic-phonetics. We identify temporal acoustic parameters -- such as spectral tilt, spectral flux, shimmer, etc. -- that are non…

    Submitted 16 February, 2023; originally announced February 2023.

    Comments: Accepted at ICASSP 2023

  33. arXiv:2302.08088  [pdf, other]

    cs.CL cs.SD eess.AS

    TAPLoss: A Temporal Acoustic Parameter Loss for Speech Enhancement

    Authors: Yunyang Zeng, Joseph Konan, Shuo Han, David Bick, Muqiao Yang, Anurag Kumar, Shinji Watanabe, Bhiksha Raj

    Abstract: Speech enhancement models have greatly progressed in recent years, but still show limits in perceptual quality of their speech outputs. We propose an objective for perceptual quality based on temporal acoustic parameters. These are fundamental speech features that play an essential role in various applications, including speaker recognition and paralinguistic analysis. We provide a differentiable…

    Submitted 15 February, 2023; originally announced February 2023.

    Comments: Accepted at ICASSP 2023

  34. arXiv:2302.02088  [pdf, other]

    cs.CV cs.GR cs.SD eess.AS

    AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis

    Authors: Susan Liang, Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu

    Abstract: Can machines recording an audio-visual scene produce realistic, matching audio-visual experiences at novel positions and novel view directions? We answer it by studying a new task -- real-world audio-visual scene synthesis -- and a first-of-its-kind NeRF-based approach for multimodal learning. Concretely, given a video recording of an audio-visual scene, the task is to synthesize new videos with s…

    Submitted 16 October, 2023; v1 submitted 3 February, 2023; originally announced February 2023.

    Comments: NeurIPS 2023

  35. arXiv:2301.04320  [pdf, other]

    cs.SD cs.LG eess.AS

    Rethinking complex-valued deep neural networks for monaural speech enhancement

    Authors: Haibin Wu, Ke Tan, Buye Xu, Anurag Kumar, Daniel Wong

    Abstract: Despite multiple efforts made towards adopting complex-valued deep neural networks (DNNs), it remains an open question whether complex-valued DNNs are generally more effective than real-valued DNNs for monaural speech enhancement. This work is devoted to presenting a critical assessment by systematically examining complex-valued DNNs against their real-valued counterparts. Specifically, we investi…

    Submitted 11 January, 2023; originally announced January 2023.

  36. arXiv:2211.10999  [pdf, other]

    cs.SD cs.CV cs.LG eess.AS

    LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders

    Authors: Rodrigo Mira, Buye Xu, Jacob Donley, Anurag Kumar, Stavros Petridis, Vamsi Krishna Ithapu, Maja Pantic

    Abstract: Audio-visual speech enhancement aims to extract clean speech from a noisy environment by leveraging not only the audio itself but also the target speaker's lip movements. This approach has been shown to yield improvements over audio-only speech enhancement, particularly for the removal of interfering speech. Despite recent advances in speech synthesis, most audio-visual approaches continue to use…

    Submitted 13 March, 2023; v1 submitted 20 November, 2022; originally announced November 2022.

    Comments: accepted to ICASSP 2023

  37. arXiv:2211.08624  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    Leveraging Heteroscedastic Uncertainty in Learning Complex Spectral Mapping for Single-channel Speech Enhancement

    Authors: Kuan-Lin Chen, Daniel D. E. Wong, Ke Tan, Buye Xu, Anurag Kumar, Vamsi Krishna Ithapu

    Abstract: Most speech enhancement (SE) models learn a point estimate and do not make use of uncertainty estimation in the learning process. In this paper, we show that modeling heteroscedastic uncertainty by minimizing a multivariate Gaussian negative log-likelihood (NLL) improves SE performance at no extra cost. During training, our approach augments a model learning complex spectral mapping with a tempora…

    Submitted 8 March, 2023; v1 submitted 15 November, 2022; originally announced November 2022.

    Comments: 5 pages. Accepted at ICASSP 2023

  38. arXiv:2210.14800  [pdf, other]

    eess.AS cs.HC cs.SD

    Naturalistic Head Motion Generation from Speech

    Authors: Trisha Mittal, Zakaria Aldeneh, Masha Fedzechkina, Anurag Ranjan, Barry-John Theobald

    Abstract: Synthesizing natural head motion to accompany speech for an embodied conversational agent is necessary for providing a rich interactive experience. Most prior works assess the quality of generated head motion by comparing them against a single ground-truth using an objective metric. Yet there are many plausible head motion sequences to accompany a speech utterance. In this work, we study the varia…

    Submitted 26 October, 2022; originally announced October 2022.

    Comments: Submitted to ICASSP 2023

  39. arXiv:2209.10043  [pdf, other]

    cs.LG cs.AI eess.IV q-bio.QM

    SynthA1c: Towards Clinically Interpretable Patient Representations for Diabetes Risk Stratification

    Authors: Michael S. Yao, Allison Chae, Matthew T. MacLean, Anurag Verma, Jeffrey Duda, James Gee, Drew A. Torigian, Daniel Rader, Charles Kahn, Walter R. Witschey, Hersh Sagreiya

    Abstract: Early diagnosis of Type 2 Diabetes Mellitus (T2DM) is crucial to enable timely therapeutic interventions and lifestyle modifications. As the time available for clinical office visits shortens and medical imaging data become more widely available, patient image data could be used to opportunistically identify patients for additional T2DM diagnostic workup by physicians. We investigated whether imag…

    Submitted 27 July, 2023; v1 submitted 20 September, 2022; originally announced September 2022.

    Comments: 12 pages. Accepted to PRIME MICCAI 2023

  40. arXiv:2209.03715  [pdf, other]

    eess.SY

    Optimization-based framework for low-voltage grid reinforcement assessment under various levels of flexibility and coordination

    Authors: Soner Candas, Beneharo Reveron Baecker, Anurag Mohapatra, Thomas Hamacher

    Abstract: The rapid electrification of residential heating and mobility sectors is expected to drive the existing distribution grid assets beyond their planned operating conditions. This change will also reveal new potentials through sector coupling, flexibilities, and the local exchange of decentralized generation. This paper thus presents an optimization framework for multi-modal energy systems at the low…

    Submitted 19 April, 2023; v1 submitted 8 September, 2022; originally announced September 2022.

  41. arXiv:2207.00237  [pdf, other]

    cs.SD cs.LG eess.AS

    Improving Speech Enhancement through Fine-Grained Speech Characteristics

    Authors: Muqiao Yang, Joseph Konan, David Bick, Anurag Kumar, Shinji Watanabe, Bhiksha Raj

    Abstract: While deep learning based speech enhancement systems have made rapid progress in improving the quality of speech signals, they can still produce outputs that contain artifacts and can sound unnatural. We propose a novel approach to speech enhancement aimed at improving perceptual quality and naturalness of enhanced signals by optimizing for key characteristics of speech. We first identify key acou…

    Submitted 11 July, 2022; v1 submitted 1 July, 2022; originally announced July 2022.

    Comments: Accepted at InterSpeech 2022

  42. arXiv:2206.12297  [pdf, other]

    eess.AS cs.SD

    SAQAM: Spatial Audio Quality Assessment Metric

    Authors: Pranay Manocha, Anurag Kumar, Buye Xu, Anjali Menon, Israel D. Gebru, Vamsi K. Ithapu, Paul Calamia

    Abstract: Audio quality assessment is critical for assessing the perceptual realism of sounds. However, the time and expense of obtaining "gold standard" human judgments limit the availability of such data. For AR&VR, good perceived sound quality and localizability of sources are among the key elements to ensure complete immersion of the user. Our work introduces SAQAM which uses a multi-task learning fra…

    Submitted 24 June, 2022; originally announced June 2022.

    Comments: To Appear, Interspeech 2022

  43. arXiv:2206.12285  [pdf, other]

    eess.AS cs.SD

    Speech Quality Assessment through MOS using Non-Matching References

    Authors: Pranay Manocha, Anurag Kumar

    Abstract: Human judgments obtained through Mean Opinion Scores (MOS) are the most reliable way to assess the quality of speech signals. However, several recent attempts to automatically estimate MOS using deep learning approaches lack robustness and generalization capabilities, limiting their use in real-world applications. In this work, we present a novel framework, NORESQA-MOS, for estimating the MOS of a…

    Submitted 24 June, 2022; originally announced June 2022.

    Comments: To Appear, Interspeech 2022

  44. arXiv:2203.05643  [pdf, ps, other]

    cs.NI eess.SY

    Controlling Transaction Rate in Tangle Ledger: A Principal Agent Problem Approach

    Authors: Anurag Gupta, Vikram Krishnamurthy

    Abstract: Tangle is a distributed ledger technology that stores data as a directed acyclic graph (DAG). Unlike blockchain, Tangle does not require dedicated miners for its operation; this makes Tangle suitable for Internet of Things (IoT) applications. Distributed ledgers have a built-in transaction rate control mechanism to prevent congestion and spamming; this is typically achieved by increasing or decrea…

    Submitted 18 April, 2023; v1 submitted 10 March, 2022; originally announced March 2022.

  45. arXiv:2202.08883  [pdf, other]

    eess.AS cs.LG cs.SD

    Curriculum optimization for low-resource speech recognition

    Authors: Anastasia Kuznetsova, Anurag Kumar, Jennifer Drexler Fox, Francis Tyers

    Abstract: Modern end-to-end speech recognition models show astonishing results in transcribing audio signals into written text. However, conventional data feeding pipelines may be sub-optimal for low-resource speech recognition, which still remains a challenging task. We propose an automated curriculum learning approach to optimize the sequence of training examples based on both the progress of the model wh…

    Submitted 17 February, 2022; originally announced February 2022.

  46. arXiv:2202.08862  [pdf, other]

    cs.SD cs.LG eess.AS

    RemixIT: Continual self-training of speech enhancement models via bootstrapped remixing

    Authors: Efthymios Tzinis, Yossi Adi, Vamsi Krishna Ithapu, Buye Xu, Paris Smaragdis, Anurag Kumar

    Abstract: We present RemixIT, a simple yet effective self-supervised method for training speech enhancement without the need of a single isolated in-domain speech nor a noise waveform. Our approach overcomes limitations of previous methods which make them dependent on clean in-domain target signals and thus, sensitive to any domain mismatch between train and test samples. RemixIT is based on a continuous se…

    Submitted 3 August, 2022; v1 submitted 17 February, 2022; originally announced February 2022.

    Comments: To appear in IEEE Journal of Selected Topics in Signal Processing

    Journal ref: J-STSP-SLSAP-00040-2022

  47. arXiv:2202.00538  [pdf, other]

    cs.SD cs.CV eess.AS

    The impact of removing head movements on audio-visual speech enhancement

    Authors: Zhiqi Kang, Mostafa Sadeghi, Radu Horaud, Xavier Alameda-Pineda, Jacob Donley, Anurag Kumar

    Abstract: This paper investigates the impact of head movements on audio-visual speech enhancement (AVSE). Although being a common conversational feature, head movements have been ignored by past and recent studies: they challenge today's learning-based methods as they often degrade the performance of models that are trained on clean, frontal, and steady face images. To alleviate this problem, we propose to…

    Submitted 2 February, 2022; v1 submitted 1 February, 2022; originally announced February 2022.

  48. arXiv:2112.04613  [pdf, other]

    cs.SD eess.AS

    NICE-Beam: Neural Integrated Covariance Estimators for Time-Varying Beamformers

    Authors: Jonah Casebeer, Jacob Donley, Daniel Wong, Buye Xu, Anurag Kumar

    Abstract: Estimating a time-varying spatial covariance matrix for a beamforming algorithm is a challenging task, especially for wearable devices, as the algorithm must compensate for time-varying signal statistics due to rapid pose-changes. In this paper, we propose Neural Integrated Covariance Estimators for Beamformers, NICE-Beam. NICE-Beam is a general technique for learning how to estimate time-varying…

    Submitted 8 December, 2021; originally announced December 2021.

  49. arXiv:2111.13920  [pdf]

    cs.CV cs.LG eess.IV

    Sparse Subspace Clustering Friendly Deep Dictionary Learning for Hyperspectral Image Classification

    Authors: Anurag Goel, Angshul Majumdar

    Abstract: Subspace clustering techniques have shown promise in hyperspectral image segmentation. The fundamental assumption in subspace clustering is that the samples belonging to different clusters/segments lie in separable subspaces. What if this condition does not hold? We surmise that even if the condition does not hold in the original space, the data may be nonlinearly transformed to a space where it w…

    Submitted 27 November, 2021; originally announced November 2021.

    Comments: IEEE Geoscience And Remote Sensing Letters

  50. arXiv:2111.00610  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units

    Authors: Anurag Katakkar, Alan W Black

    Abstract: Language models (LMs) for text data have been studied extensively for their usefulness in language generation and other downstream tasks. However, language modelling purely in the speech domain is still a relatively unexplored topic, with traditional speech LMs often depending on auxiliary text LMs for learning distributional aspects of the language. For the English language, these LMs treat words…

    Submitted 31 October, 2021; originally announced November 2021.