Search | arXiv e-print repository

arXiv:2406.20005 [pdf, other]

Malaria Cell Detection Using Deep Neural Networks

Abstract: Malaria remains one of the most pressing public health concerns globally, causing significant morbidity and mortality, especially in sub-Saharan Africa. Rapid and accurate diagnosis is crucial for effective treatment and disease management. Traditional diagnostic methods, such as microscopic examination of blood smears, are labor-intensive and require significant expertise, which may not be readil… ▽ More Malaria remains one of the most pressing public health concerns globally, causing significant morbidity and mortality, especially in sub-Saharan Africa. Rapid and accurate diagnosis is crucial for effective treatment and disease management. Traditional diagnostic methods, such as microscopic examination of blood smears, are labor-intensive and require significant expertise, which may not be readily available in resource-limited settings. This project aims to automate the detection of malaria-infected cells using a deep learning approach. We employed a convolutional neural network (CNN) based on the ResNet50 architecture, leveraging transfer learning to enhance performance. The Malaria Cell Images Dataset from Kaggle, containing 27,558 images categorized into infected and uninfected cells, was used for training and evaluation. Our model demonstrated high accuracy, precision, and recall, indicating its potential as a reliable tool for assisting in malaria diagnosis. Additionally, a web application was developed using Streamlit to allow users to upload cell images and receive predictions about malaria infection, making the technology accessible and user-friendly. This paper provides a comprehensive overview of the methodology, experiments, and results, highlighting the effectiveness of deep learning in medical image analysis. △ Less

Submitted 28 June, 2024; originally announced June 2024.

arXiv:2406.17124 [pdf, other]

Investigating Confidence Estimation Measures for Speaker Diarization

Authors: Anurag Chowdhury, Abhinav Misra, Mark C. Fuhs, Monika Woszczyna

Abstract: Speaker diarization systems segment a conversation recording based on the speakers' identity. Such systems can misclassify the speaker of a portion of audio due to a variety of factors, such as speech pattern variation, background noise, and overlap** speech. These errors propagate to, and can adversely affect, downstream systems that rely on the speaker's identity, such as speaker-adapted speec… ▽ More Speaker diarization systems segment a conversation recording based on the speakers' identity. Such systems can misclassify the speaker of a portion of audio due to a variety of factors, such as speech pattern variation, background noise, and overlap** speech. These errors propagate to, and can adversely affect, downstream systems that rely on the speaker's identity, such as speaker-adapted speech recognition. One of the ways to mitigate these errors is to provide segment-level diarization confidence scores to downstream systems. In this work, we investigate multiple methods for generating diarization confidence scores, including those derived from the original diarization system and those derived from an external model. Our experiments across multiple datasets and diarization systems demonstrate that the most competitive confidence score methods can isolate ~30% of the diarization errors within segments with the lowest ~10% of confidence scores. △ Less

Submitted 24 June, 2024; originally announced June 2024.

Comments: Accepted in INTERSPEECH 2024

arXiv:2406.15117 [pdf, other]

FA-Net: A Fuzzy Attention-aided Deep Neural Network for Pneumonia Detection in Chest X-Rays

Authors: Ayush Roy, Anurag Bhattacharjee, Diego Oliva, Oscar Ramos-Soto, Francisco J. Alvarez-Padilla, Ram Sarkar

Abstract: Pneumonia is a respiratory infection caused by bacteria, fungi, or viruses. It affects many people, particularly those in develo** or underdeveloped nations with high pollution levels, unhygienic living conditions, overcrowding, and insufficient medical infrastructure. Pneumonia can cause pleural effusion, where fluids fill the lungs, leading to respiratory difficulty. Early diagnosis is crucial… ▽ More Pneumonia is a respiratory infection caused by bacteria, fungi, or viruses. It affects many people, particularly those in develo** or underdeveloped nations with high pollution levels, unhygienic living conditions, overcrowding, and insufficient medical infrastructure. Pneumonia can cause pleural effusion, where fluids fill the lungs, leading to respiratory difficulty. Early diagnosis is crucial to ensure effective treatment and increase survival rates. Chest X-ray imaging is the most commonly used method for diagnosing pneumonia. However, visual examination of chest X-rays can be difficult and subjective. In this study, we have developed a computer-aided diagnosis system for automatic pneumonia detection using chest X-ray images. We have used DenseNet-121 and ResNet50 as the backbone for the binary class (pneumonia and normal) and multi-class (bacterial pneumonia, viral pneumonia, and normal) classification tasks, respectively. We have also implemented a channel-specific spatial attention mechanism, called Fuzzy Channel Selective Spatial Attention Module (FCSSAM), to highlight the specific spatial regions of relevant channels while removing the irrelevant channels of the extracted features by the backbone. We evaluated the proposed approach on a publicly available chest X-ray dataset, using binary and multi-class classification setups. Our proposed method achieves accuracy rates of 97.15\% and 79.79\% for the binary and multi-class classification setups, respectively. The results of our proposed method are superior to state-of-the-art (SOTA) methods. The code of the proposed model will be available at: https://github.com/AyushRoy2001/FA-Net. △ Less

Submitted 21 June, 2024; originally announced June 2024.

arXiv:2406.14861 [pdf, other]

Resilience of the Electric Grid through Trustable IoT-Coordinated Assets

Authors: Vineet J. Nair, Venkatesh Venkataramanan, Priyank Srivastava, Partha S. Sarker, Anurag Srivastava, Laurentiu D. Marinovici, Jun Zha, Christopher Irwin, Prateek Mittal, John Williams, H. Vincent Poor, Anuradha M. Annaswamy

Abstract: The electricity grid has evolved from a physical system to a cyber-physical system with digital devices that perform measurement, control, communication, computation, and actuation. The increased penetration of distributed energy resources (DERs) that include renewable generation, flexible loads, and storage provides extraordinary opportunities for improvements in efficiency and sustainability. Ho… ▽ More The electricity grid has evolved from a physical system to a cyber-physical system with digital devices that perform measurement, control, communication, computation, and actuation. The increased penetration of distributed energy resources (DERs) that include renewable generation, flexible loads, and storage provides extraordinary opportunities for improvements in efficiency and sustainability. However, they can introduce new vulnerabilities in the form of cyberattacks, which can cause significant challenges in ensuring grid resilience. %, i.e. the ability to rapidly restore grid services in the face of severe disruptions. We propose a framework in this paper for achieving grid resilience through suitably coordinated assets including a network of Internet of Things (IoT) devices. A local electricity market is proposed to identify trustable assets and carry out this coordination. Situational Awareness (SA) of locally available DERs with the ability to inject power or reduce consumption is enabled by the market, together with a monitoring procedure for their trustability and commitment. With this SA, we show that a variety of cyberattacks can be mitigated using local trustable resources without stressing the bulk grid. The demonstrations are carried out using a variety of platforms with a high-fidelity co-simulation platform, real-time hardware-in-the-loop validation, and a utility-friendly simulator. △ Less

Submitted 21 June, 2024; originally announced June 2024.

Comments: Submitted to the Proceedings of the National Academy of Sciences (PNAS), under review

arXiv:2406.11619 [pdf, other]

AV-CrossNet: an Audiovisual Complex Spectral Map** Network for Speech Separation By Leveraging Narrow- and Cross-Band Modeling

Authors: Vahid Ahmadi Kalkhorani, Cheng Yu, Anurag Kumar, Ke Tan, Buye Xu, DeLiang Wang

Abstract: Adding visual cues to audio-based speech separation can improve separation performance. This paper introduces AV-CrossNet, an audiovisual (AV) system for speech enhancement, target speaker extraction, and multi-talker speaker separation. AV-CrossNet is extended from the CrossNet architecture, which is a recently proposed network that performs complex spectral map** for speech separation by lever… ▽ More Adding visual cues to audio-based speech separation can improve separation performance. This paper introduces AV-CrossNet, an audiovisual (AV) system for speech enhancement, target speaker extraction, and multi-talker speaker separation. AV-CrossNet is extended from the CrossNet architecture, which is a recently proposed network that performs complex spectral map** for speech separation by leveraging global attention and positional encoding. To effectively utilize visual cues, the proposed system incorporates pre-extracted visual embeddings and employs a visual encoder comprising temporal convolutional layers. Audio and visual features are fused in an early fusion layer before feeding to AV-CrossNet blocks. We evaluate AV-CrossNet on multiple datasets, including LRS, VoxCeleb, and COG-MHEAR challenge. Evaluation results demonstrate that AV-CrossNet advances the state-of-the-art performance in all audiovisual tasks, even on untrained and mismatched datasets. △ Less

Submitted 17 June, 2024; originally announced June 2024.

Comments: 10 pages, 4 Figures, and 4 Tables

arXiv:2406.08914 [pdf, other]

Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition

Authors: William Ravenscroft, George Close, Stefan Goetze, Thomas Hain, Mohammad Soleymanpour, Anurag Chowdhury, Mark C. Fuhs

Abstract: One solution to automatic speech recognition (ASR) of overlap** speakers is to separate speech and then perform ASR on the separated signals. Commonly, the separator produces artefacts which often degrade ASR performance. Addressing this issue typically requires reference transcriptions to jointly train the separation and ASR networks. This is often not viable for training on real-world in-domai… ▽ More One solution to automatic speech recognition (ASR) of overlap** speakers is to separate speech and then perform ASR on the separated signals. Commonly, the separator produces artefacts which often degrade ASR performance. Addressing this issue typically requires reference transcriptions to jointly train the separation and ASR networks. This is often not viable for training on real-world in-domain audio where reference transcript information is not always available. This paper proposes a transcription-free method for joint training using only audio signals. The proposed method uses embedding differences of pre-trained ASR encoders as a loss with a proposed modification to permutation invariant training (PIT) called guided PIT (GPIT). The method achieves a 6.4% improvement in word error rate (WER) measures over a signal-level loss and also shows enhancement improvements in perceptual measures such as short-time objective intelligibility (STOI). △ Less

Submitted 13 June, 2024; originally announced June 2024.

Comments: 5 pages, 3 Figures, 3 Tables, Accepted for Interspeech 2024

arXiv:2406.04660 [pdf, other]

URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement

Authors: Wangyou Zhang, Robin Scheibler, Kohei Saijo, Samuele Cornell, Chenda Li, Zhaoheng Ni, Anurag Kumar, Jan Pirklbauer, Marvin Sach, Shinji Watanabe, Tim Fingscheidt, Yanmin Qian

Abstract: The last decade has witnessed significant advancements in deep learning-based speech enhancement (SE). However, most existing SE research has limitations on the coverage of SE sub-tasks, data diversity and amount, and evaluation metrics. To fill this gap and promote research toward universal SE, we establish a new SE challenge, named URGENT, to focus on the universality, robustness, and generaliza… ▽ More The last decade has witnessed significant advancements in deep learning-based speech enhancement (SE). However, most existing SE research has limitations on the coverage of SE sub-tasks, data diversity and amount, and evaluation metrics. To fill this gap and promote research toward universal SE, we establish a new SE challenge, named URGENT, to focus on the universality, robustness, and generalizability of SE. We aim to extend the SE definition to cover different sub-tasks to explore the limits of SE models, starting from denoising, dereverberation, bandwidth extension, and declip**. A novel framework is proposed to unify all these sub-tasks in a single model, allowing the use of all existing SE approaches. We collected public speech and noise data from different domains to construct diverse evaluation data. Finally, we discuss the insights gained from our preliminary baseline experiments based on both generative and discriminative SE methods with 12 curated metrics. △ Less

Submitted 7 June, 2024; originally announced June 2024.

Comments: 6 pages, 3 figures, 3 tables. Accepted by Interspeech 2024. An extended version of the accepted manuscript with appendix

arXiv:2405.20402 [pdf, other]

Cross-Talk Reduction

Authors: Zhong-Qiu Wang, Anurag Kumar, Shinji Watanabe

Abstract: While far-field multi-talker mixtures are recorded, each speaker can wear a close-talk microphone so that close-talk mixtures can be recorded at the same time. Although each close-talk mixture has a high signal-to-noise ratio (SNR) of the wearer, it has a very limited range of applications, as it also contains significant cross-talk speech by other speakers and is not clean enough. In this context… ▽ More While far-field multi-talker mixtures are recorded, each speaker can wear a close-talk microphone so that close-talk mixtures can be recorded at the same time. Although each close-talk mixture has a high signal-to-noise ratio (SNR) of the wearer, it has a very limited range of applications, as it also contains significant cross-talk speech by other speakers and is not clean enough. In this context, we propose a novel task named cross-talk reduction (CTR) which aims at reducing cross-talk speech, and a novel solution named CTRnet which is based on unsupervised or weakly-supervised neural speech separation. In unsupervised CTRnet, close-talk and far-field mixtures are stacked as input for a DNN to estimate the close-talk speech of each speaker. It is trained in an unsupervised, discriminative way such that the DNN estimate for each speaker can be linearly filtered to cancel out the speaker's cross-talk speech captured at other microphones. In weakly-supervised CTRnet, we assume the availability of each speaker's activity timestamps during training, and leverage them to improve the training of unsupervised CTRnet. Evaluation results on a simulated two-speaker CTR task and on a real-recorded conversational speech separation and recognition task show the effectiveness and potential of CTRnet. △ Less

Submitted 30 May, 2024; originally announced May 2024.

Comments: in International Joint Conference on Artificial Intelligence (IJCAI), 2024

arXiv:2405.01040 [pdf, other]

Few Shot Class Incremental Learning using Vision-Language models

Authors: Anurag Kumar, Chinmay Bharti, Saikat Dutta, Srikrishna Karanam, Biplab Banerjee

Abstract: Recent advancements in deep learning have demonstrated remarkable performance comparable to human capabilities across various supervised computer vision tasks. However, the prevalent assumption of having an extensive pool of training data encompassing all classes prior to model training often diverges from real-world scenarios, where limited data availability for novel classes is the norm. The cha… ▽ More Recent advancements in deep learning have demonstrated remarkable performance comparable to human capabilities across various supervised computer vision tasks. However, the prevalent assumption of having an extensive pool of training data encompassing all classes prior to model training often diverges from real-world scenarios, where limited data availability for novel classes is the norm. The challenge emerges in seamlessly integrating new classes with few samples into the training data, demanding the model to adeptly accommodate these additions without compromising its performance on base classes. To address this exigency, the research community has introduced several solutions under the realm of few-shot class incremental learning (FSCIL). In this study, we introduce an innovative FSCIL framework that utilizes language regularizer and subspace regularizer. During base training, the language regularizer helps incorporate semantic information extracted from a Vision-Language model. The subspace regularizer helps in facilitating the model's acquisition of nuanced connections between image and text semantics inherent to base classes during incremental training. Our proposed framework not only empowers the model to embrace novel classes with limited data, but also ensures the preservation of performance on base classes. To substantiate the efficacy of our approach, we conduct comprehensive experiments on three distinct FSCIL benchmarks, where our framework attains state-of-the-art performance. △ Less

Submitted 2 May, 2024; originally announced May 2024.

Comments: under review at Pattern Recognition Letters

arXiv:2404.15009 [pdf, other]

The Brain Tumor Segmentation in Pediatrics (BraTS-PEDs) Challenge: Focus on Pediatrics (CBTN-CONNECT-DIPGR-ASNR-MICCAI BraTS-PEDs)

Authors: Anahita Fathi Kazerooni, Nastaran Khalili, Deep Gandhi, Xinyang Liu, Zhifan Jiang, Syed Muhammed Anwar, Jake Albrecht, Maruf Adewole, Udunna Anazodo, Hannah Anderson, Sina Bagheri, Ujjwal Baid, Timothy Bergquist, Austin J. Borja, Evan Calabrese, Verena Chung, Gian-Marco Conte, Farouk Dako, James Eddy, Ivan Ezhov, Ariana Familiar, Keyvan Farahani, Anurag Gottipati, Debanjan Haldar, Shuvanjan Haldar , et al. (51 additional authors not shown)

Abstract: Pediatric tumors of the central nervous system are the most common cause of cancer-related death in children. The five-year survival rate for high-grade gliomas in children is less than 20%. Due to their rarity, the diagnosis of these entities is often delayed, their treatment is mainly based on historic treatment concepts, and clinical trials require multi-institutional collaborations. Here we pr… ▽ More Pediatric tumors of the central nervous system are the most common cause of cancer-related death in children. The five-year survival rate for high-grade gliomas in children is less than 20%. Due to their rarity, the diagnosis of these entities is often delayed, their treatment is mainly based on historic treatment concepts, and clinical trials require multi-institutional collaborations. Here we present the CBTN-CONNECT-DIPGR-ASNR-MICCAI BraTS-PEDs challenge, focused on pediatric brain tumors with data acquired across multiple international consortia dedicated to pediatric neuro-oncology and clinical trials. The CBTN-CONNECT-DIPGR-ASNR-MICCAI BraTS-PEDs challenge brings together clinicians and AI/imaging scientists to lead to faster development of automated segmentation techniques that could benefit clinical trials, and ultimately the care of children with brain tumors. △ Less

Submitted 29 April, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2305.17033

arXiv:2404.14729 [pdf, other]

Emergent Cooperation for Energy-efficient Connectivity via Wireless Power Transfer

Authors: Winston Hurst, Anurag Pallaprolu, Yasamin Mostofi

Abstract: This paper addresses the challenge of incentivizing energy-constrained, non-cooperative user equipment (UE) to serve as cooperative relays. We consider a source UE with a non-line-of-sight channel to an access point (AP), where direct communication may be infeasible or may necessitate a substantial transmit power. Other UEs in the vicinity are viewed as relay candidates, and our aim is to enable e… ▽ More This paper addresses the challenge of incentivizing energy-constrained, non-cooperative user equipment (UE) to serve as cooperative relays. We consider a source UE with a non-line-of-sight channel to an access point (AP), where direct communication may be infeasible or may necessitate a substantial transmit power. Other UEs in the vicinity are viewed as relay candidates, and our aim is to enable energy-efficient connectivity for the source, while accounting for the self-interested behavior and private channel state information of these candidates, by allowing the source to "pay" the candidates via wireless power transfer (WPT). We propose a cooperation-inducing protocol, inspired by Myerson auction theory, which ensures that candidates truthfully report power requirements while minimizing the expected power used by the source. Through rigorous analysis, we establish the regularity of valuations for lognormal fading channels, which allows for the efficient determination of the optimal source transmit power. Extensive simulation experiments, employing real-world communication and WPT parameters, validate our theoretical framework. Our results demonstrate a 91% reduction in outage probability with as few as 4 relay candidates, compared to the non-cooperative scenario, and as much as 48% source power savings compared to a baseline approach, highlighting the efficacy of our proposed methodology. △ Less

Submitted 23 April, 2024; originally announced April 2024.

arXiv:2403.18821 [pdf, other]

Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark

Authors: Ziyang Chen, Israel D. Gebru, Christian Richardt, Anurag Kumar, William Laney, Andrew Owens, Alexander Richard

Abstract: We present a new dataset called Real Acoustic Fields (RAF) that captures real acoustic room data from multiple modalities. The dataset includes high-quality and densely captured room impulse response data paired with multi-view images, and precise 6DoF pose tracking data for sound emitters and listeners in the rooms. We used this dataset to evaluate existing methods for novel-view acoustic synthes… ▽ More We present a new dataset called Real Acoustic Fields (RAF) that captures real acoustic room data from multiple modalities. The dataset includes high-quality and densely captured room impulse response data paired with multi-view images, and precise 6DoF pose tracking data for sound emitters and listeners in the rooms. We used this dataset to evaluate existing methods for novel-view acoustic synthesis and impulse response generation which previously relied on synthetic data. In our evaluation, we thoroughly assessed existing audio and audio-visual models against multiple criteria and proposed settings to enhance their performance on real-world data. We also conducted experiments to investigate the impact of incorporating visual data (i.e., images and depth) into neural acoustic field models. Additionally, we demonstrated the effectiveness of a simple sim2real approach, where a model is pre-trained with simulated data and fine-tuned with sparse real-world data, resulting in significant improvements in the few-shot learning approach. RAF is the first dataset to provide densely captured room acoustic data, making it an ideal resource for researchers working on audio and audio-visual neural acoustic field modeling techniques. Demos and datasets are available on our project page: https://facebookresearch.github.io/real-acoustic-fields/ △ Less

Submitted 27 March, 2024; originally announced March 2024.

Comments: Accepted to CVPR 2024. Project site: https://facebookresearch.github.io/real-acoustic-fields/

arXiv:2403.01369 [pdf, other]

A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech Enhancement

Authors: Ravi Shankar, Ke Tan, Buye Xu, Anurag Kumar

Abstract: Self-supervised learned models have been found to be very effective for certain speech tasks such as automatic speech recognition, speaker identification, keyword spotting and others. While the features are undeniably useful in speech recognition and associated tasks, their utility in speech enhancement systems is yet to be firmly established, and perhaps not properly understood. In this paper, we… ▽ More Self-supervised learned models have been found to be very effective for certain speech tasks such as automatic speech recognition, speaker identification, keyword spotting and others. While the features are undeniably useful in speech recognition and associated tasks, their utility in speech enhancement systems is yet to be firmly established, and perhaps not properly understood. In this paper, we investigate the uses of SSL representations for single-channel speech enhancement in challenging conditions and find that they add very little value for the enhancement task. Our constraints are designed around on-device real-time speech enhancement -- model is causal, the compute footprint is small. Additionally, we focus on low SNR conditions where such models struggle to provide good enhancement. In order to systematically examine how SSL representations impact performance of such enhancement models, we propose a variety of techniques to utilize these embeddings which include different forms of knowledge-distillation and pre-training. △ Less

Submitted 2 March, 2024; originally announced March 2024.

Comments: 8 pages; Shorter form accepted in ICASSP 2024

arXiv:2402.18968 [pdf, other]

Ambisonics Networks -- The Effect Of Radial Functions Regularization

Authors: Bar Shaybet, Anurag Kumar, Vladimir Tourbabin, Boaz Rafaely

Abstract: Ambisonics, a popular format of spatial audio, is the spherical harmonic (SH) representation of the plane wave density function of a sound field. Many algorithms operate in the SH domain and utilize the Ambisonics as their input signal. The process of encoding Ambisonics from a spherical microphone array involves dividing by the radial functions, which may amplify noise at low frequencies. This ca… ▽ More Ambisonics, a popular format of spatial audio, is the spherical harmonic (SH) representation of the plane wave density function of a sound field. Many algorithms operate in the SH domain and utilize the Ambisonics as their input signal. The process of encoding Ambisonics from a spherical microphone array involves dividing by the radial functions, which may amplify noise at low frequencies. This can be overcome by regularization, with the downside of introducing errors to the Ambisonics encoding. This paper aims to investigate the impact of different ways of regularization on Deep Neural Network (DNN) training and performance. Ideally, these networks should be robust to the way of regularization. Simulated data of a single speaker in a room and experimental data from the LOCATA challenge were used to evaluate this robustness on an example algorithm of speaker localization based on the direct-path dominance (DPD) test. Results show that performance may be sensitive to the way of regularization, and an informed approach is proposed and investigated, highlighting the importance of regularization information. △ Less

Submitted 29 February, 2024; originally announced February 2024.

Comments: to be published in Icassp 2024

arXiv:2401.06148 [pdf, other]

doi 10.1038/s44222-023-00096-8

Artificial Intelligence for Digital and Computational Pathology

Authors: Andrew H. Song, Guillaume Jaume, Drew F. K. Williamson, Ming Y. Lu, Anurag Vaidya, Tiffany R. Miller, Faisal Mahmood

Abstract: Advances in digitizing tissue slides and the fast-paced progress in artificial intelligence, including deep learning, have boosted the field of computational pathology. This field holds tremendous potential to automate clinical diagnosis, predict patient prognosis and response to therapy, and discover new morphological biomarkers from tissue images. Some of these artificial intelligence-based syst… ▽ More Advances in digitizing tissue slides and the fast-paced progress in artificial intelligence, including deep learning, have boosted the field of computational pathology. This field holds tremendous potential to automate clinical diagnosis, predict patient prognosis and response to therapy, and discover new morphological biomarkers from tissue images. Some of these artificial intelligence-based systems are now getting approved to assist clinical diagnosis; however, technical barriers remain for their widespread clinical adoption and integration as a research tool. This Review consolidates recent methodological advances in computational pathology for predicting clinical end points in whole-slide images and highlights how these developments enable the automation of clinical practice and the discovery of new biomarkers. We then provide future perspectives as the field expands into a broader range of clinical and research tasks with increasingly diverse modalities of clinical data. △ Less

Submitted 12 December, 2023; originally announced January 2024.

Journal ref: Nature Reviews Bioengineering 2023

arXiv:2311.18168 [pdf, other]

Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications

Authors: Karren D. Yang, Anurag Ranjan, Jen-Hao Rick Chang, Raviteja Vemulapalli, Oncel Tuzel

Abstract: We consider the task of animating 3D facial geometry from speech signal. Existing works are primarily deterministic, focusing on learning a one-to-one map** from speech signal to 3D face meshes on small datasets with limited speakers. While these models can achieve high-quality lip articulation for speakers in the training set, they are unable to capture the full and diverse distribution of 3D f… ▽ More We consider the task of animating 3D facial geometry from speech signal. Existing works are primarily deterministic, focusing on learning a one-to-one map** from speech signal to 3D face meshes on small datasets with limited speakers. While these models can achieve high-quality lip articulation for speakers in the training set, they are unable to capture the full and diverse distribution of 3D facial motions that accompany speech in the real world. Importantly, the relationship between speech and facial motion is one-to-many, containing both inter-speaker and intra-speaker variations and necessitating a probabilistic approach. In this paper, we identify and address key challenges that have so far limited the development of probabilistic models: lack of datasets and metrics that are suitable for training and evaluating them, as well as the difficulty of designing a model that generates diverse results while remaining faithful to a strong conditioning signal as speech. We first propose large-scale benchmark datasets and metrics suitable for probabilistic modeling. Then, we demonstrate a probabilistic model that achieves both diversity and fidelity to speech, outperforming other methods across the proposed benchmarks. Finally, we showcase useful applications of probabilistic models trained on these large-scale datasets: we can generate diverse speech-driven 3D facial motion that matches unseen speaker styles extracted from reference clips; and our synthetic meshes can be used to improve the performance of downstream audio-visual models. △ Less

Submitted 29 November, 2023; originally announced November 2023.

arXiv:2311.12585 [pdf]

An IoT-based Smart Parking System

Authors: Ridhi Choudhary, Arnav Sanjay Sinha, Krishna Jaiswal, Anurag Chandra

Abstract: The number of vehicles on the road is growing every day, thus there's a growing need to develop effective and hassle-free parking systems. Finding a parking space may be a big challenge, especially in crowded cities or areas with scheduled sporting or cultural events. The project suggests an automated parking system that makes use of technology like sensor systems and microcontrollers. In order to… ▽ More The number of vehicles on the road is growing every day, thus there's a growing need to develop effective and hassle-free parking systems. Finding a parking space may be a big challenge, especially in crowded cities or areas with scheduled sporting or cultural events. The project suggests an automated parking system that makes use of technology like sensor systems and microcontrollers. In order to make it easier for drivers to park in empty spots and cut down on the time and effort needed for manual searches, this system is made to identify empty parking spaces and display the available parking spots on an LCD screen. △ Less

Submitted 21 November, 2023; originally announced November 2023.

Comments: 3 pages

arXiv:2310.18820 [pdf, other]

Demand-Side Threats to Power Grid Operations from IoT-Enabled Edge

Authors: Subhash Lakshminarayana, Carsten Maple, Andrew Larkins, Daryl Flack, Christopher Few, Anurag. K. Srivastava

Abstract: The growing adoption of Internet-of-Things (IoT)-enabled energy smart appliances (ESAs) at the consumer end, such as smart heat pumps, electric vehicle chargers, etc., is seen as key to enabling demand-side response (DSR) services. However, these smart appliances are often poorly engineered from a security point of view and present a new threat to power grid operations. They may become convenient… ▽ More The growing adoption of Internet-of-Things (IoT)-enabled energy smart appliances (ESAs) at the consumer end, such as smart heat pumps, electric vehicle chargers, etc., is seen as key to enabling demand-side response (DSR) services. However, these smart appliances are often poorly engineered from a security point of view and present a new threat to power grid operations. They may become convenient entry points for malicious parties to gain access to the system and disrupt important grid operations by abruptly changing the demand. Unlike utility-side and SCADA assets, ESAs are not monitored continuously due to their large numbers and the lack of extensive monitoring infrastructure at consumer sites. This article presents an in-depth analysis of the demand side threats to power grid operations including (i) an overview of the vulnerabilities in ESAs and the wider risk from the DSR ecosystem and (ii) key factors influencing the attack impact on power grid operations. Finally, it presents measures to improve the cyber-physical resilience of power grids, putting them in the context of ongoing efforts from the industry and regulatory bodies worldwide. △ Less

Submitted 28 October, 2023; originally announced October 2023.

arXiv:2310.17864 [pdf, other]

TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch

Authors: Jeff Hwang, Moto Hira, Caroline Chen, Xiaohui Zhang, Zhaoheng Ni, Guangzhi Sun, **chuan Ma, Ruizhe Huang, Vineel Pratap, Yuekai Zhang, Anurag Kumar, Chin-Yun Yu, Chuang Zhu, Chunxi Liu, Jacob Kahn, Mirco Ravanelli, Peng Sun, Shinji Watanabe, Yangyang Shi, Yumeng Tao, Robin Scheibler, Samuele Cornell, Sean Kim, Stavros Petridis

Abstract: TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims to accelerate the research and development of audio and speech technologies by providing well-designed, easy-to-use, and performant PyTorch components. Its contributors routinely engage with users to understand their needs and fulfill them by develo** impactful features. Here, we survey TorchAudio's devel… ▽ More TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims to accelerate the research and development of audio and speech technologies by providing well-designed, easy-to-use, and performant PyTorch components. Its contributors routinely engage with users to understand their needs and fulfill them by develo** impactful features. Here, we survey TorchAudio's development principles and contents and highlight key features we include in its latest version (2.1): self-supervised learning pre-trained pipelines and training recipes, high-performance CTC decoders, speech recognition models and training recipes, advanced media I/O capabilities, and tools for performing forced alignment, multi-channel speech enhancement, and reference-less speech assessment. For a selection of these features, through empirical studies, we demonstrate their efficacy and show that they achieve competitive or state-of-the-art performance. △ Less

Submitted 26 October, 2023; originally announced October 2023.

arXiv:2310.15130 [pdf, other]

Novel-View Acoustic Synthesis from 3D Reconstructed Rooms

Authors: Byeongjoo Ahn, Karren Yang, Brian Hamilton, Jonathan Sheaffer, Anurag Ranjan, Miguel Sarabia, Oncel Tuzel, Jen-Hao Rick Chang

Abstract: We investigate the benefit of combining blind audio recordings with 3D scene information for novel-view acoustic synthesis. Given audio recordings from 2-4 microphones and the 3D geometry and material of a scene containing multiple unknown sound sources, we estimate the sound anywhere in the scene. We identify the main challenges of novel-view acoustic synthesis as sound source localization, separ… ▽ More We investigate the benefit of combining blind audio recordings with 3D scene information for novel-view acoustic synthesis. Given audio recordings from 2-4 microphones and the 3D geometry and material of a scene containing multiple unknown sound sources, we estimate the sound anywhere in the scene. We identify the main challenges of novel-view acoustic synthesis as sound source localization, separation, and dereverberation. While naively training an end-to-end network fails to produce high-quality results, we show that incorporating room impulse responses (RIRs) derived from 3D reconstructed rooms enables the same network to jointly tackle these tasks. Our method outperforms existing methods designed for the individual tasks, demonstrating its effectiveness at utilizing 3D visual information. In a simulated study on the Matterport3D-NVAS dataset, our model achieves near-perfect accuracy on source localization, a PSNR of 26.44 dB and a SDR of 14.23 dB for source separation and dereverberation, resulting in a PSNR of 25.55 dB and a SDR of 14.20 dB on novel-view acoustic synthesis. Code, pretrained model, and video results are available on the project webpage (https://github.com/apple/ml-nvas3d). △ Less

Submitted 23 October, 2023; originally announced October 2023.

arXiv:2310.08718 [pdf, other]

A Framework for Develo** and Evaluating Algorithms for Estimating Multipath Propagation Parameters from Channel Sounder Measurements

Authors: Akbar Sayeed, Damla Guven, Michael Doebereiner, Sebastian Semper, Camillo Gentile, Anuraag Bodi, Zihang Cheng

Abstract: A framework is proposed for develo** and evaluating algorithms for extracting multipath propagation components (MPCs) from measurements collected by channel sounders at millimeter-wave frequencies. Sounders equipped with an omnidirectional transmitter and a receiver with a uniform planar array (UPA) are considered. An accurate mathematical model is developed for the spatial frequency response of… ▽ More A framework is proposed for develo** and evaluating algorithms for extracting multipath propagation components (MPCs) from measurements collected by channel sounders at millimeter-wave frequencies. Sounders equipped with an omnidirectional transmitter and a receiver with a uniform planar array (UPA) are considered. An accurate mathematical model is developed for the spatial frequency response of the sounder that incorporates the non-ideal cross-polar beampatterns for the UPA elements. Due to the limited Field-of-View (FoV) of each element, the model is extended to accommodate multi-FoV measurements in distinct azimuth directions. A beamspace representation of the spatial frequency response is leveraged to develop three progressively complex algorithms aimed at solving the singlesnapshot maximum likelihood estimation problem: greedy matching pursuit (CLEAN), space-alternative generalized expectationmaximization (SAGE), and RiMAX. The first two are based on purely specular MPCs whereas RiMAX also accommodates diffuse MPCs. Two approaches for performance evaluation are proposed, one with knowledge of ground truth parameters, and one based on reconstruction mean-squared error. The three algorithms are compared through a demanding channel model with hundreds of MPCs and through real measurements. The results demonstrate that CLEAN gives quite reasonable estimates which are improved by SAGE and RiMAX. Lessons learned and directions for future research are discussed. △ Less

Submitted 12 October, 2023; originally announced October 2023.

Comments: 17 pages

arXiv:2310.04467 [pdf, other]

Design Principles for Lifelong Learning AI Accelerators

Authors: Dhireesha Kudithipudi, Anurag Daram, Abdullah M. Zyarah, Fatima Tuz Zohora, James B. Aimone, Angel Yanguas-Gil, Nicholas Soures, Emre Neftci, Matthew Mattina, Vincenzo Lomonaco, Clare D. Thiem, Benjamin Epstein

Abstract: Lifelong learning - an agent's ability to learn throughout its lifetime - is a hallmark of biological learning systems and a central challenge for artificial intelligence (AI). The development of lifelong learning algorithms could lead to a range of novel AI applications, but this will also require the development of appropriate hardware accelerators, particularly if the models are to be deployed… ▽ More Lifelong learning - an agent's ability to learn throughout its lifetime - is a hallmark of biological learning systems and a central challenge for artificial intelligence (AI). The development of lifelong learning algorithms could lead to a range of novel AI applications, but this will also require the development of appropriate hardware accelerators, particularly if the models are to be deployed on edge platforms, which have strict size, weight, and power constraints. Here, we explore the design of lifelong learning AI accelerators that are intended for deployment in untethered environments. We identify key desirable capabilities for lifelong learning accelerators and highlight metrics to evaluate such accelerators. We then discuss current edge AI accelerators and explore the future design of lifelong learning accelerators, considering the role that different emerging technologies could play. △ Less

Submitted 5 October, 2023; originally announced October 2023.

arXiv:2309.15977 [pdf, other]

Neural Acoustic Context Field: Rendering Realistic Room Impulse Response With Neural Fields

Authors: Susan Liang, Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu

Abstract: Room impulse response (RIR), which measures the sound propagation within an environment, is critical for synthesizing high-fidelity audio for a given environment. Some prior work has proposed representing RIR as a neural field function of the sound emitter and receiver positions. However, these methods do not sufficiently consider the acoustic properties of an audio scene, leading to unsatisfactor… ▽ More Room impulse response (RIR), which measures the sound propagation within an environment, is critical for synthesizing high-fidelity audio for a given environment. Some prior work has proposed representing RIR as a neural field function of the sound emitter and receiver positions. However, these methods do not sufficiently consider the acoustic properties of an audio scene, leading to unsatisfactory performance. This letter proposes a novel Neural Acoustic Context Field approach, called NACF, to parameterize an audio scene by leveraging multiple acoustic contexts, such as geometry, material property, and spatial information. Driven by the unique properties of RIR, i.e., temporal un-smoothness and monotonic energy attenuation, we design a temporal correlation module and multi-scale energy decay criterion. Experimental results show that NACF outperforms existing field-based methods by a notable margin. Please visit our project page for more qualitative results. △ Less

Submitted 27 September, 2023; originally announced September 2023.

arXiv:2309.10788 [pdf, other]

Physics-Informed Machine Learning for Data Anomaly Detection, Classification, Localization, and Mitigation: A Review, Challenges, and Path Forward

Authors: Mehdi Jabbari Zideh, Paroma Chatterjee, Anurag K. Srivastava

Abstract: Advancements in digital automation for smart grids have led to the installation of measurement devices like phasor measurement units (PMUs), micro-PMUs ($μ$-PMUs), and smart meters. However, a large amount of data collected by these devices brings several challenges as control room operators need to use this data with models to make confident decisions for reliable and resilient operation of the c… ▽ More Advancements in digital automation for smart grids have led to the installation of measurement devices like phasor measurement units (PMUs), micro-PMUs ($μ$-PMUs), and smart meters. However, a large amount of data collected by these devices brings several challenges as control room operators need to use this data with models to make confident decisions for reliable and resilient operation of the cyber-power systems. Machine-learning (ML) based tools can provide a reliable interpretation of the deluge of data obtained from the field. For the decision-makers to ensure reliable network operation under all operating conditions, these tools need to identify solutions that are feasible and satisfy the system constraints, while being efficient, trustworthy, and interpretable. This resulted in the increasing popularity of physics-informed machine learning (PIML) approaches, as these methods overcome challenges that model-based or data-driven ML methods face in silos. This work aims at the following: a) review existing strategies and techniques for incorporating underlying physical principles of the power grid into different types of ML approaches (supervised/semi-supervised learning, unsupervised learning, and reinforcement learning (RL)); b) explore the existing works on PIML methods for anomaly detection, classification, localization, and mitigation in power transmission and distribution systems, c) discuss improvements in existing methods through consideration of potential challenges while also addressing the limitations to make them suitable for real-world applications. △ Less

Submitted 19 September, 2023; originally announced September 2023.

arXiv:2309.02404 [pdf, other]

Voice Morphing: Two Identities in One Voice

Authors: Sushanta K. Pani, Anurag Chowdhury, Morgan Sandler, Arun Ross

Abstract: In a biometric system, each biometric sample or template is typically associated with a single identity. However, recent research has demonstrated the possibility of generating "morph" biometric samples that can successfully match more than a single identity. Morph attacks are now recognized as a potential security threat to biometric systems. However, most morph attacks have been studied on biome… ▽ More In a biometric system, each biometric sample or template is typically associated with a single identity. However, recent research has demonstrated the possibility of generating "morph" biometric samples that can successfully match more than a single identity. Morph attacks are now recognized as a potential security threat to biometric systems. However, most morph attacks have been studied on biometric modalities operating in the image domain, such as face, fingerprint, and iris. In this preliminary work, we introduce Voice Identity Morphing (VIM) - a voice-based morph attack that can synthesize speech samples that impersonate the voice characteristics of a pair of individuals. Our experiments evaluate the vulnerabilities of two popular speaker recognition systems, ECAPA-TDNN and x-vector, to VIM, with a success rate (MMPMR) of over 80% at a false match rate of 1% on the Librispeech dataset. △ Less

Submitted 5 September, 2023; originally announced September 2023.

Comments: Accepted oral paper at BIOSIG 2023

arXiv:2308.00122 [pdf, other]

DAVIS: High-Quality Audio-Visual Separation with Generative Diffusion Models

Authors: Chao Huang, Susan Liang, Yapeng Tian, Anurag Kumar, Chenliang Xu

Abstract: We propose DAVIS, a Diffusion model-based Audio-VIusal Separation framework that solves the audio-visual sound source separation task through a generative manner. While existing discriminative methods that perform mask regression have made remarkable progress in this field, they face limitations in capturing the complex data distribution required for high-quality separation of sounds from diverse… ▽ More We propose DAVIS, a Diffusion model-based Audio-VIusal Separation framework that solves the audio-visual sound source separation task through a generative manner. While existing discriminative methods that perform mask regression have made remarkable progress in this field, they face limitations in capturing the complex data distribution required for high-quality separation of sounds from diverse categories. In contrast, DAVIS leverages a generative diffusion model and a Separation U-Net to synthesize separated magnitudes starting from Gaussian noises, conditioned on both the audio mixture and the visual footage. With its generative objective, DAVIS is better suited to achieving the goal of high-quality sound separation across diverse categories. We compare DAVIS to existing state-of-the-art discriminative audio-visual separation methods on the domain-specific MUSIC dataset and the open-domain AVE dataset, and results show that DAVIS outperforms other methods in separation quality, demonstrating the advantages of our framework for tackling the audio-visual source separation task. △ Less

Submitted 31 July, 2023; originally announced August 2023.

arXiv:2306.14018 [pdf, other]

doi 10.1109/TSG.2023.3310893

Multi-agent Deep Reinforcement Learning for Distributed Load Restoration

Authors: Linh Vu, Tuyen Vu, Thanh-Long Vu, Anurag Srivastava

Abstract: This paper addresses the load restoration problem after power outage events. Our primary proposed methodology is using multi-agent deep reinforcement learning to optimize the load restoration process in distribution systems, modeled as networked microgrids, via determining the optimal operational sequence of circuit breakers (switches). An innovative invalid action masking technique is incorporate… ▽ More This paper addresses the load restoration problem after power outage events. Our primary proposed methodology is using multi-agent deep reinforcement learning to optimize the load restoration process in distribution systems, modeled as networked microgrids, via determining the optimal operational sequence of circuit breakers (switches). An innovative invalid action masking technique is incorporated into the multi-agent method to handle both the physical constraints in the restoration process and the curse of dimensionality as the action space of operational decisions grows exponentially with the number of circuit breakers. The features of our proposed method include centralized training for multi-agents to overcome non-stationary environment problems, decentralized execution to ease the deployment, and zero constraint violations to prevent harmful actions. Our simulations are performed in OpenDSS and Python environments to demonstrate the effectiveness of the proposed approach using the IEEE 13, 123, and 8500-node distribution test feeders. The results show that the proposed algorithm can achieve a significantly better learning curve and stability than the conventional methods. △ Less

Submitted 24 June, 2023; originally announced June 2023.

Comments: 12 pages, 19 figures, journal under review

arXiv:2305.05479 [pdf, other]

Multiple-stop** time Sequential Detection for Energy Efficient Mining in Blockchain-Enabled IoT

Authors: Anurag Gupta, Vikram Krishnamurthy

Abstract: What are the optimal times for an Internet of Things (IoT) device to act as a blockchain miner? The aim is to minimize the energy consumed by low-power IoT devices that log their data into a secure (tamper-proof) distributed ledger. We formulate a multiple stop** time Bayesian sequential detection problem to address energy-efficient blockchain mining for IoT devices. The objective is to identify… ▽ More What are the optimal times for an Internet of Things (IoT) device to act as a blockchain miner? The aim is to minimize the energy consumed by low-power IoT devices that log their data into a secure (tamper-proof) distributed ledger. We formulate a multiple stop** time Bayesian sequential detection problem to address energy-efficient blockchain mining for IoT devices. The objective is to identify $L$ optimal stops for mining, thereby maximizing the probability of successfully adding a block to the blockchain; we also present a model to optimize the number of stops (mining instants). The formulation is equivalent to a multiple stop** time POMDP. Since POMDPs are in general computationally intractable to solve, we show mathematically using submodularity arguments that the optimal mining policy has a useful structure: 1) it is monotone in belief space, and 2) it exhibits a threshold structure, which divides the belief space into two connected sets. Exploiting the structural results, we formulate a computationally-efficient linear mining policy for the blockchain-enabled IoT device. We present a policy gradient technique to optimize the parameters of the linear mining policy. Finally, we use synthetic and real Bitcoin datasets to study the performance of our proposed mining policy. We demonstrate the energy efficiency achieved by the optimal linear mining policy in contrast to other heuristic strategies. △ Less

Submitted 17 August, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

arXiv:2304.01448 [pdf, other]

TorchAudio-Squim: Reference-less Speech Quality and Intelligibility measures in TorchAudio

Authors: Anurag Kumar, Ke Tan, Zhaoheng Ni, Pranay Manocha, Xiaohui Zhang, Ethan Henderson, Buye Xu

Abstract: Measuring quality and intelligibility of a speech signal is usually a critical step in development of speech processing systems. To enable this, a variety of metrics to measure quality and intelligibility under different assumptions have been developed. Through this paper, we introduce tools and a set of models to estimate such known metrics using deep neural networks. These models are made availa… ▽ More Measuring quality and intelligibility of a speech signal is usually a critical step in development of speech processing systems. To enable this, a variety of metrics to measure quality and intelligibility under different assumptions have been developed. Through this paper, we introduce tools and a set of models to estimate such known metrics using deep neural networks. These models are made available in the well-established TorchAudio library, the core audio and speech processing library within the PyTorch deep learning framework. We refer to it as TorchAudio-Squim, TorchAudio-Speech QUality and Intelligibility Measures. More specifically, in the current version of TorchAudio-squim, we establish and release models for estimating PESQ, STOI and SI-SDR among objective metrics and MOS among subjective metrics. We develop a novel approach for objective metric estimation and use a recently developed approach for subjective metric estimation. These models operate in a ``reference-less" manner, that is they do not require the corresponding clean speech as reference for speech assessment. Given the unavailability of clean speech and the effortful process of subjective evaluation in real-world situations, such easy-to-use tools would greatly benefit speech processing research and development. △ Less

Submitted 3 April, 2023; originally announced April 2023.

Comments: ICASSP 2023

arXiv:2303.13471 [pdf, other]

Egocentric Audio-Visual Object Localization

Authors: Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu

Abstract: Humans naturally perceive surrounding scenes by unifying sound and sight in a first-person view. Likewise, machines are advanced to approach human intelligence by learning with multisensory inputs from an egocentric perspective. In this paper, we explore the challenging egocentric audio-visual object localization task and observe that 1) egomotion commonly exists in first-person recordings, even w… ▽ More Humans naturally perceive surrounding scenes by unifying sound and sight in a first-person view. Likewise, machines are advanced to approach human intelligence by learning with multisensory inputs from an egocentric perspective. In this paper, we explore the challenging egocentric audio-visual object localization task and observe that 1) egomotion commonly exists in first-person recordings, even within a short duration; 2) The out-of-view sound components can be created while wearers shift their attention. To address the first problem, we propose a geometry-aware temporal aggregation module to handle the egomotion explicitly. The effect of egomotion is mitigated by estimating the temporal geometry transformation and exploiting it to update visual representations. Moreover, we propose a cascaded feature enhancement module to tackle the second issue. It improves cross-modal localization robustness by disentangling visually-indicated audio representation. During training, we take advantage of the naturally available audio-visual temporal synchronization as the ``free'' self-supervision to avoid costly labeling. We also annotate and create the Epic Sounding Object dataset for evaluation purposes. Extensive experiments show that our method achieves state-of-the-art localization performance in egocentric videos and can be generalized to diverse audio-visual scenes. △ Less

Submitted 23 March, 2023; originally announced March 2023.

Comments: Accepted by CVPR 2023

arXiv:2302.11753 [pdf]

Securely implementing and managing neighborhood solar with storage and peer to peer transactive energy

Authors: Steven Knudsen, Subir Majumder, Anurag K. Srivastava

Abstract: In this paper, we aim to leverage peer to peer (P2P) transactive energy framework for optimal control of rooftop or neighborhood solar power with battery electric storage systems (BESS). Here we propose that the multiple neighboring customers would interconnect to form a community DC grid while still being connected to the main utility grid. The proposed infrastructure would (i) increase the resil… ▽ More In this paper, we aim to leverage peer to peer (P2P) transactive energy framework for optimal control of rooftop or neighborhood solar power with battery electric storage systems (BESS). Here we propose that the multiple neighboring customers would interconnect to form a community DC grid while still being connected to the main utility grid. The proposed infrastructure would (i) increase the resilience of the customers in case the utility grid becomes unavailable, (ii) reduce AC/DC conversion losses since a majority of the DER generation is in DC, and (iii) with the help of BESS, the customers will have a lesser impact on the utility grid in terms of California 'duck curve.' The exchange of resources and further utilization of economies of scale would significantly reduce the cost of this new infrastructure. Customers will be able to fall back on the utility AC grid in case of a cyber-attack on the community DC grid, improving cyber-security further. The proposed P2P transactive energy scheme is based on cooperation among the homeowners, using additional metering nodes to allow the sale or purchase of energy among residents. While it is expected that the scope will be limited to customers within a neighborhood, multiple neighborhoods can still interact via a utility AC grid. While it is likely that the P2P controller carrying out optimal control would work seamlessly, customers' interactions will be based on a zero-trust approach. As a part of this approach, multifactor authentication, role-based access control, etc., would be utilized, significantly improving overall cybersecurity. We examine these three strategies as case studies: (i) single home with solar power, (ii) single home with Solar power and BESS, and (iii) two or more homes with solar power and community energy storage to facilitate the policymakers making the recommendation. △ Less

Submitted 22 February, 2023; originally announced February 2023.

Comments: in Cigre Annual Meeting, Paris, Aug. 2022

arXiv:2302.08095 [pdf, other]

PAAPLoss: A Phonetic-Aligned Acoustic Parameter Loss for Speech Enhancement

Authors: Muqiao Yang, Joseph Konan, David Bick, Yunyang Zeng, Shuo Han, Anurag Kumar, Shinji Watanabe, Bhiksha Raj

Abstract: Despite rapid advancement in recent years, current speech enhancement models often produce speech that differs in perceptual quality from real clean speech. We propose a learning objective that formalizes differences in perceptual quality, by using domain knowledge of acoustic-phonetics. We identify temporal acoustic parameters -- such as spectral tilt, spectral flux, shimmer, etc. -- that are non… ▽ More Despite rapid advancement in recent years, current speech enhancement models often produce speech that differs in perceptual quality from real clean speech. We propose a learning objective that formalizes differences in perceptual quality, by using domain knowledge of acoustic-phonetics. We identify temporal acoustic parameters -- such as spectral tilt, spectral flux, shimmer, etc. -- that are non-differentiable, and we develop a neural network estimator that can accurately predict their time-series values across an utterance. We also model phoneme-specific weights for each feature, as the acoustic parameters are known to show different behavior in different phonemes. We can add this criterion as an auxiliary loss to any model that produces speech, to optimize speech outputs to match the values of clean speech in these features. Experimentally we show that it improves speech enhancement workflows in both time-domain and time-frequency domain, as measured by standard evaluation metrics. We also provide an analysis of phoneme-dependent improvement on acoustic parameters, demonstrating the additional interpretability that our method provides. This analysis can suggest which features are currently the bottleneck for improvement. △ Less

Submitted 16 February, 2023; originally announced February 2023.

Comments: Accepted at ICASSP 2023

arXiv:2302.08088 [pdf, other]

TAPLoss: A Temporal Acoustic Parameter Loss for Speech Enhancement

Authors: Yunyang Zeng, Joseph Konan, Shuo Han, David Bick, Muqiao Yang, Anurag Kumar, Shinji Watanabe, Bhiksha Raj

Abstract: Speech enhancement models have greatly progressed in recent years, but still show limits in perceptual quality of their speech outputs. We propose an objective for perceptual quality based on temporal acoustic parameters. These are fundamental speech features that play an essential role in various applications, including speaker recognition and paralinguistic analysis. We provide a differentiable… ▽ More Speech enhancement models have greatly progressed in recent years, but still show limits in perceptual quality of their speech outputs. We propose an objective for perceptual quality based on temporal acoustic parameters. These are fundamental speech features that play an essential role in various applications, including speaker recognition and paralinguistic analysis. We provide a differentiable estimator for four categories of low-level acoustic descriptors involving: frequency-related parameters, energy or amplitude-related parameters, spectral balance parameters, and temporal features. Unlike prior work that looks at aggregated acoustic parameters or a few categories of acoustic parameters, our temporal acoustic parameter (TAP) loss enables auxiliary optimization and improvement of many fine-grain speech characteristics in enhancement workflows. We show that adding TAPLoss as an auxiliary objective in speech enhancement produces speech with improved perceptual quality and intelligibility. We use data from the Deep Noise Suppression 2020 Challenge to demonstrate that both time-domain models and time-frequency domain models can benefit from our method. △ Less

Submitted 15 February, 2023; originally announced February 2023.

Comments: Accepted at ICASSP 2023

arXiv:2302.02088 [pdf, other]

AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis

Authors: Susan Liang, Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu

Abstract: Can machines recording an audio-visual scene produce realistic, matching audio-visual experiences at novel positions and novel view directions? We answer it by studying a new task -- real-world audio-visual scene synthesis -- and a first-of-its-kind NeRF-based approach for multimodal learning. Concretely, given a video recording of an audio-visual scene, the task is to synthesize new videos with s… ▽ More Can machines recording an audio-visual scene produce realistic, matching audio-visual experiences at novel positions and novel view directions? We answer it by studying a new task -- real-world audio-visual scene synthesis -- and a first-of-its-kind NeRF-based approach for multimodal learning. Concretely, given a video recording of an audio-visual scene, the task is to synthesize new videos with spatial audios along arbitrary novel camera trajectories in that scene. We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF, in which we implicitly associate audio generation with the 3D geometry and material properties of a visual environment. Furthermore, we present a coordinate transformation module that expresses a view direction relative to the sound source, enabling the model to learn sound source-centric acoustic fields. To facilitate the study of this new task, we collect a high-quality Real-World Audio-Visual Scene (RWAVS) dataset. We demonstrate the advantages of our method on this real-world dataset and the simulation-based SoundSpaces dataset. △ Less

Submitted 16 October, 2023; v1 submitted 3 February, 2023; originally announced February 2023.

Comments: NeurIPS 2023

arXiv:2301.04320 [pdf, other]

Rethinking complex-valued deep neural networks for monaural speech enhancement

Authors: Haibin Wu, Ke Tan, Buye Xu, Anurag Kumar, Daniel Wong

Abstract: Despite multiple efforts made towards adopting complex-valued deep neural networks (DNNs), it remains an open question whether complex-valued DNNs are generally more effective than real-valued DNNs for monaural speech enhancement. This work is devoted to presenting a critical assessment by systematically examining complex-valued DNNs against their real-valued counterparts. Specifically, we investi… ▽ More Despite multiple efforts made towards adopting complex-valued deep neural networks (DNNs), it remains an open question whether complex-valued DNNs are generally more effective than real-valued DNNs for monaural speech enhancement. This work is devoted to presenting a critical assessment by systematically examining complex-valued DNNs against their real-valued counterparts. Specifically, we investigate complex-valued DNN atomic units, including linear layers, convolutional layers, long short-term memory (LSTM), and gated linear units. By comparing complex- and real-valued versions of fundamental building blocks in the recently developed gated convolutional recurrent network (GCRN), we show how different mechanisms for basic blocks affect the performance. We also find that the use of complex-valued operations hinders the model capacity when the model size is small. In addition, we examine two recent complex-valued DNNs, i.e. deep complex convolutional recurrent network (DCCRN) and deep complex U-Net (DCUNET). Evaluation results show that both DNNs produce identical performance to their real-valued counterparts while requiring much more computation. Based on these comprehensive comparisons, we conclude that complex-valued DNNs do not provide a performance gain over their real-valued counterparts for monaural speech enhancement, and thus are less desirable due to their higher computational costs. △ Less

Submitted 11 January, 2023; originally announced January 2023.

arXiv:2211.10999 [pdf, other]

LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders

Authors: Rodrigo Mira, Buye Xu, Jacob Donley, Anurag Kumar, Stavros Petridis, Vamsi Krishna Ithapu, Maja Pantic

Abstract: Audio-visual speech enhancement aims to extract clean speech from a noisy environment by leveraging not only the audio itself but also the target speaker's lip movements. This approach has been shown to yield improvements over audio-only speech enhancement, particularly for the removal of interfering speech. Despite recent advances in speech synthesis, most audio-visual approaches continue to use… ▽ More Audio-visual speech enhancement aims to extract clean speech from a noisy environment by leveraging not only the audio itself but also the target speaker's lip movements. This approach has been shown to yield improvements over audio-only speech enhancement, particularly for the removal of interfering speech. Despite recent advances in speech synthesis, most audio-visual approaches continue to use spectral map**/masking to reproduce the clean audio, often resulting in visual backbones added to existing speech enhancement architectures. In this work, we propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture, and then converts them into waveform audio using a neural vocoder (HiFi-GAN). We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference. Our experiments show that LA-VocE outperforms existing methods according to multiple metrics, particularly under very noisy scenarios. △ Less

Submitted 13 March, 2023; v1 submitted 20 November, 2022; originally announced November 2022.

Comments: accepted to ICASSP 2023

arXiv:2211.08624 [pdf, ps, other]

Leveraging Heteroscedastic Uncertainty in Learning Complex Spectral Map** for Single-channel Speech Enhancement

Authors: Kuan-Lin Chen, Daniel D. E. Wong, Ke Tan, Buye Xu, Anurag Kumar, Vamsi Krishna Ithapu

Abstract: Most speech enhancement (SE) models learn a point estimate and do not make use of uncertainty estimation in the learning process. In this paper, we show that modeling heteroscedastic uncertainty by minimizing a multivariate Gaussian negative log-likelihood (NLL) improves SE performance at no extra cost. During training, our approach augments a model learning complex spectral map** with a tempora… ▽ More Most speech enhancement (SE) models learn a point estimate and do not make use of uncertainty estimation in the learning process. In this paper, we show that modeling heteroscedastic uncertainty by minimizing a multivariate Gaussian negative log-likelihood (NLL) improves SE performance at no extra cost. During training, our approach augments a model learning complex spectral map** with a temporary submodel to predict the covariance of the enhancement error at each time-frequency bin. Due to unrestricted heteroscedastic uncertainty, the covariance introduces an undersampling effect, detrimental to SE performance. To mitigate undersampling, our approach inflates the uncertainty lower bound and weights each loss component with their uncertainty, effectively compensating severely undersampled components with more penalties. Our multivariate setting reveals common covariance assumptions such as scalar and diagonal matrices. By weakening these assumptions, we show that the NLL achieves superior performance compared to popular loss functions including the mean squared error (MSE), mean absolute error (MAE), and scale-invariant signal-to-distortion ratio (SI-SDR). △ Less

Submitted 8 March, 2023; v1 submitted 15 November, 2022; originally announced November 2022.

Comments: 5 pages. Accepted at ICASSP 2023

arXiv:2210.14800 [pdf, other]

Naturalistic Head Motion Generation from Speech

Authors: Trisha Mittal, Zakaria Aldeneh, Masha Fedzechkina, Anurag Ranjan, Barry-John Theobald

Abstract: Synthesizing natural head motion to accompany speech for an embodied conversational agent is necessary for providing a rich interactive experience. Most prior works assess the quality of generated head motion by comparing them against a single ground-truth using an objective metric. Yet there are many plausible head motion sequences to accompany a speech utterance. In this work, we study the varia… ▽ More Synthesizing natural head motion to accompany speech for an embodied conversational agent is necessary for providing a rich interactive experience. Most prior works assess the quality of generated head motion by comparing them against a single ground-truth using an objective metric. Yet there are many plausible head motion sequences to accompany a speech utterance. In this work, we study the variation in the perceptual quality of head motions sampled from a generative model. We show that, despite providing more diverse head motions, the generative model produces motions with varying degrees of perceptual quality. We finally show that objective metrics commonly used in previous research do not accurately reflect the perceptual quality of generated head motions. These results open an interesting avenue for future work to investigate better objective metrics that correlate with human perception of quality. △ Less

Submitted 26 October, 2022; originally announced October 2022.

Comments: Submitted to ICASSP 2023

arXiv:2209.10043 [pdf, other]

SynthA1c: Towards Clinically Interpretable Patient Representations for Diabetes Risk Stratification

Authors: Michael S. Yao, Allison Chae, Matthew T. MacLean, Anurag Verma, Jeffrey Duda, James Gee, Drew A. Torigian, Daniel Rader, Charles Kahn, Walter R. Witschey, Hersh Sagreiya

Abstract: Early diagnosis of Type 2 Diabetes Mellitus (T2DM) is crucial to enable timely therapeutic interventions and lifestyle modifications. As the time available for clinical office visits shortens and medical imaging data become more widely available, patient image data could be used to opportunistically identify patients for additional T2DM diagnostic workup by physicians. We investigated whether imag… ▽ More Early diagnosis of Type 2 Diabetes Mellitus (T2DM) is crucial to enable timely therapeutic interventions and lifestyle modifications. As the time available for clinical office visits shortens and medical imaging data become more widely available, patient image data could be used to opportunistically identify patients for additional T2DM diagnostic workup by physicians. We investigated whether image-derived phenotypic data could be leveraged in tabular learning classifier models to predict T2DM risk in an automated fashion to flag high-risk patients without the need for additional blood laboratory measurements. In contrast to traditional binary classifiers, we leverage neural networks and decision tree models to represent patient data as 'SynthA1c' latent variables, which mimic blood hemoglobin A1c empirical lab measurements, that achieve sensitivities as high as 87.6%. To evaluate how SynthA1c models may generalize to other patient populations, we introduce a novel generalizable metric that uses vanilla data augmentation techniques to predict model performance on input out-of-domain covariates. We show that image-derived phenotypes and physical examination data together can accurately predict diabetes risk as a means of opportunistic risk stratification enabled by artificial intelligence and medical imaging. Our code is available at https://github.com/allisonjchae/DMT2RiskAssessment. △ Less

Submitted 27 July, 2023; v1 submitted 20 September, 2022; originally announced September 2022.

Comments: 12 pages. Accepted to PRIME MICCAI 2023

arXiv:2209.03715 [pdf, other]

Optimization-based framework for low-voltage grid reinforcement assessment under various levels of flexibility and coordination

Authors: Soner Candas, Beneharo Reveron Baecker, Anurag Mohapatra, Thomas Hamacher

Abstract: The rapid electrification of residential heating and mobility sectors is expected to drive the existing distribution grid assets beyond their planned operating conditions. This change will also reveal new potentials through sector coupling, flexibilities, and the local exchange of decentralized generation. This paper thus presents an optimization framework for multi-modal energy systems at the low… ▽ More The rapid electrification of residential heating and mobility sectors is expected to drive the existing distribution grid assets beyond their planned operating conditions. This change will also reveal new potentials through sector coupling, flexibilities, and the local exchange of decentralized generation. This paper thus presents an optimization framework for multi-modal energy systems at the low voltage (LV) distribution grid level. In this, we focus on the reinforcement requirements of the grid and the techno-economic assessment of flexibility components and coordination between agents. By employing a multi-level approach, computational complexity is reduced, and various levels of coordination and flexibilities are implemented. We conclude the work with a case study for a representative rural grid in Germany, in which we observe high economic potential in the flexible operation of buildings, majorly thanks to better integration of photovoltaics. Across all paradigms barring a best-case benchmark, grid reinforcement based on a mix of passive and active measures was necessary. A synergy effect is observed between flexibilities and coordination, as their combination reduces the peaks to the extent of completely avoiding grid reinforcement. The presented framework can be applied with a wide range of grid and component types to outline the broad landscape of future LV distribution grids. △ Less

Submitted 19 April, 2023; v1 submitted 8 September, 2022; originally announced September 2022.

arXiv:2207.00237 [pdf, other]

Improving Speech Enhancement through Fine-Grained Speech Characteristics

Authors: Muqiao Yang, Joseph Konan, David Bick, Anurag Kumar, Shinji Watanabe, Bhiksha Raj

Abstract: While deep learning based speech enhancement systems have made rapid progress in improving the quality of speech signals, they can still produce outputs that contain artifacts and can sound unnatural. We propose a novel approach to speech enhancement aimed at improving perceptual quality and naturalness of enhanced signals by optimizing for key characteristics of speech. We first identify key acou… ▽ More While deep learning based speech enhancement systems have made rapid progress in improving the quality of speech signals, they can still produce outputs that contain artifacts and can sound unnatural. We propose a novel approach to speech enhancement aimed at improving perceptual quality and naturalness of enhanced signals by optimizing for key characteristics of speech. We first identify key acoustic parameters that have been found to correlate well with voice quality (e.g. jitter, shimmer, and spectral flux) and then propose objective functions which are aimed at reducing the difference between clean speech and enhanced speech with respect to these features. The full set of acoustic features is the extended Geneva Acoustic Parameter Set (eGeMAPS), which includes 25 different attributes associated with perception of speech. Given the non-differentiable nature of these feature computation, we first build differentiable estimators of the eGeMAPS and then use them to fine-tune existing speech enhancement systems. Our approach is generic and can be applied to any existing deep learning based enhancement systems to further improve the enhanced speech signals. Experimental results conducted on the Deep Noise Suppression (DNS) Challenge dataset shows that our approach can improve the state-of-the-art deep learning based enhancement systems. △ Less

Submitted 11 July, 2022; v1 submitted 1 July, 2022; originally announced July 2022.

Comments: Accepted at InterSpeech 2022

arXiv:2206.12297 [pdf, other]

SAQAM: Spatial Audio Quality Assessment Metric

Authors: Pranay Manocha, Anurag Kumar, Buye Xu, Anjali Menon, Israel D. Gebru, Vamsi K. Ithapu, Paul Calamia

Abstract: Audio quality assessment is critical for assessing the perceptual realism of sounds. However, the time and expense of obtaining ''gold standard'' human judgments limit the availability of such data. For AR&VR, good perceived sound quality and localizability of sources are among the key elements to ensure complete immersion of the user. Our work introduces SAQAM which uses a multi-task learning fra… ▽ More Audio quality assessment is critical for assessing the perceptual realism of sounds. However, the time and expense of obtaining ''gold standard'' human judgments limit the availability of such data. For AR&VR, good perceived sound quality and localizability of sources are among the key elements to ensure complete immersion of the user. Our work introduces SAQAM which uses a multi-task learning framework to assess listening quality (LQ) and spatialization quality (SQ) between any given pair of binaural signals without using any subjective data. We model LQ by training on a simulated dataset of triplet human judgments, and SQ by utilizing activation-level distances from networks trained for direction of arrival (DOA) estimation. We show that SAQAM correlates well with human responses across four diverse datasets. Since it is a deep network, the metric is differentiable, making it suitable as a loss function for other tasks. For example, simply replacing an existing loss with our metric yields improvement in a speech-enhancement network. △ Less

Submitted 24 June, 2022; originally announced June 2022.

Comments: To Appear, Interspeech 2022

arXiv:2206.12285 [pdf, other]

Speech Quality Assessment through MOS using Non-Matching References

Authors: Pranay Manocha, Anurag Kumar

Abstract: Human judgments obtained through Mean Opinion Scores (MOS) are the most reliable way to assess the quality of speech signals. However, several recent attempts to automatically estimate MOS using deep learning approaches lack robustness and generalization capabilities, limiting their use in real-world applications. In this work, we present a novel framework, NORESQA-MOS, for estimating the MOS of a… ▽ More Human judgments obtained through Mean Opinion Scores (MOS) are the most reliable way to assess the quality of speech signals. However, several recent attempts to automatically estimate MOS using deep learning approaches lack robustness and generalization capabilities, limiting their use in real-world applications. In this work, we present a novel framework, NORESQA-MOS, for estimating the MOS of a speech signal. Unlike prior works, our approach uses non-matching references as a form of conditioning to ground the MOS estimation by neural networks. We show that NORESQA-MOS provides better generalization and more robust MOS estimation than previous state-of-the-art methods such as DNSMOS and NISQA, even though we use a smaller training set. Moreover, we also show that our generic framework can be combined with other learning methods such as self-supervised learning and can further supplement the benefits from these methods. △ Less

Submitted 24 June, 2022; originally announced June 2022.

Comments: To Appear, Interspeech 2022

arXiv:2203.05643 [pdf, ps, other]

Controlling Transaction Rate in Tangle Ledger: A Principal Agent Problem Approach

Authors: Anurag Gupta, Vikram Krishnamurthy

Abstract: Tangle is a distributed ledger technology that stores data as a directed acyclic graph (DAG). Unlike blockchain, Tangle does not require dedicated miners for its operation; this makes Tangle suitable for Internet of Things (IoT) applications. Distributed ledgers have a built-in transaction rate control mechanism to prevent congestion and spamming; this is typically achieved by increasing or decrea… ▽ More Tangle is a distributed ledger technology that stores data as a directed acyclic graph (DAG). Unlike blockchain, Tangle does not require dedicated miners for its operation; this makes Tangle suitable for Internet of Things (IoT) applications. Distributed ledgers have a built-in transaction rate control mechanism to prevent congestion and spamming; this is typically achieved by increasing or decreasing the proof of work (PoW) difficulty level based on the number of users. Unfortunately, this simplistic mechanism gives an unfair advantage to users with high computing power. This paper proposes a principal-agent problem (PAP) framework from microeconomics to control the transaction rate in Tangle. With users as agents and the transaction rate controller as the principal, we design a truth-telling mechanism to assign PoW difficulty levels to agents as a function of their computing power. The solution of the PAP is achieved by compensating a higher PoW difficulty level with a larger weight/reputation for the transaction. The mechanism has two benefits, (1) the security of Tangle is increased as agents are incentivized to perform difficult PoW, and (2) the rate of new transactions is moderated in Tangle. The solution of PAP is obtained by solving a mixed-integer optimization problem. We show that the optimal solution of the PAP increases with the computing power of agents. The structural results reduce the search space of the mixed-integer program and enable efficient computation of the optimal mechanism. Finally, via numerical examples, we illustrate the transaction rate control mechanism and study its impact on the dynamics of Tangle. △ Less

Submitted 18 April, 2023; v1 submitted 10 March, 2022; originally announced March 2022.

arXiv:2202.08883 [pdf, other]

Curriculum optimization for low-resource speech recognition

Authors: Anastasia Kuznetsova, Anurag Kumar, Jennifer Drexler Fox, Francis Tyers

Abstract: Modern end-to-end speech recognition models show astonishing results in transcribing audio signals into written text. However, conventional data feeding pipelines may be sub-optimal for low-resource speech recognition, which still remains a challenging task. We propose an automated curriculum learning approach to optimize the sequence of training examples based on both the progress of the model wh… ▽ More Modern end-to-end speech recognition models show astonishing results in transcribing audio signals into written text. However, conventional data feeding pipelines may be sub-optimal for low-resource speech recognition, which still remains a challenging task. We propose an automated curriculum learning approach to optimize the sequence of training examples based on both the progress of the model while training and prior knowledge about the difficulty of the training examples. We introduce a new difficulty measure called compression ratio that can be used as a scoring function for raw audio in various noise conditions. The proposed method improves speech recognition Word Error Rate performance by up to 33% relative over the baseline system △ Less

Submitted 17 February, 2022; originally announced February 2022.

arXiv:2202.08862 [pdf, other]

doi 10.1109/JSTSP.2022.3200911

RemixIT: Continual self-training of speech enhancement models via bootstrapped remixing

Authors: Efthymios Tzinis, Yossi Adi, Vamsi Krishna Ithapu, Buye Xu, Paris Smaragdis, Anurag Kumar

Abstract: We present RemixIT, a simple yet effective self-supervised method for training speech enhancement without the need of a single isolated in-domain speech nor a noise waveform. Our approach overcomes limitations of previous methods which make them dependent on clean in-domain target signals and thus, sensitive to any domain mismatch between train and test samples. RemixIT is based on a continuous se… ▽ More We present RemixIT, a simple yet effective self-supervised method for training speech enhancement without the need of a single isolated in-domain speech nor a noise waveform. Our approach overcomes limitations of previous methods which make them dependent on clean in-domain target signals and thus, sensitive to any domain mismatch between train and test samples. RemixIT is based on a continuous self-training scheme in which a pre-trained teacher model on out-of-domain data infers estimated pseudo-target signals for in-domain mixtures. Then, by permuting the estimated clean and noise signals and remixing them together, we generate a new set of bootstrapped mixtures and corresponding pseudo-targets which are used to train the student network. Vice-versa, the teacher periodically refines its estimates using the updated parameters of the latest student models. Experimental results on multiple speech enhancement datasets and tasks not only show the superiority of our method over prior approaches but also showcase that RemixIT can be combined with any separation model as well as be applied towards any semi-supervised and unsupervised domain adaptation task. Our analysis, paired with empirical evidence, sheds light on the inside functioning of our self-training scheme wherein the student model keeps obtaining better performance while observing severely degraded pseudo-targets. △ Less

Submitted 3 August, 2022; v1 submitted 17 February, 2022; originally announced February 2022.

Comments: To appear in IEEE Journal of Selected Topics in Signal Processing

Journal ref: J-STSP-SLSAP-00040-2022

arXiv:2202.00538 [pdf, other]

The impact of removing head movements on audio-visual speech enhancement

Authors: Zhiqi Kang, Mostafa Sadeghi, Radu Horaud, Xavier Alameda-Pineda, Jacob Donley, Anurag Kumar

Abstract: This paper investigates the impact of head movements on audio-visual speech enhancement (AVSE). Although being a common conversational feature, head movements have been ignored by past and recent studies: they challenge today's learning-based methods as they often degrade the performance of models that are trained on clean, frontal, and steady face images. To alleviate this problem, we propose to… ▽ More This paper investigates the impact of head movements on audio-visual speech enhancement (AVSE). Although being a common conversational feature, head movements have been ignored by past and recent studies: they challenge today's learning-based methods as they often degrade the performance of models that are trained on clean, frontal, and steady face images. To alleviate this problem, we propose to use robust face frontalization (RFF) in combination with an AVSE method based on a variational auto-encoder (VAE) model. We briefly describe the basic ingredients of the proposed pipeline and we perform experiments with a recently released audio-visual dataset. In the light of these experiments, and based on three standard metrics, namely STOI, PESQ and SI-SDR, we conclude that RFF improves the performance of AVSE by a considerable margin. △ Less

Submitted 2 February, 2022; v1 submitted 1 February, 2022; originally announced February 2022.

arXiv:2112.04613 [pdf, other]

NICE-Beam: Neural Integrated Covariance Estimators for Time-Varying Beamformers

Authors: Jonah Casebeer, Jacob Donley, Daniel Wong, Buye Xu, Anurag Kumar

Abstract: Estimating a time-varying spatial covariance matrix for a beamforming algorithm is a challenging task, especially for wearable devices, as the algorithm must compensate for time-varying signal statistics due to rapid pose-changes. In this paper, we propose Neural Integrated Covariance Estimators for Beamformers, NICE-Beam. NICE-Beam is a general technique for learning how to estimate time-varying… ▽ More Estimating a time-varying spatial covariance matrix for a beamforming algorithm is a challenging task, especially for wearable devices, as the algorithm must compensate for time-varying signal statistics due to rapid pose-changes. In this paper, we propose Neural Integrated Covariance Estimators for Beamformers, NICE-Beam. NICE-Beam is a general technique for learning how to estimate time-varying spatial covariance matrices, which we apply to joint speech enhancement and dereverberation. It is based on training a neural network module to non-linearly track and leverage scene information across time. We integrate our solution into a beamforming pipeline, which enables simple training, faster than real-time inference, and a variety of test-time adaptation options. We evaluate the proposed model against a suite of baselines in scenes with both stationary and moving microphones. Our results show that the proposed method can outperform a hand-tuned estimator, despite the hand-tuned estimator using oracle source separation knowledge. △ Less

Submitted 8 December, 2021; originally announced December 2021.

arXiv:2111.13920 [pdf]

doi 10.1109/LGRS.2021.3112603

Sparse Subspace Clustering Friendly Deep Dictionary Learning for Hyperspectral Image Classification

Authors: Anurag Goel, Angshul Majumdar

Abstract: Subspace clustering techniques have shown promise in hyperspectral image segmentation. The fundamental assumption in subspace clustering is that the samples belonging to different clusters/segments lie in separable subspaces. What if this condition does not hold? We surmise that even if the condition does not hold in the original space, the data may be nonlinearly transformed to a space where it w… ▽ More Subspace clustering techniques have shown promise in hyperspectral image segmentation. The fundamental assumption in subspace clustering is that the samples belonging to different clusters/segments lie in separable subspaces. What if this condition does not hold? We surmise that even if the condition does not hold in the original space, the data may be nonlinearly transformed to a space where it will be separable into subspaces. In this work, we propose a transformation based on the tenets of deep dictionary learning (DDL). In particular, we incorporate the sparse subspace clustering (SSC) loss in the DDL formulation. Here DDL nonlinearly transforms the data such that the transformed representation (of the data) is separable into subspaces. We show that the proposed formulation improves over the state-of-the-art deep learning techniques in hyperspectral image clustering. △ Less

Submitted 27 November, 2021; originally announced November 2021.

Comments: IEEE Geoscience And Remote Sensing Letters

arXiv:2111.00610 [pdf, other]

Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units

Authors: Anurag Katakkar, Alan W Black

Abstract: Language models (LMs) for text data have been studied extensively for their usefulness in language generation and other downstream tasks. However, language modelling purely in the speech domain is still a relatively unexplored topic, with traditional speech LMs often depending on auxiliary text LMs for learning distributional aspects of the language. For the English language, these LMs treat words… ▽ More Language models (LMs) for text data have been studied extensively for their usefulness in language generation and other downstream tasks. However, language modelling purely in the speech domain is still a relatively unexplored topic, with traditional speech LMs often depending on auxiliary text LMs for learning distributional aspects of the language. For the English language, these LMs treat words as atomic units, which presents inherent challenges to language modelling in the speech domain. In this paper, we propose a novel LSTM-based generative speech LM that is inspired by the CBOW model and built on linguistic units including syllables and phonemes. This offers better acoustic consistency across utterances in the dataset, as opposed to single melspectrogram frames, or whole words. With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech. We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features. Through our experiments, we also highlight some well known, but poorly documented challenges in training generative speech LMs, including the mismatch between the supervised learning objective with which these models are trained such as Mean Squared Error (MSE), and the true objective, which is speech quality. Our experiments provide an early indication that while validation loss and Mel Cepstral Distortion (MCD) are not strongly correlated with generated speech quality, traditional text language modelling metrics like perplexity and next-token-prediction accuracy might be. △ Less

Submitted 31 October, 2021; originally announced November 2021.

Showing 1–50 of 95 results for author: Anurag