-
Accelerating Longitudinal MRI using Prior Informed Latent Diffusion
Authors:
Yonatan Urman,
Zachary Shah,
Ashwin Kumar,
Bruno P. Soares,
Kawin Setsompop
Abstract:
MRI is a widely used ionization-free soft-tissue imaging modality, often employed repeatedly over a patient's lifetime. However, prolonged scanning durations, among other issues, can limit availability and accessibility. In this work, we aim to substantially reduce scan times by leveraging prior scans of the same patient. These prior scans typically contain considerable shared information with the current scan, thereby enabling higher acceleration rates when appropriately utilized. We propose a prior informed reconstruction method with a trained diffusion model in conjunction with data-consistency steps. Our method can be trained with unlabeled image data, eliminating the need for a dataset of either k-space measurements or paired longitudinal scans as is required of other learning-based methods. We demonstrate superiority of our method over previously suggested approaches in effectively utilizing prior information without over-biasing prior consistency, which we validate on both an open-source dataset of healthy patients as well as several longitudinal cases of clinical interest.
Submitted 29 June, 2024;
originally announced July 2024.
-
AV-CrossNet: an Audiovisual Complex Spectral Mapping Network for Speech Separation By Leveraging Narrow- and Cross-Band Modeling
Authors:
Vahid Ahmadi Kalkhorani,
Cheng Yu,
Anurag Kumar,
Ke Tan,
Buye Xu,
DeLiang Wang
Abstract:
Adding visual cues to audio-based speech separation can improve separation performance. This paper introduces AV-CrossNet, an audiovisual (AV) system for speech enhancement, target speaker extraction, and multi-talker speaker separation. AV-CrossNet is extended from the CrossNet architecture, which is a recently proposed network that performs complex spectral mapping for speech separation by leveraging global attention and positional encoding. To effectively utilize visual cues, the proposed system incorporates pre-extracted visual embeddings and employs a visual encoder comprising temporal convolutional layers. Audio and visual features are fused in an early fusion layer before feeding to AV-CrossNet blocks. We evaluate AV-CrossNet on multiple datasets, including LRS, VoxCeleb, and COG-MHEAR challenge. Evaluation results demonstrate that AV-CrossNet advances the state-of-the-art performance in all audiovisual tasks, even on untrained and mismatched datasets.
Submitted 17 June, 2024;
originally announced June 2024.
-
URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement
Authors:
Wangyou Zhang,
Robin Scheibler,
Kohei Saijo,
Samuele Cornell,
Chenda Li,
Zhaoheng Ni,
Anurag Kumar,
Jan Pirklbauer,
Marvin Sach,
Shinji Watanabe,
Tim Fingscheidt,
Yanmin Qian
Abstract:
The last decade has witnessed significant advancements in deep learning-based speech enhancement (SE). However, most existing SE research has limitations on the coverage of SE sub-tasks, data diversity and amount, and evaluation metrics. To fill this gap and promote research toward universal SE, we establish a new SE challenge, named URGENT, to focus on the universality, robustness, and generalizability of SE. We aim to extend the SE definition to cover different sub-tasks to explore the limits of SE models, starting from denoising, dereverberation, bandwidth extension, and declipping. A novel framework is proposed to unify all these sub-tasks in a single model, allowing the use of all existing SE approaches. We collected public speech and noise data from different domains to construct diverse evaluation data. Finally, we discuss the insights gained from our preliminary baseline experiments based on both generative and discriminative SE methods with 12 curated metrics.
Submitted 7 June, 2024;
originally announced June 2024.
-
Cross-Talk Reduction
Authors:
Zhong-Qiu Wang,
Anurag Kumar,
Shinji Watanabe
Abstract:
While far-field multi-talker mixtures are recorded, each speaker can wear a close-talk microphone so that close-talk mixtures can be recorded at the same time. Although each close-talk mixture has a high signal-to-noise ratio (SNR) of the wearer, it has a very limited range of applications, as it also contains significant cross-talk speech by other speakers and is not clean enough. In this context, we propose a novel task named cross-talk reduction (CTR) which aims at reducing cross-talk speech, and a novel solution named CTRnet which is based on unsupervised or weakly-supervised neural speech separation. In unsupervised CTRnet, close-talk and far-field mixtures are stacked as input for a DNN to estimate the close-talk speech of each speaker. It is trained in an unsupervised, discriminative way such that the DNN estimate for each speaker can be linearly filtered to cancel out the speaker's cross-talk speech captured at other microphones. In weakly-supervised CTRnet, we assume the availability of each speaker's activity timestamps during training, and leverage them to improve the training of unsupervised CTRnet. Evaluation results on a simulated two-speaker CTR task and on a real-recorded conversational speech separation and recognition task show the effectiveness and potential of CTRnet.
Submitted 30 May, 2024;
originally announced May 2024.
-
Few Shot Class Incremental Learning using Vision-Language models
Authors:
Anurag Kumar,
Chinmay Bharti,
Saikat Dutta,
Srikrishna Karanam,
Biplab Banerjee
Abstract:
Recent advancements in deep learning have demonstrated remarkable performance comparable to human capabilities across various supervised computer vision tasks. However, the prevalent assumption of having an extensive pool of training data encompassing all classes prior to model training often diverges from real-world scenarios, where limited data availability for novel classes is the norm. The challenge emerges in seamlessly integrating new classes with few samples into the training data, demanding the model to adeptly accommodate these additions without compromising its performance on base classes. To address this exigency, the research community has introduced several solutions under the realm of few-shot class incremental learning (FSCIL).
In this study, we introduce an innovative FSCIL framework that utilizes a language regularizer and a subspace regularizer. During base training, the language regularizer helps incorporate semantic information extracted from a Vision-Language model. During incremental training, the subspace regularizer helps the model acquire the nuanced connections between image and text semantics inherent to base classes. Our proposed framework not only empowers the model to embrace novel classes with limited data, but also ensures the preservation of performance on base classes. To substantiate the efficacy of our approach, we conduct comprehensive experiments on three distinct FSCIL benchmarks, where our framework attains state-of-the-art performance.
Submitted 2 May, 2024;
originally announced May 2024.
-
A Flexible 2.5D Medical Image Segmentation Approach with In-Slice and Cross-Slice Attention
Authors:
Amarjeet Kumar,
Hongxu Jiang,
Muhammad Imran,
Cyndi Valdes,
Gabriela Leon,
Dahyun Kang,
Parvathi Nataraj,
Yuyin Zhou,
Michael D. Weiss,
Wei Shao
Abstract:
Deep learning has become the de facto method for medical image segmentation, with 3D segmentation models excelling in capturing complex 3D structures and 2D models offering high computational efficiency. However, segmenting 2.5D images, which have high in-plane but low through-plane resolution, is a relatively unexplored challenge. While applying 2D models to individual slices of a 2.5D image is feasible, it fails to capture the spatial relationships between slices. On the other hand, 3D models face challenges such as resolution inconsistencies in 2.5D images, along with computational complexity and susceptibility to overfitting when trained with limited data. In this context, 2.5D models, which capture inter-slice correlations using only 2D neural networks, emerge as a promising solution due to their reduced computational demand and simplicity in implementation. In this paper, we introduce CSA-Net, a flexible 2.5D segmentation model capable of processing 2.5D images with an arbitrary number of slices through an innovative Cross-Slice Attention (CSA) module. This module uses the cross-slice attention mechanism to effectively capture 3D spatial information by learning long-range dependencies between the center slice (for segmentation) and its neighboring slices. Moreover, CSA-Net utilizes the self-attention mechanism to understand correlations among pixels within the center slice. We evaluated CSA-Net on three 2.5D segmentation tasks: (1) multi-class brain MRI segmentation, (2) binary prostate MRI segmentation, and (3) multi-class prostate MRI segmentation. CSA-Net outperformed leading 2D and 2.5D segmentation methods across all three tasks, demonstrating its efficacy and superiority. Our code is publicly available at https://github.com/mirthAI/CSA-Net.
Submitted 30 April, 2024;
originally announced May 2024.
-
Efficient Verification of a RADAR SoC Using Formal and Simulation-Based Methods
Authors:
Aman Kumar,
Mark Litterick,
Samuele Candido
Abstract:
As the demand for Internet of Things (IoT) and Human-to-Machine Interaction (HMI) increases, modern System-on-Chips (SoCs) offering such solutions are becoming increasingly complex. This intricate design poses significant challenges for verification, particularly when time-to-market is a crucial factor for consumer electronics products. This paper presents a case study based on our work to verify a complex Radio Detection And Ranging (RADAR) based SoC that performs on-chip sensing of human motion with millimetre accuracy. We leverage both formal and simulation-based methods to complement each other and achieve verification sign-off with high confidence. While employing a requirements-driven flow approach, we demonstrate the use of different verification methods to cater to multiple requirements and highlight our know-how from the project. Additionally, we used Machine Learning (ML) based methods, specifically the Xcelium ML tool from Cadence, to improve verification throughput.
Submitted 20 April, 2024;
originally announced April 2024.
-
Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark
Authors:
Ziyang Chen,
Israel D. Gebru,
Christian Richardt,
Anurag Kumar,
William Laney,
Andrew Owens,
Alexander Richard
Abstract:
We present a new dataset called Real Acoustic Fields (RAF) that captures real acoustic room data from multiple modalities. The dataset includes high-quality and densely captured room impulse response data paired with multi-view images, and precise 6DoF pose tracking data for sound emitters and listeners in the rooms. We used this dataset to evaluate existing methods for novel-view acoustic synthesis and impulse response generation which previously relied on synthetic data. In our evaluation, we thoroughly assessed existing audio and audio-visual models against multiple criteria and proposed settings to enhance their performance on real-world data. We also conducted experiments to investigate the impact of incorporating visual data (i.e., images and depth) into neural acoustic field models. Additionally, we demonstrated the effectiveness of a simple sim2real approach, where a model is pre-trained with simulated data and fine-tuned with sparse real-world data, resulting in significant improvements in the few-shot learning approach. RAF is the first dataset to provide densely captured room acoustic data, making it an ideal resource for researchers working on audio and audio-visual neural acoustic field modeling techniques. Demos and datasets are available on our project page: https://facebookresearch.github.io/real-acoustic-fields/
Submitted 27 March, 2024;
originally announced March 2024.
-
Intelligent fault diagnosis of worm gearbox based on adaptive CNN using amended gorilla troop optimization with quantum gate mutation strategy
Authors:
Govind Vashishtha,
Sumika Chauhan,
Surinder Kumar,
Rajesh Kumar,
Radoslaw Zimroz,
Anil Kumar
Abstract:
The worm gearbox is a high-speed transmission system that plays a vital role in various industries. Therefore, it is necessary to develop a robust fault diagnosis scheme for worm gearboxes. Due to advancements in sensor technology, researchers from academia and industry prefer deep learning models for fault diagnosis purposes. The optimal selection of hyperparameters (HPs) of deep learning models plays a significant role in stable performance. Existing methods mainly focus on manual tuning of these parameters, which is a troublesome process and sometimes leads to inaccurate results. Thus, exploring more sophisticated methods to optimize the HPs automatically is important. In this work, a novel optimization algorithm, amended gorilla troop optimization (AGTO), is proposed to make the convolutional neural network (CNN) adaptive for extracting the features to identify worm gearbox defects. Initially, the vibration and acoustic signals are converted into 2D images by the Morlet wavelet function. Then, the initial model of the CNN is developed by setting hyperparameters. Further, the search space of each HP is identified and optimized by the developed AGTO algorithm. The classification accuracy of AGTO-CNN has been evaluated and further validated by the confusion matrix. The performance of the developed model has also been compared with other models. The AGTO algorithm is examined on twenty-three classical benchmark functions and with the Wilcoxon test, which demonstrate the effectiveness and dominance of the developed optimization algorithm. The results obtained suggest that AGTO-CNN has the highest diagnostic accuracy and is stable and robust while diagnosing the worm gearbox.
Submitted 19 March, 2024;
originally announced March 2024.
-
CoroNetGAN: Controlled Pruning of GANs via Hypernetworks
Authors:
Aman Kumar,
Khushboo Anand,
Shubham Mandloi,
Ashutosh Mishra,
Avinash Thakur,
Neeraj Kasera,
Prathosh A P
Abstract:
Generative Adversarial Networks (GANs) have proven to exhibit remarkable performance and are widely used across many generative computer vision applications. However, the unprecedented demand for the deployment of GANs on resource-constrained edge devices still poses a challenge due to the huge number of parameters involved in the generation process. This has led to focused attention on the area of compressing GANs. Most of the existing works use knowledge distillation with the overhead of teacher dependency. Moreover, there is no ability to control the degree of compression in these methods. Hence, we propose CoroNet-GAN for compressing GANs using a differentiable pruning method via hypernetworks. The proposed method provides the advantage of performing controllable compression while training, along with reducing training time by a substantial factor. Experiments have been done on various conditional GAN architectures (Pix2Pix and CycleGAN) to demonstrate the effectiveness of our approach on multiple benchmark datasets such as Edges-to-Shoes, Horse-to-Zebra and Summer-to-Winter. The results obtained illustrate that our approach succeeds in outperforming the baselines on Zebra-to-Horse and Summer-to-Winter, achieving the best FID scores of 32.3 and 72.3 respectively, yielding high-fidelity images across all the datasets. Additionally, our approach also outperforms the state-of-the-art methods in achieving better inference time on various smart-phone chipsets and data-types, making it a feasible solution for deployment on edge devices.
Submitted 13 March, 2024;
originally announced March 2024.
-
Selective Encryption using Segmentation Mask with Chaotic Henon Map for Multidimensional Medical Images
Authors:
S Arut Prakash,
Aditya Ganesh Kumar,
Prabhu Shankar K. C.,
Lithicka Anandavel,
Aditya Lakshmi Narayanan
Abstract:
A user-centric design and resource optimization should be at the center of any technology or innovation. The user-centric perspective gives the developer the opportunity to develop with task-based optimization. The user in the medical image field is a medical professional who analyzes the medical images and gives their diagnosis results to the patient. This scheme, taking the medical professional user's perspective, innovates in the area of medical image storage and security. The architecture is designed with three main segments, namely: Segmentation, Storage, and Retrieval. This architecture was designed because the number of retrieval operations done by medical professionals is far higher than the number of storage operations, which are done only a handful of times for a particular medical image. This gives room for our innovation to segment out the medically indispensable part of the medical image, encrypt it, and store it. By encrypting the vital parts of the image using a strong encryption algorithm like the chaotic Henon map, we are able to keep the security intact. Retrieving the medical image then demands only the computationally less demanding decryption of the segmented region of interest. The decryption of the segmented region of interest results in the full recovery of the medical image, which can be viewed on demand by the medical professionals for various diagnosis purposes. In this scheme, we were able to achieve a retrieval speed improvement of around 47% when compared to a full image encryption of brain medical CT images.
Submitted 2 March, 2024;
originally announced March 2024.
-
A Survey of Application of Machine Learning in Wireless Indoor Positioning Systems
Authors:
Amala Sonny,
Abhinav Kumar,
Linga Reddy Cenkeramaddi
Abstract:
Indoor human positioning has become increasingly important for applications such as health monitoring, breath monitoring, human identification, safety and rescue operations, and security surveillance. However, achieving robust indoor human positioning remains challenging due to various constraints. Numerous attempts have been made in the literature to develop efficient indoor positioning systems (IPSs), with a growing focus on machine learning (ML) based techniques. This paper aims to compare and analyze current ML-based wireless techniques and approaches for indoor positioning, providing a comprehensive review of enabling technologies for human detection, positioning, and activity recognition. The study explores different input measurement data, including RSSI, TDOA, etc., for various IPSs. Key positioning techniques such as RSSI-based fingerprinting, Angle-based, and Time-based approaches are examined in conjunction with various ML methods. The survey compares the positioning accuracy, scalability, and algorithm complexity, with the goal of determining the suitable technology in various services. Finally, the paper compares distinct datasets focused on indoor localization, which have been published using diverse technologies. Overall, the paper presents a comprehensive comparison of existing techniques and localization models.
Submitted 7 March, 2024;
originally announced March 2024.
-
A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech Enhancement
Authors:
Ravi Shankar,
Ke Tan,
Buye Xu,
Anurag Kumar
Abstract:
Self-supervised learned models have been found to be very effective for certain speech tasks such as automatic speech recognition, speaker identification, keyword spotting and others. While the features are undeniably useful in speech recognition and associated tasks, their utility in speech enhancement systems is yet to be firmly established, and perhaps not properly understood. In this paper, we investigate the uses of SSL representations for single-channel speech enhancement in challenging conditions and find that they add very little value for the enhancement task. Our constraints are designed around on-device real-time speech enhancement: the model is causal and the compute footprint is small. Additionally, we focus on low SNR conditions where such models struggle to provide good enhancement. In order to systematically examine how SSL representations impact performance of such enhancement models, we propose a variety of techniques to utilize these embeddings which include different forms of knowledge-distillation and pre-training.
Submitted 2 March, 2024;
originally announced March 2024.
-
Ambisonics Networks -- The Effect Of Radial Functions Regularization
Authors:
Bar Shaybet,
Anurag Kumar,
Vladimir Tourbabin,
Boaz Rafaely
Abstract:
Ambisonics, a popular format of spatial audio, is the spherical harmonic (SH) representation of the plane wave density function of a sound field. Many algorithms operate in the SH domain and utilize the Ambisonics as their input signal. The process of encoding Ambisonics from a spherical microphone array involves dividing by the radial functions, which may amplify noise at low frequencies. This can be overcome by regularization, with the downside of introducing errors to the Ambisonics encoding. This paper aims to investigate the impact of different ways of regularization on Deep Neural Network (DNN) training and performance. Ideally, these networks should be robust to the way of regularization. Simulated data of a single speaker in a room and experimental data from the LOCATA challenge were used to evaluate this robustness on an example algorithm of speaker localization based on the direct-path dominance (DPD) test. Results show that performance may be sensitive to the way of regularization, and an informed approach is proposed and investigated, highlighting the importance of regularization information.
Submitted 29 February, 2024;
originally announced February 2024.
-
CIS-UNet: Multi-Class Segmentation of the Aorta in Computed Tomography Angiography via Context-Aware Shifted Window Self-Attention
Authors:
Muhammad Imran,
Jonathan R Krebs,
Veera Rajasekhar Reddy Gopu,
Brian Fazzone,
Vishal Balaji Sivaraman,
Amarjeet Kumar,
Chelsea Viscardi,
Robert Evans Heithaus,
Benjamin Shickel,
Yuyin Zhou,
Michol A Cooper,
Wei Shao
Abstract:
Advancements in medical imaging and endovascular grafting have facilitated minimally invasive treatments for aortic diseases. Accurate 3D segmentation of the aorta and its branches is crucial for interventions, as inaccurate segmentation can lead to erroneous surgical planning and endograft construction. Previous methods simplified aortic segmentation as a binary image segmentation problem, overlooking the necessity of distinguishing between individual aortic branches. In this paper, we introduce Context Infused Swin-UNet (CIS-UNet), a deep learning model designed for multi-class segmentation of the aorta and thirteen aortic branches. Combining the strengths of Convolutional Neural Networks (CNNs) and Swin transformers, CIS-UNet adopts a hierarchical encoder-decoder structure comprising a CNN encoder, symmetric decoder, skip connections, and a novel Context-aware Shifted Window Self-Attention (CSW-SA) as the bottleneck block. Notably, CSW-SA introduces a unique utilization of the patch merging layer, distinct from conventional Swin transformers. It efficiently condenses the feature map, providing a global spatial context and enhancing performance when applied at the bottleneck layer, offering superior computational efficiency and segmentation accuracy compared to the Swin transformers. We trained our model on computed tomography (CT) scans from 44 patients and tested it on 15 patients. CIS-UNet outperformed the state-of-the-art SwinUNetR segmentation model, which is solely based on Swin transformers, by achieving a superior mean Dice coefficient of 0.713 compared to 0.697, and a mean surface distance of 2.78 mm compared to 3.39 mm. CIS-UNet's superior 3D aortic segmentation offers improved precision and optimization for planning endovascular treatments. Our dataset and code will be publicly available.
Submitted 23 January, 2024;
originally announced January 2024.
-
Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition
Authors:
Vahid Noroozi,
Somshubra Majumdar,
Ankur Kumar,
Jagadeesh Balam,
Boris Ginsburg
Abstract:
In this paper, we propose an efficient and accurate streaming speech recognition model based on the FastConformer architecture. We adapted the FastConformer architecture for streaming applications through: (1) constraining both the look-ahead and past contexts in the encoder, and (2) introducing an activation caching mechanism to enable the non-autoregressive encoder to operate autoregressively during inference. The proposed model is designed to eliminate the accuracy disparity between training and inference time, which is common for many streaming models. Furthermore, our proposed encoder works with various decoder configurations including Connectionist Temporal Classification (CTC) and RNN-Transducer (RNNT) decoders. Additionally, we introduced a hybrid CTC/RNNT architecture which utilizes a shared encoder with both a CTC and RNNT decoder to boost the accuracy and save computation. We evaluate the proposed model on the LibriSpeech dataset and a multi-domain large scale dataset and demonstrate that it can achieve better accuracy with lower latency and inference time compared to a conventional buffered streaming model baseline. We also showed that training a model with multiple latencies can achieve better accuracy than single latency models while enabling us to support multiple latencies with a single model. Our experiments also showed that the hybrid architecture not only speeds up the convergence of the CTC decoder but also improves the accuracy of streaming models compared to single decoder models.
Submitted 2 May, 2024; v1 submitted 27 December, 2023;
originally announced December 2023.
-
Impact of Urban Street Geometry on the Detection Probability of Automotive Radars
Authors:
Mohammad Taha Shah,
Ankit Kumar,
Gourab Ghatak,
Shobha Sundar Ram
Abstract:
Prior works have analyzed the performance of millimeter wave automotive radars in the presence of diverse clutter and interference scenarios using stochastic geometry tools instead of more time-consuming measurement studies or system-level simulations. In these works, the distributions of radars or discrete clutter scatterers were modeled as Poisson point processes in Euclidean space. However, since most automotive radars are likely to be mounted on vehicles and road infrastructure, road geometries are an important factor that must be considered. Instead of considering each road geometry as an individual case for study, in this work, we model each case as a specific instance of an underlying Poisson line process and further model the distribution of vehicles on the road as a Poisson point process, forming a Poisson line Cox process. Then, through the use of stochastic geometry tools, we estimate the average number of interfering radars for specific road and vehicular densities and the effect of radar parameters such as noise and beamwidth on the radar detection metrics. The numerical results are validated with Monte Carlo simulations.
Submitted 9 December, 2023;
originally announced December 2023.
-
Echocardiogram Foundation Model -- Application 1: Estimating Ejection Fraction
Authors:
Adil Dahlan,
Cyril Zakka,
Abhinav Kumar,
Laura Tang,
Rohan Shad,
Robyn Fong,
William Hiesinger
Abstract:
Cardiovascular diseases stand as the primary global cause of mortality. Among the various imaging techniques available for visualising the heart and evaluating its function, echocardiograms emerge as the preferred choice due to their safety and low cost. Quantifying cardiac function based on echocardiograms is very laborious, time-consuming and subject to high interoperator variability. In this work, we introduce EchoAI, an echocardiogram foundation model, that is trained using self-supervised learning (SSL) on 1.5 million echocardiograms. We evaluate our approach by fine-tuning EchoAI to estimate the ejection fraction achieving a mean absolute percentage error of 9.40%. This level of accuracy aligns with the performance of expert sonographers.
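For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how a mean absolute percentage error is usually computed. The function name and the ejection-fraction values are hypothetical and purely illustrative; they are not taken from the paper, and the paper's exact evaluation protocol may differ.

```python
def mean_absolute_percentage_error(y_true, y_pred):
    """Return the MAPE in percent; reference values must be non-zero."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    # Average the absolute error relative to each reference value.
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


# Hypothetical ejection-fraction values in percent (illustration only).
reference_ef = [60.0, 50.0, 40.0]
predicted_ef = [54.0, 55.0, 40.0]
mape = mean_absolute_percentage_error(reference_ef, predicted_ef)
print(f"MAPE: {mape:.2f}%")
```

Because the error is divided by the reference value, the same absolute error weighs more heavily on patients with a low ejection fraction.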
Submitted 21 November, 2023;
originally announced November 2023.
-
A Portable Ultrasound Imaging Pipeline Implementation with GPU Acceleration on Nvidia CLARA AGX
Authors:
A. N. Madhavanunni,
V. Arun Kumar,
Mahesh Raveendranatha Panicker
Abstract:
In this paper, we present a GPU-accelerated prototype implementation of a portable ultrasound imaging pipeline on an Nvidia CLARA AGX development kit. The raw data is acquired with nonsteered plane wave transmit using a programmable handheld open platform that supports 128-channel transmit and 64-channel receive. The received signals are transferred to the Nvidia CLARA AGX developer platform through a host system for accelerated imaging. GPU-accelerated implementation of the conventional delay and sum (DAS) beamformer along with two adaptive nonlinear beamformers and two Fourier-based techniques was performed. The feasibility of the complete pipeline and its imaging performance were evaluated with in-vitro phantom imaging experiments, and the efficacy is demonstrated with preliminary in-vivo scans. The image quality quantified by the standard contrast and resolution metrics was comparable with that of the CPU implementation. The execution speed of the implemented beamformers was also investigated for different sizes of imaging grids, and a significant speedup, as high as 180 times that of the CPU implementation, was observed. Since the proposed pipeline involves Nvidia CLARA AGX, there is always the potential for easy incorporation of online/active learning approaches.
Submitted 31 October, 2023;
originally announced November 2023.
-
TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch
Authors:
Jeff Hwang,
Moto Hira,
Caroline Chen,
Xiaohui Zhang,
Zhaoheng Ni,
Guangzhi Sun,
Pingchuan Ma,
Ruizhe Huang,
Vineel Pratap,
Yuekai Zhang,
Anurag Kumar,
Chin-Yun Yu,
Chuang Zhu,
Chunxi Liu,
Jacob Kahn,
Mirco Ravanelli,
Peng Sun,
Shinji Watanabe,
Yangyang Shi,
Yumeng Tao,
Robin Scheibler,
Samuele Cornell,
Sean Kim,
Stavros Petridis
Abstract:
TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims to accelerate the research and development of audio and speech technologies by providing well-designed, easy-to-use, and performant PyTorch components. Its contributors routinely engage with users to understand their needs and fulfill them by developing impactful features. Here, we survey TorchAudio's development principles and contents and highlight key features we include in its latest version (2.1): self-supervised learning pre-trained pipelines and training recipes, high-performance CTC decoders, speech recognition models and training recipes, advanced media I/O capabilities, and tools for performing forced alignment, multi-channel speech enhancement, and reference-less speech assessment. For a selection of these features, through empirical studies, we demonstrate their efficacy and show that they achieve competitive or state-of-the-art performance.
Submitted 26 October, 2023;
originally announced October 2023.
-
Mean-field games among teams
Authors:
Jayakumar Subramanian,
Akshat Kumar,
Aditya Mahajan
Abstract:
In this paper, we present a model of a game among teams. Each team consists of a homogeneous population of agents. Agents within a team are cooperative while the teams compete with other teams. The dynamics and the costs are coupled through the empirical distribution (or the mean field) of the state of agents in each team. This mean-field is assumed to be observed by all agents. Agents have asymmetric information (also called a non-classical information structure). We propose a mean-field based refinement of the Team-Nash equilibrium of the game, which we call mean-field Markov perfect equilibrium (MF-MPE). We identify a dynamic programming decomposition to characterize MF-MPE. We then consider the case where each team has a large number of players and present a mean-field approximation which approximates the game among large-population teams as a game among infinite-population teams. We show that MF-MPE of the game among teams of infinite population is easier to compute and is an $\varepsilon$-approximate MF-MPE of the game among teams of finite population.
Submitted 18 October, 2023;
originally announced October 2023.
-
Information Geometry for the Working Information Theorist
Authors:
Kumar Vijay Mishra,
M. Ashok Kumar,
Ting-Kam Leonard Wong
Abstract:
Information geometry is a study of statistical manifolds, that is, spaces of probability distributions from a geometric perspective. Its classical information-theoretic applications relate to statistical concepts such as Fisher information, sufficient statistics, and efficient estimators. Today, information geometry has emerged as an interdisciplinary field that finds applications in diverse areas such as radar sensing, array signal processing, quantum physics, deep learning, and optimal transport. This article presents an overview of essential information geometry to initiate an information theorist, who may be unfamiliar with this exciting area of research. We explain the concepts of divergences on statistical manifolds, generalized notions of distances, orthogonality, and geodesics, thereby paving the way for concrete applications and novel theoretical investigations. We also highlight some recent information-geometric developments, which are of interest to the broader information theory community.
Submitted 5 October, 2023;
originally announced October 2023.
-
Neural Acoustic Context Field: Rendering Realistic Room Impulse Response With Neural Fields
Authors:
Susan Liang,
Chao Huang,
Yapeng Tian,
Anurag Kumar,
Chenliang Xu
Abstract:
Room impulse response (RIR), which measures the sound propagation within an environment, is critical for synthesizing high-fidelity audio for a given environment. Some prior work has proposed representing RIR as a neural field function of the sound emitter and receiver positions. However, these methods do not sufficiently consider the acoustic properties of an audio scene, leading to unsatisfactory performance. This letter proposes a novel Neural Acoustic Context Field approach, called NACF, to parameterize an audio scene by leveraging multiple acoustic contexts, such as geometry, material property, and spatial information. Driven by the unique properties of RIR, i.e., temporal un-smoothness and monotonic energy attenuation, we design a temporal correlation module and multi-scale energy decay criterion. Experimental results show that NACF outperforms existing field-based methods by a notable margin. Please visit our project page for more qualitative results.
Submitted 27 September, 2023;
originally announced September 2023.
-
AxOMaP: Designing FPGA-based Approximate Arithmetic Operators using Mathematical Programming
Authors:
Siva Satyendra Sahoo,
Salim Ullah,
Akash Kumar
Abstract:
With the increasing application of machine learning (ML) algorithms in embedded systems, there is a rising necessity to design low-cost computer arithmetic for these resource-constrained systems. As a result, emerging models of computation, such as approximate and stochastic computing, that leverage the inherent error-resilience of such algorithms are being actively explored for implementing ML inference on resource-constrained systems. Approximate computing (AxC) aims to provide disproportionate gains in the power, performance, and area (PPA) of an application by allowing some level of reduction in its behavioral accuracy (BEHAV). Using approximate operators (AxOs) for computer arithmetic forms one of the more prevalent methods of implementing AxC. AxOs provide the additional scope for finer granularity of optimization, compared to only precision scaling of computer arithmetic. To this end, designing platform-specific and cost-efficient approximate operators forms an important research goal. Recently, multiple works have reported using AI/ML-based approaches for synthesizing novel FPGA-based AxOs. However, most of such works limit usage of AI/ML to designing ML-based surrogate functions used during iterative optimization processes. To this end, we propose a novel data analysis-driven mathematical programming-based approach to synthesizing approximate operators for FPGAs. Specifically, we formulate mixed integer quadratically constrained programs based on the results of correlation analysis of the characterization data and use the solutions to enable a more directed search approach for evolutionary optimization algorithms. Compared to traditional evolutionary algorithms-based optimization, we report up to 21% improvement in the hypervolume, for joint optimization of PPA and BEHAV, in the design of signed 8-bit multipliers.
Submitted 23 September, 2023;
originally announced September 2023.
-
AxOCS: Scaling FPGA-based Approximate Operators using Configuration Supersampling
Authors:
Siva Satyendra Sahoo,
Salim Ullah,
Soumyo Bhattacharjee,
Akash Kumar
Abstract:
The rising usage of AI and ML-based processing across application domains has exacerbated the need for low-cost ML implementation, specifically for resource-constrained embedded systems. To this end, approximate computing, an approach that explores the power, performance, area (PPA), and behavioral accuracy (BEHAV) trade-offs, has emerged as a possible solution for implementing embedded machine learning. Due to the predominance of MAC operations in ML, designing platform-specific approximate arithmetic operators forms one of the major research problems in approximate computing. Recently, there has been a rising usage of AI/ML-based design space exploration techniques for implementing approximate operators. However, most of these approaches are limited to using ML-based surrogate functions for predicting the PPA and BEHAV impact of a set of related design decisions. While this approach leverages the regression capabilities of ML methods, it does not exploit the more advanced approaches in ML. To this end, we propose AxOCS, a methodology for designing approximate arithmetic operators through ML-based supersampling. Specifically, we present a method to leverage the correlation of PPA and BEHAV metrics across operators of varying bit-widths for generating larger bit-width operators. The proposed approach involves traversing the relatively smaller design space of smaller bit-width operators and employing its associated Design-PPA-BEHAV relationship to generate initial solutions for metaheuristics-based optimization for larger operators. The experimental evaluation of AxOCS for FPGA-optimized approximate operators shows that the proposed approach significantly improves the quality (the resulting hypervolume for multi-objective optimization) of 8x8 signed approximate multipliers.
Submitted 22 September, 2023;
originally announced September 2023.
-
Non-parametric Ensemble Empirical Mode Decomposition for extracting weak features to identify bearing defects
Authors:
Anil Kumar,
Yaakoub Berrouche,
Radosław Zimroz,
Govind Vashishtha,
Sumika Chauhan,
C. P. Gandhi,
Hesheng Tang,
Jiawei Xiang
Abstract:
A non-parametric complementary ensemble empirical mode decomposition (NPCEEMD) is proposed for identifying bearing defects using weak features. NPCEEMD is non-parametric because, unlike existing decomposition methods such as ensemble empirical mode decomposition, it does not require defining the ideal SNR of the noise and the number of ensembles each time a signal is processed. The simulation results show that mode mixing in NPCEEMD is lower than in the existing decomposition methods. After an in-depth simulation analysis, the proposed method is applied to experimental data. The proposed NPCEEMD method works in the following steps. First, the raw signal is obtained. Second, the obtained signal is decomposed. Then, the mutual information (MI) of the raw signal with the NPCEEMD-generated IMFs is computed. Next, IMFs with MI above 0.1 are selected and combined to form a resulting signal. Finally, the envelope spectrum of the resulting signal is computed to confirm the presence of a defect.
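The last three steps of the procedure above (MI scoring, IMF selection, envelope spectrum) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the decomposition itself is not shown, the "IMFs" are hypothetical synthetic components, the MI is a simple histogram estimate in nats (the abstract does not say how MI is estimated or normalised), and all function names are invented for this sketch.

```python
import numpy as np


def mutual_information(x, y, bins=8):
    """Histogram estimate of the mutual information (in nats) of two signals."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nz = pxy > 0                          # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))


def select_and_combine(raw, imfs, threshold=0.1):
    """Keep the IMFs whose MI with the raw signal exceeds the threshold and sum them."""
    scores = [mutual_information(raw, imf) for imf in imfs]
    kept = [imf for imf, score in zip(imfs, scores) if score > threshold]
    if not kept:
        raise ValueError("no IMF exceeds the MI threshold")
    return np.sum(kept, axis=0), scores


def envelope_spectrum(signal, fs):
    """Amplitude spectrum of the envelope obtained from the FFT-based analytic signal."""
    n = len(signal)
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    analytic = np.fft.ifft(np.fft.fft(signal) * weights)
    envelope = np.abs(analytic)
    envelope = envelope - envelope.mean()  # drop the DC term before the FFT
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs, np.abs(np.fft.rfft(envelope)) / n


# Hypothetical stand-ins for IMFs: a 200 Hz carrier modulated at 10 Hz
# (mimicking a periodic defect signature) and a weak noise component.
fs = 1000.0
t = np.arange(2000) / fs
rng = np.random.default_rng(0)
imf_defect = (1.0 + 0.5 * np.cos(2 * np.pi * 10 * t)) * np.sin(2 * np.pi * 200 * t)
imf_noise = 0.05 * rng.standard_normal(t.size)
raw = imf_defect + imf_noise

combined, scores = select_and_combine(raw, [imf_defect, imf_noise], threshold=0.1)
freqs, amplitude = envelope_spectrum(combined, fs)
peak_frequency = freqs[np.argmax(amplitude)]
print("MI scores:", scores, "envelope peak (Hz):", peak_frequency)
```

In this toy example the envelope spectrum peaks at the modulation frequency, which is the kind of signature one would compare against a bearing's characteristic defect frequencies. The histogram estimator is biased upward for small sample counts or many bins, so the 0.1 threshold is only meaningful relative to the estimator used.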
Submitted 2 October, 2023; v1 submitted 12 September, 2023;
originally announced September 2023.
-
Temporal Patience: Efficient Adaptive Deep Learning for Embedded Radar Data Processing
Authors:
Max Sponner,
Julius Ott,
Lorenzo Servadei,
Bernd Waschneck,
Robert Wille,
Akash Kumar
Abstract:
Radar sensors offer power-efficient solutions for always-on smart devices, but processing the data streams on resource-constrained embedded platforms remains challenging. This paper presents novel techniques that leverage the temporal correlation present in streaming radar data to enhance the efficiency of Early Exit Neural Networks for Deep Learning inference on embedded devices. These networks add classifier branches between the architecture's hidden layers that allow for an early termination of the inference if their result is deemed sufficient by an at-runtime decision mechanism. Our methods enable more informed decisions on when to terminate the inference, reducing computational costs while maintaining a minimal loss of accuracy.
Our results demonstrate that our techniques save up to 26% of operations per inference over a Single Exit Network and 12% over a confidence-based Early Exit version. Our proposed techniques work on commodity hardware and can be combined with traditional optimizations, making them accessible for resource-constrained embedded platforms commonly used in smart devices. Such efficiency gains enable real-time radar data processing on resource-constrained platforms, allowing for new applications in the context of smart homes, Internet-of-Things, and human-computer interaction.
Submitted 11 September, 2023;
originally announced September 2023.
-
On the Achievable Rate of MIMO Narrowband PLC with Spatio-Temporal Correlated Noise
Authors:
Mohammadreza Bakhshizadeh Mohajer,
Sadaf Moaveninejad,
Atul Kumar,
Mahmoud Elgenedy,
Naofal Al-Dhahir,
Luca Barletta,
Maurizio Magarini
Abstract:
Narrowband power line communication (NB-PLC) systems are an attractive solution for supporting current and future smart grids. A technology proposed to enhance data rate in NB-PLC is multiple-input multiple-output (MIMO) transmission over multiple power line phases. To achieve reliable communication over MIMO NB-PLC, a key challenge is to take into account and mitigate the effects of temporally and spatially correlated cyclostationary noise. Noise samples in a cycle can be divided into three classes with different distributions, i.e. Gaussian, moderate impulsive, and strong impulsive. However, in this paper we first show that the impulsive classes can in turn be divided into sub-classes with normal distributions. After deriving the theoretical capacity, we use two noise sample sets with such characteristics to evaluate achievable information rates: one sample set is the noise measured in the laboratory and the other is produced through MIMO frequency-shift (FRESH) filtering. The achievable information rates are attained by means of a spatio-temporal whitening of the portions of the cyclostationary correlated noise samples that belong to the Gaussian sub-classes. The proposed approach can be useful for designing the optimal receiver in terms of bit allocation using the waterfilling algorithm and for adapting the modulation order.
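As background for the bit-allocation remark above, here is a minimal sketch of the classical waterfilling power allocation over parallel sub-channels. It assumes each sub-channel is summarised by a gain-to-noise ratio (for example after whitening); the function name and the numeric values are hypothetical, and this is the textbook algorithm rather than the receiver design of the paper.

```python
import numpy as np


def waterfilling(gains, total_power):
    """Allocate total_power over sub-channels with gain-to-noise ratios `gains`.

    Maximises sum(log2(1 + p_i * g_i)) subject to sum(p_i) = total_power and
    p_i >= 0, whose solution is p_i = max(0, mu - 1 / g_i) for a water level mu.
    """
    gains = np.asarray(gains, dtype=float)
    inv = 1.0 / gains                        # "floor height" of each sub-channel
    inv_sorted = np.sort(inv)                # best sub-channels first
    counts = np.arange(1, inv_sorted.size + 1)
    # Candidate water level if only the k best sub-channels are active.
    levels = (total_power + np.cumsum(inv_sorted)) / counts
    # The active set is the largest k whose level still covers its own floor.
    n_active = np.nonzero(levels > inv_sorted)[0][-1] + 1
    mu = levels[n_active - 1]
    power = np.maximum(mu - inv, 0.0)
    return power, mu


# Hypothetical gain-to-noise ratios of three sub-channels (illustration only).
gains = [4.0, 1.0, 0.1]
power, water_level = waterfilling(gains, total_power=1.0)
rate = float(np.sum(np.log2(1.0 + power * np.asarray(gains))))
print("power:", power, "water level:", water_level, "rate (bit/use):", rate)
```

In this example the weakest sub-channel receives no power because its floor lies above the water level, which is why waterfilling naturally leads to per-sub-channel bit allocation and modulation-order adaptation.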
Submitted 28 August, 2023;
originally announced August 2023.
-
Debiasing Counterfactuals In the Presence of Spurious Correlations
Authors:
Amar Kumar,
Nima Fathi,
Raghav Mehta,
Brennan Nichyporuk,
Jean-Pierre R. Falet,
Sotirios Tsaftaris,
Tal Arbel
Abstract:
Deep learning models can perform well in complex medical imaging classification tasks, even when basing their conclusions on spurious correlations (i.e. confounders), should they be prevalent in the training dataset, rather than on the causal image markers of interest. This would thereby limit their ability to generalize across the population. Explainability based on counterfactual image generation can be used to expose the confounders but does not provide a strategy to mitigate the bias. In this work, we introduce the first end-to-end training framework that integrates both (i) popular debiasing classifiers (e.g. distributionally robust optimization (DRO)) to avoid latching onto the spurious correlations and (ii) counterfactual image generation to unveil generalizable imaging markers of relevance to the task. Additionally, we propose a novel metric, Spurious Correlation Latching Score (SCLS), to quantify the extent of the classifier's reliance on the spurious correlation as exposed by the counterfactual images. Through comprehensive experiments on two public datasets (with simulated and real visual artifacts), we demonstrate that the debiasing method: (i) learns generalizable markers across the population, and (ii) successfully ignores spurious correlations and focuses on the underlying disease pathology.
Submitted 21 August, 2023;
originally announced August 2023.
-
DAVIS: High-Quality Audio-Visual Separation with Generative Diffusion Models
Authors:
Chao Huang,
Susan Liang,
Yapeng Tian,
Anurag Kumar,
Chenliang Xu
Abstract:
We propose DAVIS, a Diffusion model-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task in a generative manner. While existing discriminative methods that perform mask regression have made remarkable progress in this field, they face limitations in capturing the complex data distribution required for high-quality separation of sounds from diverse categories. In contrast, DAVIS leverages a generative diffusion model and a Separation U-Net to synthesize separated magnitudes starting from Gaussian noise, conditioned on both the audio mixture and the visual footage. With its generative objective, DAVIS is better suited to achieving the goal of high-quality sound separation across diverse categories. We compare DAVIS to existing state-of-the-art discriminative audio-visual separation methods on the domain-specific MUSIC dataset and the open-domain AVE dataset, and results show that DAVIS outperforms other methods in separation quality, demonstrating the advantages of our framework for tackling the audio-visual source separation task.
Submitted 31 July, 2023;
originally announced August 2023.
-
Musical Excellence of Mridangam: an introductory review
Authors:
Arvind Shankar Kumar
Abstract:
This is an introductory review of Musical Excellence of Mridangam by Dr. Umayalpuram K Sivaraman, Dr. T Ramasami and Dr. Naresh, which is a scientific treatise exploring the unique tonal properties of the ancient Indian classical percussive instrument -- the Mridangam. This review aims to bridge the gap between the primary intended audience of Musical Excellence of Mridangam -- listeners, artistes and makers -- and the scientific rigour with which the original treatise is written, by first introducing the concepts of musical analysis and then presenting and explaining the discoveries made within this context. The first three chapters of this review introduce the basic scientific concepts used in Musical Excellence of Mridangam and provide background on previous scientific research into this instrument, starting from the seminal work of Dr. CV Raman. They also include brief discussions of the corresponding chapters in Musical Excellence of Mridangam. The remaining chapters explain the main scientific results presented in the corresponding chapters of the treatise and finally summarize the relevance of the work.
Submitted 11 July, 2023;
originally announced July 2023.
-
ShredGP: Guitarist Style-Conditioned Tablature Generation
Authors:
Pedro Sarmento,
Adarsh Kumar,
Dekun Xie,
CJ Carr,
Zack Zukowski,
Mathieu Barthet
Abstract:
GuitarPro format tablatures are a type of digital music notation that encapsulates information about guitar playing techniques and fingerings. We introduce ShredGP, a GuitarPro tablature generative Transformer-based model conditioned to imitate the style of four distinct iconic electric guitarists. In order to assess the idiosyncrasies of each guitar player, we adopt a computational musicology methodology by analysing features computed from the tokens yielded by the DadaGP encoding scheme. Statistical analyses of the features evidence significant differences between the four guitarists. We trained two variants of the ShredGP model, one using a multi-instrument corpus, the other using solo guitar data. We present a BERT-based model for guitar player classification and use it to evaluate the generated examples. Overall, results from the classifier show that ShredGP is able to generate content congruent with the style of the targeted guitar player. Finally, we reflect on prospective applications for ShredGP for human-AI music interaction.
Submitted 11 July, 2023;
originally announced July 2023.
-
Fast, Smooth, and Safe: Implicit Control Barrier Functions through Reach-Avoid Differential Dynamic Programming
Authors:
Athindran Ramesh Kumar,
Kai-Chieh Hsu,
Peter J. Ramadge,
Jaime F. Fisac
Abstract:
Safety is a central requirement for autonomous system operation across domains. Hamilton-Jacobi (HJ) reachability analysis can be used to construct "least-restrictive" safety filters that result in infrequent, but often extreme, control overrides. In contrast, control barrier function (CBF) methods apply smooth control corrections to guard the system against an often conservative safety boundary. This paper provides an online scheme to construct an implicit CBF through HJ reach-avoid differential dynamic programming in a receding-horizon framework, enabling smooth safety filtering with infinite-time safety guarantees. Simulations with the Dubins car and 5D bicycle dynamics demonstrate the scheme's ability to preserve safety smoothly without the conservativeness of handcrafted CBFs.
Submitted 30 June, 2023;
originally announced July 2023.
-
Learnable Digital Twin for Efficient Wireless Network Evaluation
Authors:
Boning Li,
Timofey Efimov,
Abhishek Kumar,
Jose Cortes,
Gunjan Verma,
Ananthram Swami,
Santiago Segarra
Abstract:
Network digital twins (NDTs) facilitate the estimation of key performance indicators (KPIs) before physically implementing a network, thereby enabling efficient optimization of the network configuration. In this paper, we propose a learning-based NDT for network simulators. The proposed method offers a holistic representation of information flow in a wireless network by integrating node, edge, and path embeddings. Through this approach, the model is trained to map the network configuration to KPIs in a single forward pass. Hence, it offers a more efficient alternative to traditional simulation-based methods, thus allowing for rapid experimentation and optimization. Our proposed method has been extensively tested through comprehensive experimentation in various scenarios, including wired and wireless networks. Results show that it outperforms baseline learning models in terms of accuracy and robustness. Moreover, our approach achieves comparable performance to simulators but with significantly higher computational efficiency.
Submitted 10 June, 2023;
originally announced June 2023.
-
An ensemble of convolution-based methods for fault detection using vibration signals
Authors:
Xian Yeow Lee,
Aman Kumar,
Lasitha Vidyaratne,
Aniruddha Rajendra Rao,
Ahmed Farahat,
Chetan Gupta
Abstract:
This paper focuses on solving a fault detection problem using multivariate time series of vibration signals collected from planetary gearboxes in a test rig. Various traditional machine learning and deep learning methods have been proposed for multivariate time-series classification, including distance-based, functional data-oriented, feature-driven, and convolution kernel-based methods. Recent studies have shown that convolution kernel-based methods like ROCKET, and 1D convolutional neural networks with ResNet and FCN, have robust performance for multivariate time-series data classification. We propose an ensemble of three convolution kernel-based methods and show its efficacy on this fault detection problem by outperforming other approaches and achieving an accuracy of more than 98.8%.
Submitted 4 May, 2023;
originally announced May 2023.
-
Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition
Authors:
Dima Rekesh,
Nithin Rao Koluguri,
Samuel Kriman,
Somshubra Majumdar,
Vahid Noroozi,
He Huang,
Oleksii Hrinchuk,
Krishna Puvvada,
Ankur Kumar,
Jagadeesh Balam,
Boris Ginsburg
Abstract:
Conformer-based models have become the dominant end-to-end architecture for speech processing tasks. With the objective of enhancing the Conformer architecture for efficient training and inference, we carefully redesigned Conformer with a novel downsampling schema. The proposed model, named Fast Conformer (FC), is 2.8x faster than the original Conformer, supports scaling to a billion parameters without any changes to the core architecture, and also achieves state-of-the-art accuracy on Automatic Speech Recognition benchmarks. To enable transcription of long-form speech up to 11 hours, we replaced global attention with limited context attention post-training, while also improving accuracy through fine-tuning with the addition of a global token. Fast Conformer, when combined with a Transformer decoder, also outperforms the original Conformer in accuracy and in speed for Speech Translation and Spoken Language Understanding.
Submitted 30 September, 2023; v1 submitted 8 May, 2023;
originally announced May 2023.
-
A Novel Deep Learning based Model for Erythrocytes Classification and Quantification in Sickle Cell Disease
Authors:
Manish Bhatia,
Balram Meena,
Vipin Kumar Rathi,
Prayag Tiwari,
Amit Kumar Jaiswal,
Shagaf M Ansari,
Ajay Kumar,
Pekka Marttinen
Abstract:
The shape of erythrocytes, or red blood cells, is altered in several pathological conditions. Therefore, identifying and quantifying different erythrocyte shapes can help diagnose various diseases and assist in designing a treatment strategy. Machine Learning (ML) can be efficiently used to identify and quantify distorted erythrocyte morphologies. In this paper, we propose a customized deep convolutional neural network (CNN) model to classify and quantify the distorted and normal morphology of erythrocytes in images taken from the blood samples of patients suffering from sickle cell disease (SCD). We chose SCD as a model disease condition due to the presence of diverse erythrocyte morphologies in the blood samples of SCD patients. For the analysis, we used 428 raw microscopic images of SCD blood samples and generated a dataset consisting of 10,377 single-cell images. We focused on three well-defined erythrocyte shapes: discocytes, oval, and sickle. We used an 18-layer deep CNN architecture to identify and quantify these shapes with 81% accuracy, outperforming other models. We also used SHAP and LIME for further interpretability. The proposed model can be helpful for the quick and accurate analysis of SCD blood samples by clinicians and can help them make the right decisions for better management of SCD.
Submitted 2 May, 2023;
originally announced May 2023.
-
From Words to Music: A Study of Subword Tokenization Techniques in Symbolic Music Generation
Authors:
Adarsh Kumar,
Pedro Sarmento
Abstract:
Subword tokenization has been widely successful in text-based natural language processing (NLP) tasks with Transformer-based models. As Transformer models become increasingly popular in symbolic music-related studies, it is imperative to investigate the efficacy of subword tokenization in the symbolic music domain. In this paper, we explore subword tokenization techniques, such as byte-pair encoding (BPE), in symbolic music generation and its impact on the overall structure of generated songs. Our experiments are based on three types of MIDI datasets: single track-melody only, multi-track with a single instrument, and multi-track and multi-instrument. We apply subword tokenization on post-musical tokenization schemes and find that it enables the generation of longer songs at the same time and improves the overall structure of the generated music in terms of objective metrics like structure indicator (SI), Pitch Class Entropy, etc. We also compare two subword tokenization methods, BPE and Unigram, and observe that both methods lead to consistent improvements. Our study suggests that subword tokenization is a promising technique for symbolic music generation and may have broader implications for music composition, particularly in cases involving complex data such as multi-track songs.
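As a rough illustration of the byte-pair encoding idea mentioned in the abstract above, the following minimal sketch learns BPE merges over sequences of musical tokens. It is not the authors' implementation: the token names (`NOTE_60`, `DUR_4`), the function names, and the `+` joining convention are all hypothetical, chosen only to show how frequent adjacent token pairs are fused into longer subword units.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Return the most frequent adjacent token pair, or None if there is none."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq, pair, new_token):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_token`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(sequences, num_merges):
    """Greedily learn up to `num_merges` merges; return (merges, encoded sequences)."""
    sequences = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        # Hypothetical naming convention for the merged token.
        new_token = pair[0] + "+" + pair[1]
        merges.append(pair)
        sequences = [merge_pair(s, pair, new_token) for s in sequences]
    return merges, sequences


# Hypothetical toy corpus of two tokenized "songs".
songs = [
    ["NOTE_60", "DUR_4", "NOTE_62", "DUR_4"],
    ["NOTE_60", "DUR_4", "NOTE_60", "DUR_4"],
]
merges, encoded = learn_bpe(songs, num_merges=1)
```

Each merge shortens the encoded sequences, which is one plausible reading of why subword tokenization lets a model with a fixed context length cover longer stretches of music.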
Submitted 25 April, 2023; v1 submitted 18 April, 2023;
originally announced April 2023.
-
Cashew dataset generation using augmentation and RaLSGAN and a transfer learning based tinyML approach towards disease detection
Authors:
Varsha Jayaprakash,
Akilesh K,
Ajay kumar,
Balamurugan M. S,
Manoj Kumar Rajagopal
Abstract:
Cashew is one of the most extensively consumed nuts in the world, and it is also known as a cash crop. A tree may generate a substantial yield in a few months and has a lifetime of around 70 to 80 years. Yet, in addition to the benefits, there are certain constraints to its cultivation. With the exception of parasites and algae, anthracnose is the most common disease affecting trees. When it comes to cashew, the dense structure of the tree makes the disease harder to diagnose than in short crops. Hence, we present a dataset that exclusively consists of healthy and diseased cashew leaves and fruits. The dataset is authenticated by adding RGB color transformation to highlight diseased regions, photometric and geometric augmentations, and RaLSGAN to enlarge the initial collection of images and boost performance in real-time situations when working with a constrained dataset. Further, transfer learning is used to test the classification efficiency of the dataset using algorithms such as MobileNet and Inception. TensorFlow Lite is utilized to develop these algorithms for real-time disease diagnosis using drones. Several post-training optimization strategies are utilized, and their memory size is compared. They have proven their effectiveness by delivering high accuracy (up to 99%) and a decrease in memory and latency, making them ideal for use in applications with limited resources.
Submitted 18 April, 2023;
originally announced April 2023.
-
Cross-modulated Few-shot Image Generation for Colorectal Tissue Classification
Authors:
Amandeep Kumar,
Ankan kumar Bhunia,
Sanath Narayan,
Hisham Cholakkal,
Rao Muhammad Anwer,
Jorma Laaksonen,
Fahad Shahbaz Khan
Abstract:
In this work, we propose a few-shot colorectal tissue image generation method for addressing the scarcity of histopathological training data for rare cancer tissues. Our few-shot generation method, named XM-GAN, takes one base and a pair of reference tissue images as input and generates high-quality yet diverse images. Within our XM-GAN, a novel controllable fusion block densely aggregates local regions of reference images based on their similarity to those in the base image, resulting in locally consistent features. To the best of our knowledge, we are the first to investigate few-shot generation in colorectal tissue images. We evaluate our few-shot colorectal tissue image generation by performing extensive qualitative, quantitative and subject specialist (pathologist) based evaluations. Specifically, in specialist-based evaluation, pathologists could differentiate between our XM-GAN generated tissue images and real images only 55% of the time. Moreover, we utilize these generated images as data augmentation to address the few-shot tissue image classification task, achieving a gain of 4.4% in terms of mean accuracy over the vanilla few-shot classifier. Code: https://github.com/VIROBO-15/XM-GAN
Submitted 4 July, 2023; v1 submitted 4 April, 2023;
originally announced April 2023.
-
TorchAudio-Squim: Reference-less Speech Quality and Intelligibility measures in TorchAudio
Authors:
Anurag Kumar,
Ke Tan,
Zhaoheng Ni,
Pranay Manocha,
Xiaohui Zhang,
Ethan Henderson,
Buye Xu
Abstract:
Measuring the quality and intelligibility of a speech signal is usually a critical step in the development of speech processing systems. To enable this, a variety of metrics to measure quality and intelligibility under different assumptions have been developed. Through this paper, we introduce tools and a set of models to estimate such known metrics using deep neural networks. These models are made available in the well-established TorchAudio library, the core audio and speech processing library within the PyTorch deep learning framework. We refer to it as TorchAudio-Squim, TorchAudio-Speech QUality and Intelligibility Measures. More specifically, in the current version of TorchAudio-Squim, we establish and release models for estimating PESQ, STOI and SI-SDR among objective metrics and MOS among subjective metrics. We develop a novel approach for objective metric estimation and use a recently developed approach for subjective metric estimation. These models operate in a "reference-less" manner, that is, they do not require the corresponding clean speech as reference for speech assessment. Given the unavailability of clean speech and the effortful process of subjective evaluation in real-world situations, such easy-to-use tools would greatly benefit speech processing research and development.
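For context on what a "reference-less" estimator is approximating, the sketch below computes SI-SDR, one of the objective metrics named above, in its usual reference-based form. This is not the TorchAudio-Squim model or API; it is a plain-Python illustration of the standard definition (projection of the estimate onto the reference, then a target-to-residual energy ratio in dB), with a hypothetical function name and with mean removal omitted for brevity.

```python
import math


def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB (reference-based form).

    Assumes equal-length sequences of samples and a non-silent reference.
    """
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    # Optimal scaling of the reference that best explains the estimate.
    scale = dot / ref_energy
    target = [scale * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    if residual_energy == 0:
        return math.inf  # estimate is an exact scaled copy of the reference
    return 10 * math.log10(target_energy / residual_energy)


# Hypothetical toy signals: the residual has the same energy as the target.
clean = [1.0, 0.0, -1.0, 0.0]
noisy = [1.0, 1.0, -1.0, 1.0]
score = si_sdr(noisy, clean)
```

Because the reference is rescaled before the comparison, multiplying the estimate by a constant leaves the score unchanged, which is the "scale-invariant" part of the name. A reference-less model has to predict this value from `noisy` alone.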
Submitted 3 April, 2023;
originally announced April 2023.
-
Egocentric Audio-Visual Object Localization
Authors:
Chao Huang,
Yapeng Tian,
Anurag Kumar,
Chenliang Xu
Abstract:
Humans naturally perceive surrounding scenes by unifying sound and sight in a first-person view. Likewise, machines are advanced to approach human intelligence by learning with multisensory inputs from an egocentric perspective. In this paper, we explore the challenging egocentric audio-visual object localization task and observe that 1) egomotion commonly exists in first-person recordings, even within a short duration; and 2) out-of-view sound components can be created while wearers shift their attention. To address the first problem, we propose a geometry-aware temporal aggregation module to handle the egomotion explicitly. The effect of egomotion is mitigated by estimating the temporal geometry transformation and exploiting it to update visual representations. Moreover, we propose a cascaded feature enhancement module to tackle the second issue. It improves cross-modal localization robustness by disentangling visually-indicated audio representation. During training, we take advantage of the naturally available audio-visual temporal synchronization as the "free" self-supervision to avoid costly labeling. We also annotate and create the Epic Sounding Object dataset for evaluation purposes. Extensive experiments show that our method achieves state-of-the-art localization performance in egocentric videos and can be generalized to diverse audio-visual scenes.
Submitted 23 March, 2023;
originally announced March 2023.
-
Legs as Manipulator: Pushing Quadrupedal Agility Beyond Locomotion
Authors:
Xuxin Cheng,
Ashish Kumar,
Deepak Pathak
Abstract:
Locomotion has seen dramatic progress for walking or running across challenging terrains. However, robotic quadrupeds are still far behind their biological counterparts, such as dogs, which display a variety of agile skills and can use their legs beyond locomotion to perform several basic manipulation tasks like interacting with objects and climbing. In this paper, we take a step towards bridging this gap by training quadruped robots not only to walk but also to use the front legs to climb walls, press buttons, and perform object interaction in the real world. To handle this challenging optimization, we decouple the skill learning broadly into locomotion, which covers any movement, whether walking or climbing a wall, and manipulation, which involves using one leg to interact while balancing on the other three legs. These skills are trained in simulation using a curriculum and transferred to the real world using our proposed sim2real variant that builds upon recent locomotion success. Finally, we combine these skills into a robust long-term plan by learning a behavior tree that encodes a high-level task hierarchy from one clean expert demonstration. We evaluate our method in both simulation and the real world, showing successful executions of both short- and long-range tasks and how robustness helps confront external perturbations. Videos at https://robot-skills.github.io
Submitted 22 March, 2023; v1 submitted 20 March, 2023;
originally announced March 2023.
-
Malaria detection using Deep Convolution Neural Network
Authors:
Sumit Kumar,
Harsh Vardhan,
Sneha Priya,
Ayush Kumar
Abstract:
The latest WHO report showed that the number of malaria cases climbed to 219 million last year, two million higher than the previous year. The global efforts to fight malaria have hit a plateau, and the most significant underlying reason is that international funding has declined. Malaria, which is spread to people through the bites of infected female mosquitoes, occurs in 91 countries, but about 90% of the cases and deaths are in sub-Saharan Africa. The disease killed 435,000 people last year, the majority of them children under five in Africa. AI-backed technology has revolutionized malaria detection in some regions of Africa, and the future impact of such work can be revolutionary. The Malaria Cell Image dataset is taken from the official NIH website. The dataset was collected to reduce the burden on microscopists in resource-constrained regions and to improve diagnostic accuracy using an AI-based algorithm to detect and segment the red blood cells. The goal of this work is to show that state-of-the-art accuracy can be obtained even with a 2-layer convolutional network, and to establish a new baseline in malaria detection efforts using AI.
Submitted 6 January, 2024; v1 submitted 4 March, 2023;
originally announced March 2023.
-
Two-Dimensional Wide Dynamic Range Displacement Sensor using Dielectric Resonator Coupled Microwave Circuit
Authors:
Premsai Regalla,
A. V. Praveen Kumar
Abstract:
In this paper, the authors propose a two-dimensional, wide dynamic range, linear displacement sensor using microwave methods. The microwave sensor circuit employs a cylindrical dielectric resonator (DR) proximity coupled to a pair of orthogonal microstrip lines formed on a microwave substrate. The DR rests on the substrate and is free to be displaced between the strips on the 2D plane of the substrate. The strips excite a particular resonant mode of the DR, the intensity of which varies with the DR's proximity to the strips. The DR's position can thus be read out in terms of the two-port S-parameters of the circuit, at a fixed frequency determined by the resonant mode of the DR. Such fixed-frequency sensors are robust in operation and cost-effective in realization, an important aspect of this sensor. Initial one-dimensional positioning simulations of the sensor through three fixed representative paths on the substrate reveal that the S-parameters vary monotonically with the displacement. Prototype measurements reveal a dynamic range of 23 mm for horizontal or vertical displacement, and 30 mm for diagonal displacement, at the resonant frequency of 3.67 GHz. Next, a 2D positioning test is conducted and a technique for one-to-one mapping from the S-parameters to the 2D position is demonstrated. To conclude, the proposed sensor's performance is compared with that of existing 2D sensors.
Submitted 23 February, 2023;
originally announced February 2023.
-
PAAPLoss: A Phonetic-Aligned Acoustic Parameter Loss for Speech Enhancement
Authors:
Muqiao Yang,
Joseph Konan,
David Bick,
Yunyang Zeng,
Shuo Han,
Anurag Kumar,
Shinji Watanabe,
Bhiksha Raj
Abstract:
Despite rapid advancement in recent years, current speech enhancement models often produce speech that differs in perceptual quality from real clean speech. We propose a learning objective that formalizes differences in perceptual quality, by using domain knowledge of acoustic-phonetics. We identify temporal acoustic parameters -- such as spectral tilt, spectral flux, shimmer, etc. -- that are non-differentiable, and we develop a neural network estimator that can accurately predict their time-series values across an utterance. We also model phoneme-specific weights for each feature, as the acoustic parameters are known to show different behavior in different phonemes. We can add this criterion as an auxiliary loss to any model that produces speech, to optimize speech outputs to match the values of clean speech in these features. Experimentally we show that it improves speech enhancement workflows in both time-domain and time-frequency domain, as measured by standard evaluation metrics. We also provide an analysis of phoneme-dependent improvement on acoustic parameters, demonstrating the additional interpretability that our method provides. This analysis can suggest which features are currently the bottleneck for improvement.
Submitted 16 February, 2023;
originally announced February 2023.
-
TAPLoss: A Temporal Acoustic Parameter Loss for Speech Enhancement
Authors:
Yunyang Zeng,
Joseph Konan,
Shuo Han,
David Bick,
Muqiao Yang,
Anurag Kumar,
Shinji Watanabe,
Bhiksha Raj
Abstract:
Speech enhancement models have greatly progressed in recent years, but still show limits in perceptual quality of their speech outputs. We propose an objective for perceptual quality based on temporal acoustic parameters. These are fundamental speech features that play an essential role in various applications, including speaker recognition and paralinguistic analysis. We provide a differentiable estimator for four categories of low-level acoustic descriptors involving: frequency-related parameters, energy or amplitude-related parameters, spectral balance parameters, and temporal features. Unlike prior work that looks at aggregated acoustic parameters or a few categories of acoustic parameters, our temporal acoustic parameter (TAP) loss enables auxiliary optimization and improvement of many fine-grain speech characteristics in enhancement workflows. We show that adding TAPLoss as an auxiliary objective in speech enhancement produces speech with improved perceptual quality and intelligibility. We use data from the Deep Noise Suppression 2020 Challenge to demonstrate that both time-domain models and time-frequency domain models can benefit from our method.
Submitted 15 February, 2023;
originally announced February 2023.
-
GTR-CTRL: Instrument and Genre Conditioning for Guitar-Focused Music Generation with Transformers
Authors:
Pedro Sarmento,
Adarsh Kumar,
Yu-Hua Chen,
CJ Carr,
Zack Zukowski,
Mathieu Barthet
Abstract:
Recently, symbolic music generation with deep learning techniques has witnessed steady improvements. Most works on this topic focus on MIDI representations, but less attention has been paid to symbolic music generation using guitar tablatures (tabs) which can be used to encode multiple instruments. Tabs include information on expressive techniques and fingerings for fretted string instruments in addition to rhythm and pitch. In this work, we use the DadaGP dataset for guitar tab music generation, a corpus of over 26k songs in GuitarPro and token formats. We introduce methods to condition a Transformer-XL deep learning model to generate guitar tabs (GTR-CTRL) based on desired instrumentation (inst-CTRL) and genre (genre-CTRL). Special control tokens are appended at the beginning of each song in the training corpus. We assess the performance of the model with and without conditioning. We propose instrument presence metrics to assess the inst-CTRL model's response to a given instrumentation prompt. We trained a BERT model for downstream genre classification and used it to assess the results obtained with the genre-CTRL model. Statistical analyses evidence significant differences between the conditioned and unconditioned models. Overall, results indicate that the GTR-CTRL methods provide more flexibility and control for guitar-focused symbolic music generation than an unconditioned model.
Submitted 10 February, 2023;
originally announced February 2023.
-
AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis
Authors:
Susan Liang,
Chao Huang,
Yapeng Tian,
Anurag Kumar,
Chenliang Xu
Abstract:
Can machines recording an audio-visual scene produce realistic, matching audio-visual experiences at novel positions and novel view directions? We answer it by studying a new task -- real-world audio-visual scene synthesis -- and a first-of-its-kind NeRF-based approach for multimodal learning. Concretely, given a video recording of an audio-visual scene, the task is to synthesize new videos with spatial audios along arbitrary novel camera trajectories in that scene. We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF, in which we implicitly associate audio generation with the 3D geometry and material properties of a visual environment. Furthermore, we present a coordinate transformation module that expresses a view direction relative to the sound source, enabling the model to learn sound source-centric acoustic fields. To facilitate the study of this new task, we collect a high-quality Real-World Audio-Visual Scene (RWAVS) dataset. We demonstrate the advantages of our method on this real-world dataset and the simulation-based SoundSpaces dataset.
Submitted 16 October, 2023; v1 submitted 3 February, 2023;
originally announced February 2023.
-
Rethinking complex-valued deep neural networks for monaural speech enhancement
Authors:
Haibin Wu,
Ke Tan,
Buye Xu,
Anurag Kumar,
Daniel Wong
Abstract:
Despite multiple efforts made towards adopting complex-valued deep neural networks (DNNs), it remains an open question whether complex-valued DNNs are generally more effective than real-valued DNNs for monaural speech enhancement. This work is devoted to presenting a critical assessment by systematically examining complex-valued DNNs against their real-valued counterparts. Specifically, we investigate complex-valued DNN atomic units, including linear layers, convolutional layers, long short-term memory (LSTM), and gated linear units. By comparing complex- and real-valued versions of fundamental building blocks in the recently developed gated convolutional recurrent network (GCRN), we show how different mechanisms for basic blocks affect the performance. We also find that the use of complex-valued operations hinders the model capacity when the model size is small. In addition, we examine two recent complex-valued DNNs, i.e. deep complex convolutional recurrent network (DCCRN) and deep complex U-Net (DCUNET). Evaluation results show that both DNNs produce identical performance to their real-valued counterparts while requiring much more computation. Based on these comprehensive comparisons, we conclude that complex-valued DNNs do not provide a performance gain over their real-valued counterparts for monaural speech enhancement, and thus are less desirable due to their higher computational costs.
Submitted 11 January, 2023;
originally announced January 2023.