Search | arXiv e-print repository

Low-Resolution Chest X-ray Classification via Knowledge Distillation and Multi-task Learning

Authors: Yasmeena Akhter, Rishabh Ranjan, Richa Singh, Mayank Vatsa

Abstract: This research addresses the challenges of diagnosing chest X-rays (CXRs) at low resolutions, a common limitation in resource-constrained healthcare settings. High-resolution CXR imaging is crucial for identifying small but critical anomalies, such as nodules or opacities. However, when images are downsized for processing in Computer-Aided Diagnosis (CAD) systems, vital spatial details and receptive fields are lost, hampering diagnosis accuracy. To address this, this paper presents the Multilevel Collaborative Attention Knowledge (MLCAK) method. This approach leverages the self-attention mechanism of Vision Transformers (ViT) to transfer critical diagnostic knowledge from high-resolution images to enhance the diagnostic efficacy of low-resolution CXRs. MLCAK incorporates local pathological findings to boost model explainability, enabling more accurate global predictions in a multi-task framework tailored for low-resolution CXR analysis. Our research, utilizing the Vindr CXR dataset, shows a considerable enhancement in the ability to diagnose diseases from low-resolution images (e.g. 28 x 28), suggesting a critical transition from the traditional reliance on high-resolution imaging (e.g. 224 x 224).

Submitted 22 May, 2024; originally announced May 2024.

Comments: IEEE ISBI 2024

arXiv:2405.09101 [pdf, other]

Adaptive Koopman Embedding for Robust Control of Complex Nonlinear Dynamical Systems

Authors: Rajpal Singh, Chandan Kumar Sah, Jishnu Keshavan

Abstract: The discovery of linear embedding is the key to the synthesis of linear control techniques for nonlinear systems. In recent years, while Koopman operator theory has become a prominent approach for learning these linear embeddings through data-driven methods, these algorithms often exhibit limitations in generalizability beyond the distribution captured by training data and are not robust to changes in the nominal system dynamics induced by intrinsic or environmental factors. To overcome these limitations, this study presents an adaptive Koopman architecture capable of responding to the changes in system dynamics online. The proposed framework initially employs an autoencoder-based neural network that utilizes input-output information from the nominal system to learn the corresponding Koopman embedding offline. Subsequently, we augment this nominal Koopman architecture with a feed-forward neural network that learns to modify the nominal dynamics in response to any deviation between the predicted and observed lifted states, leading to improved generalization and robustness to a wide range of uncertainties and disturbances compared to contemporary methods. Extensive tracking control simulations, which are undertaken by integrating the proposed scheme within a Model Predictive Control framework, are used to highlight its robustness against measurement noise, disturbances, and parametric variations in system dynamics.

Submitted 20 May, 2024; v1 submitted 15 May, 2024; originally announced May 2024.

Comments: Corrected the title

arXiv:2405.05937 [pdf, other]

Dynamics of a Towed Cable with Sensor-Array for Underwater Target Motion Analysis

Authors: Rohit Kumar Singh, Subrata Kumar, Shovan Bhaumik

Abstract: During a war situation, many times an underwater target motion analysis (TMA) is performed using bearing-only measurements, obtained from a sensor array, which is towed by an own-ship with the help of a connected cable. It is well known that the own-ship is required to perform a manoeuvre in order to make the system observable and localise the target successfully. During the maneuver, it is important to know the location of the sensor array with respect to the own-ship. This paper develops a dynamic model of a cable-sensor array system to localise the sensor array, which is towed behind a sea-surface vessel. We adopt a lumped-mass approach to represent the towed cable. The discretized cable elements are modelled as an interconnected rigid body, kinematically related to one another. The governing equations are derived by balancing the moments acting on each node. The derived dynamics are solved simultaneously for all the nodes to determine the orientation of the cable and sensor array. The position of the sensor array obtained from this proposed model will further be used by TMA algorithms to enhance the accuracy of the tracking system.

Submitted 9 May, 2024; originally announced May 2024.

arXiv:2405.05676 [pdf, other]

Maximum Correntropy Polynomial Chaos Kalman Filter for Underwater Navigation

Authors: Rohit Kumar Singh, Joydeb Saha, Shovan Bhaumik

Abstract: This paper develops an underwater navigation solution that utilizes a strapdown inertial navigation system (SINS) and fuses a set of auxiliary sensors such as an acoustic positioning system, Doppler velocity log, depth meter, attitude meter, and magnetometer to accurately estimate an underwater vessel's position and orientation. The conventional integrated navigation system assumes Gaussian measurement noise, while in reality, the noises are non-Gaussian, particularly contaminated by heavy-tailed impulsive noises. To address this issue, and to fuse the system model with the acquired sensor measurements efficiently, we develop a square root polynomial chaos Kalman filter based on maximum correntropy criteria. The filter is initialized using acoustic beaconing to accurately locate the initial position of the vehicle. The computational complexity of the proposed filter is calculated in terms of flops count. The proposed method is compared with the existing maximum correntropy sigma point filters in terms of estimation accuracy and computational complexity. The simulation results demonstrate an improved accuracy compared to the conventional deterministic sample point filters.

Submitted 9 May, 2024; originally announced May 2024.
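
Illustrative note on the entry above: the abstract contrasts Gaussian noise assumptions with heavy-tailed impulsive noise and names the maximum correntropy criterion as the remedy. The sketch below is not the paper's square root polynomial chaos Kalman filter. It only shows, under stated assumptions, how a Gaussian-kernel correntropy weight suppresses an impulsive outlier when estimating a scalar location. The function names, the bandwidth value, and the fixed-point iteration are hypothetical choices made for this illustration.

```python
import math


def correntropy_weight(residual: float, bandwidth: float) -> float:
    """Gaussian kernel value exp(-r^2 / (2 * sigma^2)) for one residual."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    return math.exp(-(residual ** 2) / (2.0 * bandwidth ** 2))


def correntropy_location(samples, bandwidth, iterations=20):
    """Fixed-point estimate of the location that maximizes the summed kernel.

    Hypothetical helper: each pass re-weights the samples by how close they
    are to the current estimate, so far-away (impulsive) samples contribute
    almost nothing.
    """
    estimate = sorted(samples)[len(samples) // 2]  # start from a middle sample
    for _ in range(iterations):
        weights = [correntropy_weight(x - estimate, bandwidth) for x in samples]
        estimate = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
    return estimate


if __name__ == "__main__":
    readings = [1.0, 1.1, 0.9, 1.05, 0.95, 50.0]  # last value is an impulsive outlier
    print("plain mean:", sum(readings) / len(readings))
    print("correntropy estimate:", correntropy_location(readings, bandwidth=1.0))
```

The kernel bandwidth controls how aggressively large residuals are discounted; a very large bandwidth makes the estimate approach the ordinary mean.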

arXiv:2403.15248 [pdf, other]

Self-Supervised Backbone Framework for Diverse Agricultural Vision Tasks

Authors: Sudhir Sornapudi, Rajhans Singh

Abstract: Computer vision in agriculture is game-changing with its ability to transform farming into a data-driven, precise, and sustainable industry. Deep learning has empowered agriculture vision to analyze vast, complex visual data, but heavily rely on the availability of large annotated datasets. This remains a bottleneck as manual labeling is error-prone, time-consuming, and expensive. The lack of efficient labeling approaches inspired us to consider self-supervised learning as a paradigm shift, learning meaningful feature representations from raw agricultural image data. In this work, we explore how self-supervised representation learning unlocks the potential applicability to diverse agriculture vision tasks by eliminating the need for large-scale annotated datasets. We propose a lightweight framework utilizing SimCLR, a contrastive learning approach, to pre-train a ResNet-50 backbone on a large, unannotated dataset of real-world agriculture field images. Our experimental analysis and results indicate that the model learns robust features applicable to a broad range of downstream agriculture tasks discussed in the paper. Additionally, the reduced reliance on annotated data makes our approach more cost-effective and accessible, paving the way for broader adoption of computer vision in agriculture.

Submitted 22 March, 2024; originally announced March 2024.

arXiv:2402.15707 [pdf, other]

A Quick Guide to Quantum Communication

Authors: Rohit Singh, Roshan M. Bodile

Abstract: This article provides a quick overview of quantum communication, bringing together several innovative aspects of quantum enabled transmission. We first take a neutral look at the role of quantum communication, presenting its importance for the forthcoming wireless. Then, we summarise the principles and basic mechanisms involved in quantum communication, including quantum entanglement, quantum superposition, and quantum teleportation. Further, we highlight its groundbreaking features, opportunities, challenges and future prospects.

Submitted 23 February, 2024; originally announced February 2024.

arXiv:2402.09585 [pdf, other]

Domain Adaptation for Contrastive Audio-Language Models

Authors: Soham Deshmukh, Rita Singh, Bhiksha Raj

Abstract: Audio-Language Models (ALM) aim to be general-purpose audio models by providing zero-shot capabilities at test time. The zero-shot performance of ALM improves by using suitable text prompts for each domain. The text prompts are usually hand-crafted through an ad-hoc process and lead to a drop in ALM generalization and out-of-distribution performance. Existing approaches to improve domain performance, like few-shot learning or fine-tuning, require access to annotated data and iterations of training. Therefore, we propose a test-time domain adaptation method for ALMs that does not require access to annotations. Our method learns a domain vector by enforcing consistency across augmented views of the testing audio. We extensively evaluate our approach on 12 downstream tasks across domains. With just one example, our domain adaptation method leads to 3.2% (max 8.4%) average zero-shot performance improvement. After adaptation, the model still retains the generalization property of ALMs.

Submitted 14 February, 2024; originally announced February 2024.

arXiv:2402.09244 [pdf, other]

Zero-energy Devices for 6G: Technical Enablers at a Glance

Authors: Onel López, Ritesh Kumar Singh, Dinh-Thuy Phan-Huy, Efstathios Katranaras, Nafiseh Mazloum, Riku Jäntti, Hamza Khan, Osmel Rosabal, Pavlos Alexias, Prasoon Raghuwanshi, David Ruiz-Guirola, Bikramjit Singh, Andreas Höglund, Dung Pham Van, Amirhossein Azarbahram, Jeroen Famaey

Abstract: Low-cost, resource-constrained, maintenance-free, and energy-harvesting (EH) Internet of Things (IoT) devices, referred to as zero-energy devices (ZEDs), are rapidly attracting attention from industry and academia due to their myriad of applications. To date, such devices remain primarily unsupported by modern IoT connectivity solutions due to their intrinsic fabrication, hardware, deployment, and operation limitations, while lacking clarity on their key technical enablers and prospects. Herein, we address this by discussing the main characteristics and enabling technologies of ZEDs within the next generation of mobile networks, specifically focusing on unconventional EH sources, multi-source EH, power management, energy storage solutions, manufacturing material and practices, backscattering, and low-complexity receivers. Moreover, we highlight the need for lightweight and energy-aware computing, communication, and scheduling protocols, while discussing potential approaches related to TinyML, duty cycling, and infrastructure enablers like radio frequency wireless power transfer and wake-up protocols. Challenging aspects and open research directions are identified and discussed in all the cases. Finally, we showcase an experimental ZED proof-of-concept related to ambient cellular backscattering.

Submitted 14 February, 2024; originally announced February 2024.

Comments: 8 pages, 4 Figures

arXiv:2402.00282 [pdf, other]

PAM: Prompting Audio-Language Models for Audio Quality Assessment

Authors: Soham Deshmukh, Dareen Alharthi, Benjamin Elizalde, Hannes Gamper, Mahmoud Al Ismail, Rita Singh, Bhiksha Raj, Huaming Wang

Abstract: While audio quality is a key performance metric for various audio processing tasks, including generative modeling, its objective measurement remains a challenge. Audio-Language Models (ALMs) are pre-trained on audio-text pairs that may contain information about audio quality, the presence of artifacts, or noise. Given an audio input and a text prompt related to quality, an ALM can be used to calculate a similarity score between the two. Here, we exploit this capability and introduce PAM, a no-reference metric for assessing audio quality for different audio processing tasks. Contrary to other "reference-free" metrics, PAM does not require computing embeddings on a reference dataset nor training a task-specific model on a costly set of human listening scores. We extensively evaluate the reliability of PAM against established metrics and human listening scores on four tasks: text-to-audio (TTA), text-to-music generation (TTM), text-to-speech (TTS), and deep noise suppression (DNS). We perform multiple ablation studies with controlled distortions, in-the-wild setups, and prompt choices. Our evaluation shows that PAM correlates well with existing metrics and human listening scores. These results demonstrate the potential of ALMs for computing a general-purpose audio quality metric.

Submitted 31 January, 2024; originally announced February 2024.
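
Illustrative note on the entry above: the abstract says an ALM can score the similarity between an audio input and a quality-related text prompt. The sketch below shows one way such a similarity could be turned into a bounded score. It assumes the audio and prompt embeddings have already been produced by some audio-language model and are passed in as plain lists of floats. Contrasting a "good quality" prompt with a "bad quality" prompt through a two-way softmax is an assumption of this sketch, not a detail stated in the abstract, and all names here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def prompt_quality_score(audio_emb, good_prompt_emb, bad_prompt_emb, scale=1.0):
    """Score in (0, 1): softmax weight of the 'good quality' prompt.

    Hypothetical helper. The embeddings are assumed to come from the same
    audio-language model, so that audio and text share one space.
    """
    s_good = scale * cosine_similarity(audio_emb, good_prompt_emb)
    s_bad = scale * cosine_similarity(audio_emb, bad_prompt_emb)
    shift = max(s_good, s_bad)  # subtract the max for numerical stability
    e_good = math.exp(s_good - shift)
    e_bad = math.exp(s_bad - shift)
    return e_good / (e_good + e_bad)


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real model outputs.
    audio = [0.9, 0.1]
    good_prompt = [1.0, 0.0]
    bad_prompt = [0.0, 1.0]
    print(prompt_quality_score(audio, good_prompt, bad_prompt))
```

A score above 0.5 means the audio embedding sits closer to the "good quality" prompt than to the "bad quality" one; no reference recording is involved at any point.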

arXiv:2401.12803 [pdf, other]

Enhancements for 5G NR PRACH Reception: An AI/ML Approach

Authors: Rohit Singh, Anil Kumar Yerrapragada, Jeeva Keshav S, Radha Krishna Ganti

Abstract: Random Access is an important step in enabling the initial attachment of a User Equipment (UE) to a Base Station (gNB). The UE identifies itself by embedding a Preamble Index (RAPID) in the phase rotation of a known base sequence, which it transmits on the Physical Random Access Channel (PRACH). The signal on the PRACH also enables the estimation of propagation delay, often known as Timing Advance (TA), which is induced by virtue of the UE's position. Traditional receivers estimate the RAPID and TA using correlation-based techniques. This paper presents an alternative receiver approach that uses AI/ML models, wherein two neural networks are proposed, one for the RAPID and one for the TA. Different from other works, these two models can run in parallel as opposed to sequentially. Experiments with both simulated data and over-the-air hardware captures highlight the improved performance of the proposed AI/ML-based techniques compared to conventional correlation methods.

Submitted 12 January, 2024; originally announced January 2024.

arXiv:2310.13817 [pdf, other]

Deep Learning Based Forecasting-Aided State Estimation in Active Distribution Networks

Authors: Malek Alduhaymi, Ravindra Singh, Firdous Ul Nazir, Bikash C. Pal

Abstract: Operating an active distribution network (ADN) in the absence of enough measurements, the presence of distributed energy resources, and poor knowledge of responsive demand behaviour is a huge challenge. This paper introduces systematic modelling of demand response behaviour which is then included in Forecasting Aided State Estimation (FASE) for better control of the network. There are several innovative elements in tuning parameters of FASE-based, demand profiling, and aggregation. The comprehensive case studies for three UK representative demand scenarios in 2023, 2035, and 2050 demonstrated the effectiveness of the proposed approach.

Submitted 20 October, 2023; originally announced October 2023.

arXiv:2310.02298 [pdf, other]

Prompting Audios Using Acoustic Properties For Emotion Representation

Authors: Hira Dhamyal, Benjamin Elizalde, Soham Deshmukh, Huaming Wang, Bhiksha Raj, Rita Singh

Abstract: Emotions lie on a continuum, but current models treat emotions as a finite valued discrete variable. This representation does not capture the diversity in the expression of emotion. To better represent emotions we propose the use of natural language descriptions (or prompts). In this work, we address the challenge of automatically generating these prompts and training a model to better learn emotion representations from audio and prompt pairs. We use acoustic properties that are correlated to emotion like pitch, intensity, speech rate, and articulation rate to automatically generate prompts i.e. 'acoustic prompts'. We use a contrastive learning objective to map speech to their respective acoustic prompts. We evaluate our model on Emotion Audio Retrieval and Speech Emotion Recognition. Our results show that the acoustic prompts significantly improve the model's performance in EAR, in various Precision@K metrics. In SER, we observe a 3.8% relative accuracy improvement on the Ravdess dataset.

Submitted 6 December, 2023; v1 submitted 3 October, 2023; originally announced October 2023.

Comments: arXiv admin note: substantial text overlap with arXiv:2211.07737

arXiv:2310.00706 [pdf, other]

Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech

Authors: Dareen Alharthi, Roshan Sharma, Hira Dhamyal, Soumi Maiti, Bhiksha Raj, Rita Singh

Abstract: Modern speech synthesis systems have improved significantly, with synthetic speech being indistinguishable from real speech. However, efficient and holistic evaluation of synthetic speech still remains a significant challenge. Human evaluation using Mean Opinion Score (MOS) is ideal, but inefficient due to high costs. Therefore, researchers have developed auxiliary automatic metrics like Word Error Rate (WER) to measure intelligibility. Prior works focus on evaluating synthetic speech based on pre-trained speech recognition models, however, this can be limiting since this approach primarily measures speech intelligibility. In this paper, we propose an evaluation technique involving the training of an ASR model on synthetic speech and assessing its performance on real speech. Our main assumption is that by training the ASR model on the synthetic speech, the WER on real speech reflects the similarity between distributions, a broader assessment of synthetic speech quality beyond intelligibility. Our proposed metric demonstrates a strong correlation with both MOS naturalness and MOS intelligibility when compared to SpeechLMScore and MOSNet on three recent Text-to-Speech (TTS) systems: MQTTS, StyleTTS, and YourTTS.

Submitted 1 October, 2023; originally announced October 2023.
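
Illustrative note on the entry above: the proposed evaluation reports Word Error Rate (WER) of a recognizer on real speech. For readers unfamiliar with the metric, the sketch below computes WER in the standard way, as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It assumes whitespace tokenization and no text normalization, which real evaluation pipelines usually add; the function name is chosen for this illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (second "the"): 2 / 6.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.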

arXiv:2309.13544 [pdf]

Related Rhythms: Recommendation System To Discover Music You May Like

Authors: Rahul Singh, Pranav Kanuparthi

Abstract: Machine Learning models are being utilized extensively to drive recommender systems, which is a widely explored topic today. This is especially true of the music industry, where we are witnessing a surge in growth. Besides a large chunk of active users, these systems are fueled by massive amounts of data. These large-scale systems yield applications that aim to provide a better user experience and to keep customers actively engaged. In this paper, a distributed Machine Learning (ML) pipeline is delineated, which is capable of taking a subset of songs as input and producing a new subset of songs identified as being similar to the inputted subset. The publicly accessible Million Songs Dataset (MSD) enables researchers to develop and explore reasonably efficient systems for audio track analysis and recommendations, without having to access a commercialized music platform. The objective of the proposed application is to leverage an ML system trained to optimally recommend songs that a user might like.

Submitted 24 September, 2023; originally announced September 2023.

ACM Class: I.2.6; H.3.3

arXiv:2309.13227 [pdf, other]

Importance of negative sampling in weak label learning

Authors: Ankit Shah, Fuyu Tang, Zelin Ye, Rita Singh, Bhiksha Raj

Abstract: Weak-label learning is a challenging task that requires learning from data "bags" containing positive and negative instances, but only the bag labels are known. The pool of negative instances is usually larger than positive instances, thus making selecting the most informative negative instance critical for performance. Such a selection strategy for negative instances from each bag is an open problem that has not been well studied for weak-label learning. In this paper, we study several sampling strategies that can measure the usefulness of negative instances for weak-label learning and select them accordingly. We test our method on CIFAR-10 and AudioSet datasets and show that it improves the weak-label classification performance and reduces the computational cost compared to random sampling methods. Our work reveals that negative instances are not all equally irrelevant, and selecting them wisely can benefit weak-label learning.

Submitted 22 September, 2023; originally announced September 2023.

arXiv:2309.07372 [pdf, other]

Training Audio Captioning Models without Audio

Authors: Soham Deshmukh, Benjamin Elizalde, Dimitra Emmanouilidou, Bhiksha Raj, Rita Singh, Huaming Wang

Abstract: Automated Audio Captioning (AAC) is the task of generating natural language descriptions given an audio stream. A typical AAC system requires manually curated training data of audio segments and corresponding text caption annotations. The creation of these audio-caption pairs is costly, resulting in general data scarcity for the task. In this work, we address this major limitation and propose an approach to train AAC systems using only text. Our approach leverages the multimodal space of contrastively trained audio-text models, such as CLAP. During training, a decoder generates captions conditioned on the pretrained CLAP text encoder. During inference, the text encoder is replaced with the pretrained CLAP audio encoder. To bridge the modality gap between text and audio embeddings, we propose the use of noise injection or a learnable adapter, during training. We find that the proposed text-only framework performs competitively with state-of-the-art models trained with paired audio, showing that efficient text-to-audio transfer is possible. Finally, we showcase both stylized audio captioning and caption enrichment while training without audio or human-created text captions.

Submitted 13 September, 2023; originally announced September 2023.

arXiv:2308.14190 [pdf, other]

doi 10.59275/j.melba.2024-5d51

Score-Based Generative Models for PET Image Reconstruction

Authors: Imraj RD Singh, Alexander Denker, Riccardo Barbano, Željko Kereta, Bangti Jin, Kris Thielemans, Peter Maass, Simon Arridge

Abstract: Score-based generative models have demonstrated highly promising results for medical image reconstruction tasks in magnetic resonance imaging or computed tomography. However, their application to Positron Emission Tomography (PET) is still largely unexplored. PET image reconstruction involves a variety of challenges, including Poisson noise with high variance and a wide dynamic range. To address these challenges, we propose several PET-specific adaptations of score-based generative models. The proposed framework is developed for both 2D and 3D PET. In addition, we provide an extension to guided reconstruction using magnetic resonance images. We validate the approach through extensive 2D and 3D $\textit{in-silico}$ experiments with a model trained on patient-realistic data without lesions, and evaluate on data without lesions as well as out-of-distribution data with lesions. This demonstrates the proposed method's robustness and significant potential for improved PET reconstruction.

Submitted 23 January, 2024; v1 submitted 27 August, 2023; originally announced August 2023.

Comments: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2024:001

MSC Class: 15A29; 45Q05

ACM Class: I.4.9; J.2; I.2.1

Journal ref: Machine.Learning.for.Biomedical.Imaging. 2 (2024)

arXiv:2307.13953 [pdf, other]

The Hidden Dance of Phonemes and Visage: Unveiling the Enigmatic Link between Phonemes and Facial Features

Authors: Liao Qu, Xianwei Zou, Xiang Li, Yandong Wen, Rita Singh, Bhiksha Raj

Abstract: This work unveils the enigmatic link between phonemes and facial features. Traditional studies on voice-face correlations typically involve using a long period of voice input, including generating face images from voices and reconstructing 3D face meshes from voices. However, in situations like voice-based crimes, the available voice evidence may be short and limited. Additionally, from a physiological perspective, each segment of speech -- phoneme -- corresponds to different types of airflow and movements in the face. Therefore, it is advantageous to discover the hidden link between phonemes and face attributes. In this paper, we propose an analysis pipeline to help us explore the voice-face relationship in a fine-grained manner, i.e., phonemes v.s. facial anthropometric measurements (AM). We build an estimator for each phoneme-AM pair and evaluate the correlation through hypothesis testing. Our results indicate that AMs are more predictable from vowels compared to consonants, particularly with plosives. Additionally, we observe that if a specific AM exhibits more movement during phoneme pronunciation, it is more predictable. Our findings support those in physiology regarding correlation and lay the groundwork for future research on speech-face multimodal learning.

Submitted 26 July, 2023; originally announced July 2023.

Comments: Interspeech 2023

arXiv:2307.13948 [pdf, other]

Rethinking Voice-Face Correlation: A Geometry View

Authors: Xiang Li, Yandong Wen, Muqiao Yang, Jinglu Wang, Rita Singh, Bhiksha Raj

Abstract: Previous works on voice-face matching and voice-guided face synthesis demonstrate strong correlations between voice and face, but mainly rely on coarse semantic cues such as gender, age, and emotion. In this paper, we aim to investigate the capability of reconstructing the 3D facial shape from voice from a geometry perspective without any semantic information. We propose a voice-anthropometric measurement (AM)-face paradigm, which identifies predictable facial AMs from the voice and uses them to guide 3D face reconstruction. By leveraging AMs as a proxy to link the voice and face geometry, we can eliminate the influence of unpredictable AMs and make the face geometry tractable. Our approach is evaluated on our proposed dataset with ground-truth 3D face scans and corresponding voice recordings, and we find significant correlations between voice and specific parts of the face geometry, such as the nasal cavity and cranium. Our work offers a new perspective on voice-face correlation and can serve as a good empirical study for anthropometry science.

Submitted 26 July, 2023; originally announced July 2023.

Comments: ACM Multimedia 2023

arXiv:2307.08217 [pdf, other]

BASS: Block-wise Adaptation for Speech Summarization

Authors: Roshan Sharma, Kenneth Zheng, Siddhant Arora, Shinji Watanabe, Rita Singh, Bhiksha Raj

Abstract: End-to-end speech summarization has been shown to improve performance over cascade baselines. However, such models are difficult to train on very large inputs (dozens of minutes or hours) owing to compute restrictions and are hence trained with truncated model inputs. Truncation leads to poorer models, and a solution to this problem rests in block-wise modeling, i.e., processing a portion of the input frames at a time. In this paper, we develop a method that allows one to train summarization models on very long sequences in an incremental manner. Speech summarization is realized as a streaming process, where hypothesis summaries are updated every block based on new acoustic information. We devise and test strategies to pass semantic context across the blocks. Experiments on the How2 dataset demonstrate that the proposed block-wise training method improves by 3 points absolute on ROUGE-L over a truncated input baseline.

Submitted 16 July, 2023; originally announced July 2023.

Comments: Accepted at Interspeech 2023

arXiv:2307.06669 [pdf, other]

Uncovering the Deceptions: An Analysis on Audio Spoofing Detection and Future Prospects

Authors: Rishabh Ranjan, Mayank Vatsa, Richa Singh

Abstract: Audio has become an increasingly crucial biometric modality due to its ability to provide an intuitive way for humans to interact with machines. It is currently being used for a range of applications, from person authentication to banking to virtual assistants. Research has shown that these systems are also susceptible to spoofing and attacks. Therefore, protecting audio processing systems against fraudulent activities, such as identity theft, financial fraud, and spreading misinformation, is of paramount importance. This paper reviews the current state-of-the-art techniques for detecting audio spoofing and discusses the current challenges along with open research problems. The paper further highlights the importance of considering the ethical and privacy implications of audio spoofing detection systems. Lastly, the work aims to accentuate the need for building more robust and generalizable methods, the integration of automatic speaker verification and countermeasure systems, and better evaluation protocols.

Submitted 13 July, 2023; originally announced July 2023.

Comments: Accepted in IJCAI 2023

arXiv:2306.05329 [pdf]

Movement Optimization of Robotic Arms for Energy and Time Reduction using Evolutionary Algorithms

Authors: Abolfazl Akbari, Saeed Mozaffari, Rajmeet Singh, Majid Ahmadi, Shahpour Alirezaee

Abstract: Trajectory optimization of a robot manipulator consists of both optimization of the robot movement as well as optimization of the robot end-effector path. This paper aims to find optimum movement parameters including movement type, speed, and acceleration to minimize robot energy. Trajectory optimization by minimizing the energy would increase the longevity of robotic manipulators. We utilized the particle swarm optimization method to find the movement parameters leading to minimum energy consumption. The effectiveness of the proposed method is demonstrated on different trajectories. Experimental results show that 49% efficiency was obtained using a UR5 robotic arm.

Submitted 8 June, 2023; originally announced June 2023.

arXiv:2305.16974 [pdf, other]

Finite Time Regret Bounds for Minimum Variance Control of Autoregressive Systems with Exogenous Inputs

Authors: Rahul Singh, Akshay Mete, Avik Kar, P. R. Kumar

Abstract: Minimum variance controllers have been employed in a wide range of industrial applications. A key challenge experienced by many adaptive controllers is their poor empirical performance in the initial stages of learning. In this paper, we address the problem of initializing them so that they provide acceptable transients, and also provide an accompanying finite-time regret analysis, for adaptive minimum variance control of an auto-regressive system with exogenous inputs (ARX). Following [3], we consider a modified version of the Certainty Equivalence (CE) adaptive controller, which we call PIECE, that utilizes probing inputs for exploration. We show that it has a $C \log T$ bound on the regret after $T$ time-steps for bounded noise, and $C\log^2 T$ in the case of sub-Gaussian noise. The simulation results demonstrate the advantage of PIECE over the algorithm proposed in [3] as well as the standard Certainty Equivalence controller especially in the initial learning phase. To the best of our knowledge, this is the first work that provides finite-time regret bounds for an adaptive minimum variance controller.

Submitted 26 May, 2023; originally announced May 2023.

arXiv:2305.11834 [pdf, other]

Pengi: An Audio Language Model for Audio Tasks

Authors: Soham Deshmukh, Benjamin Elizalde, Rita Singh, Huaming Wang

Abstract: In the domain of audio processing, Transfer Learning has facilitated the rise of Self-Supervised Learning and Zero-Shot Learning techniques. These approaches have led to the development of versatile models capable of tackling a wide array of tasks, while delivering state-of-the-art performance. However, current models inherently lack the capacity to produce the requisite language for open-ended tasks, such as Audio Captioning or Audio Question & Answering. We introduce Pengi, a novel Audio Language Model that leverages Transfer Learning by framing all audio tasks as text-generation tasks. It takes an audio recording and text as input and generates free-form text as output. The input audio is represented as a sequence of continuous embeddings by an audio encoder. A text encoder does the same for the corresponding text input. Both sequences are combined as a prefix to prompt a pre-trained frozen language model. The unified architecture of Pengi enables open-ended tasks and close-ended tasks without any additional fine-tuning or task-specific extensions. When evaluated on 22 downstream tasks, our approach yields state-of-the-art performance in several of them. Our results show that connecting language models with audio models is a major step towards general-purpose audio understanding.

Submitted 18 January, 2024; v1 submitted 19 May, 2023; originally announced May 2023.

Comments: Accepted at NeurIPS 2023. The manuscript is updated with additional experiments suggested by reviewers

arXiv:2303.17660 [pdf, other]

Randomness assisted in-line holography with deep learning

Authors: Manisha, Aditya Chandra Mandal, Mohit Rathor, Zeev Zalevsky, Rakesh Kumar Singh

Abstract: We propose and demonstrate a holographic imaging scheme exploiting random illuminations for recording a hologram and then applying numerical reconstruction and twin removal. We use an in-line holographic geometry to record the hologram in terms of the second-order correlation and apply the numerical approach to reconstruct the recorded hologram. The twin image issue of the in-line holographic scheme is resolved by an unsupervised deep learning (DL) based method using an auto-encoder scheme. This strategy helps to reconstruct high-quality quantitative images in comparison to the conventional holography where the hologram is recorded in the intensity rather than the second-order intensity correlation. Experimental results are presented for two objects, and a comparison of the reconstruction quality is given between the conventional inline holography and the one obtained with the proposed technique.

Submitted 30 March, 2023; originally announced March 2023.

Comments: 10 pages, 7 figures

arXiv:2302.07476 [pdf, other]

Indexed Multiple Access with Reconfigurable Intelligent Surfaces: The Reflection Tuning Potential

Authors: Rohit Singh, Aryan Kaushik, Wonjae Shin, George C. Alexandropoulos, Mesut Toka, Marco Di Renzo

Abstract: Indexed modulation (IM) is an evolving technique that has become popular due to its ability of parallel data communication over distinct combinations of transmission entities. In this article, we first provide a comprehensive survey of IM-enabled multiple access (MA) techniques, emphasizing the shortcomings of existing non-indexed MA schemes. Theoretical comparisons are presented to show how the notion of indexing eliminates the limitations of non-indexed solutions. We also discuss the benefits that the utilization of a reconfigurable intelligent surface (RIS) can offer when deployed as an indexing entity. In particular, we propose an RIS-indexed multiple access (RIMA) transmission scheme that utilizes dynamic phase tuning to embed multi-user information over a single carrier. The performance of the proposed RIMA is assessed in light of simulation results that confirm its performance gains. The article further includes a list of relevant open technical issues and research directions.

Submitted 15 February, 2023; originally announced February 2023.

Comments: 7 pages, 5 figures, 1 table

arXiv:2302.07375 [pdf]

The Role of Physical Layer Security in Satellite-Based Networks

Authors: R. Singh, I. Ahmad, J. Huusko

Abstract: In the coming years, 6G will revolutionize the world with a large amount of bandwidth, high data rates, and extensive coverage in remote and rural areas. These goals can only be achieved by integrating terrestrial networks with non-terrestrial networks. On the other hand, these advancements are raising more concerns than other wireless links about malicious attacks on satellite-terrestrial links due to their openness. Over the years, physical layer security (PLS) has emerged as a good candidate to deal with security threats by exploring the randomness of wireless channels. In this direction, this paper reviews how PLS methods are implemented in satellite communications. Firstly, we discuss the ongoing research on satellite-based networks by highlighting the key points in the literature. Then, we revisit the research activities on PLS in satellite-based networks by categorizing the different system architectures. Finally, we highlight research directions and opportunities to leverage the PLS in future satellite-based networks.

Submitted 14 February, 2023; originally announced February 2023.

arXiv:2301.07853 [pdf]

DECISIVE Benchmarking Data Report: sUAS Performance Results from Phase I

Authors: Adam Norton, Reza Ahmadzadeh, Kshitij Jerath, Paul Robinette, Jay Weitzen, Thanuka Wickramarathne, Holly Yanco, Minseop Choi, Ryan Donald, Brendan Donoghue, Christian Dumas, Peter Gavriel, Alden Giedraitis, Brendan Hertel, Jack Houle, Nathan Letteri, Edwin Meriaux, Zahra Rezaei Khavas, Rakshith Singh, Gregg Willcox, Naye Yoni

Abstract: This report reviews all results derived from performance benchmarking conducted during Phase I of the Development and Execution of Comprehensive and Integrated Subterranean Intelligent Vehicle Evaluations (DECISIVE) project by the University of Massachusetts Lowell, using the test methods specified in the DECISIVE Test Methods Handbook v1.1 for evaluating small unmanned aerial systems (sUAS) performance in subterranean and constrained indoor environments, spanning communications, field readiness, interface, obstacle avoidance, navigation, mapping, autonomy, trust, and situation awareness. Using those 20 test methods, over 230 tests were conducted across 8 sUAS platforms: Cleo Robotics Dronut X1P (P = prototype), FLIR Black Hornet PRS, Flyability Elios 2 GOV, Lumenier Nighthawk V3, Parrot ANAFI USA GOV, Skydio X2D, Teal Golden Eagle, and Vantage Robotics Vesper. Best in class criteria is specified for each applicable test method and the sUAS that match this criteria are named for each test method, including a high-level executive summary of their performance.

Submitted 20 January, 2023; v1 submitted 18 January, 2023; originally announced January 2023.

Comments: Approved for public release: PAO #PR2023_74172; arXiv admin note: substantial text overlap with arXiv:2211.01801

arXiv:2211.08367 [pdf, other]

FlowGrad: Using Motion for Visual Sound Source Localization

Authors: Rajsuryan Singh, Pablo Zinemanas, Xavier Serra, Juan Pablo Bello, Magdalena Fuentes

Abstract: Most recent work in visual sound source localization relies on semantic audio-visual representations learned in a self-supervised manner, and by design excludes temporal information present in videos. While it proves to be effective for widely used benchmark datasets, the method falls short for challenging scenarios like urban traffic. This work introduces temporal context into the state-of-the-art methods for sound source localization in urban scenes using optical flow as a means to encode motion information. An analysis of the strengths and weaknesses of our methods helps us better understand the problem of visual sound source localization and sheds light on open challenges for audio-visual scene understanding.

Submitted 14 April, 2023; v1 submitted 15 November, 2022; originally announced November 2022.

Comments: Accepted in ICASSP 2023

arXiv:2211.07737 [pdf, other]

Describing emotions with acoustic property prompts for speech emotion recognition

Authors: Hira Dhamyal, Benjamin Elizalde, Soham Deshmukh, Huaming Wang, Bhiksha Raj, Rita Singh

Abstract: Emotions lie on a broad continuum and treating emotions as a discrete number of classes limits the ability of a model to capture the nuances in the continuum. The challenge is how to describe the nuances of emotions and how to enable a model to learn the descriptions. In this work, we devise a method to automatically create a description (or prompt) for a given audio by computing acoustic properties, such as pitch, loudness, speech rate, and articulation rate. We pair a prompt with its corresponding audio using 5 different emotion datasets. We trained a neural network model using these audio-text pairs. Then, we evaluate the model using one more dataset. We investigate how the model can learn to associate the audio with the descriptions, resulting in performance improvement of Speech Emotion Recognition and Speech Audio Retrieval. We expect our findings to motivate research describing the broad continuum of emotion.

Submitted 14 November, 2022; originally announced November 2022.

arXiv:2211.02005 [pdf, other]

Robust Dependence Measure using RKHS based Uncertainty Moments and Optimal Transport

Authors: Rishabh Singh, Jose C. Principe

Abstract: Reliable measurement of dependence between variables is essential in many applications of statistics and machine learning. Current approaches for dependence estimation, especially density-based approaches, lack in precision, robustness and/or interpretability (in terms of the type of dependence being estimated). We propose a two-step approach for dependence quantification between random variables: 1) We first decompose the probability density functions (PDF) of the variables involved in terms of multiple local moments of uncertainty that systematically and precisely identify the different regions of the PDF (with special emphasis on the tail-regions). 2) We then compute an optimal transport map to measure the geometric similarity between the corresponding sets of decomposed local uncertainty moments of the variables. Dependence is then determined by the degree of one-to-one correspondence between the respective uncertainty moments of the variables in the optimal transport map. We utilize a recently introduced Gaussian reproducing kernel Hilbert space (RKHS) based framework for multi-moment uncertainty decomposition of the variables. Being based on the Gaussian RKHS, our approach is robust towards outliers and monotone transformations of data, while the multiple moments of uncertainty provide high resolution and interpretability of the type of dependence being quantified. We support these claims through some preliminary results using simulated data.

Submitted 3 November, 2022; originally announced November 2022.

arXiv:2211.01999 [pdf, other]

Quantifying Model Uncertainty for Semantic Segmentation using Operators in the RKHS

Authors: Rishabh Singh, Jose C. Principe

Abstract: Deep learning models for semantic segmentation are prone to poor performance in real-world applications due to the highly challenging nature of the task. Model uncertainty quantification (UQ) is one way to address this issue of lack of model trustworthiness by enabling the practitioner to know how much to trust a segmentation output. Current UQ methods in this application domain are mainly restricted to Bayesian based methods which are computationally expensive and are only able to extract central moments of uncertainty thereby limiting the quality of their uncertainty estimates. We present a simple framework for high-resolution predictive uncertainty quantification of semantic segmentation models that leverages a multi-moment functional definition of uncertainty associated with the model's feature space in the reproducing kernel Hilbert space (RKHS). The multiple uncertainty functionals extracted from this framework are defined by the local density dynamics of the model's feature space and hence automatically align themselves at the tail-regions of the intrinsic probability density function of the feature space (where uncertainty is the highest) in such a way that the successively higher order moments quantify the more uncertain regions. This leads to a significantly more accurate view of model uncertainty than conventional Bayesian methods.
Moreover, the extraction of such moments is done in a single-shot computation making it much faster than Bayesian and ensemble approaches (that involve a high number of forward stochastic passes of the model to quantify its uncertainty). We demonstrate these advantages through experimental evaluations of our framework implemented over four different state-of-the-art model architectures that are trained and evaluated on two benchmark road-scene segmentation datasets (Camvid and Cityscapes).

Submitted 3 November, 2022; originally announced November 2022.

arXiv:2211.01801 [pdf]

DECISIVE Test Methods Handbook: Test Methods for Evaluating sUAS in Subterranean and Constrained Indoor Environments, Version 1.1

Authors: Adam Norton, Reza Ahmadzadeh, Kshitij Jerath, Paul Robinette, Jay Weitzen, Thanuka Wickramarathne, Holly Yanco, Minseop Choi, Ryan Donald, Brendan Donoghue, Christian Dumas, Peter Gavriel, Alden Giedraitis, Brendan Hertel, Jack Houle, Nathan Letteri, Edwin Meriaux, Zahra Rezaei Khavas, Rakshith Singh, Gregg Willcox, Naye Yoni

Abstract: This handbook outlines all test methods developed under the Development and Execution of Comprehensive and Integrated Subterranean Intelligent Vehicle Evaluations (DECISIVE) project by the University of Massachusetts Lowell for evaluating small unmanned aerial systems (sUAS) performance in subterranean and constrained indoor environments, spanning communications, field readiness, interface, obstacle avoidance, navigation, mapping, autonomy, trust, and situation awareness. For sUAS deployment in subterranean and constrained indoor environments, this puts forth two assumptions about applicable sUAS to be evaluated using these test methods: (1) able to operate without access to GPS signal, and (2) width from prop top to prop tip does not exceed 91 cm (36 in) wide (i.e., can physically fit through a typical doorway, although successful navigation through is not guaranteed). All test methods are specified using a common format: Purpose, Summary of Test Method, Apparatus and Artifacts, Equipment, Metrics, Procedure, and Example Data. All test methods are designed to be run in real-world environments (e.g., MOUT sites) or using fabricated apparatuses (e.g., test bays built from wood, or contained inside of one or more shipping containers).

Submitted 20 January, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

Comments: Approved for public release: PAO #PR2022_47058

arXiv:2210.16642 [pdf, other]

Unifying the Discrete and Continuous Emotion labels for Speech Emotion Recognition

Authors: Roshan Sharma, Hira Dhamyal, Bhiksha Raj, Rita Singh

Abstract: Traditionally, in paralinguistic analysis for emotion detection from speech, emotions have been identified with discrete or dimensional (continuous-valued) labels. Accordingly, models that have been proposed for emotion detection use one or the other of these label types. However, psychologists like Russell and Plutchik have proposed theories and models that unite these views, maintaining that these representations have shared and complementary information. This paper is an attempt to validate these viewpoints computationally. To this end, we propose a model to jointly predict continuous and discrete emotional attributes and show how the relationship between these can be utilized to improve the robustness and performance of emotion recognition tasks. Our approach comprises multi-task and hierarchical multi-task learning frameworks that jointly model the relationships between continuous-valued and discrete emotion labels. Experimental results on two widely used datasets (IEMOCAP and MSPPodcast) for speech-based emotion recognition show that our model results in statistically significant improvements in performance over strong baselines with non-unified approaches. We also demonstrate that using one type of label (discrete or continuous-valued) for training improves recognition performance in tasks that use the other type of label. Experimental results and reasoning for this approach (called the mismatched training approach) are also presented.

Submitted 29 October, 2022; originally announced October 2022.

Comments: Under Review at ICASSP 2023

arXiv:2206.12568 [pdf, other]

Self-supervision and Learnable STRFs for Age, Emotion, and Country Prediction

Authors: Roshan Sharma, Tyler Vuong, Mark Lindsey, Hira Dhamyal, Rita Singh, Bhiksha Raj

Abstract: This work presents a multitask approach to the simultaneous estimation of age, country of origin, and emotion given vocal burst audio for the 2022 ICML Expressive Vocalizations Challenge ExVo-MultiTask track. The method of choice utilized a combination of spectro-temporal modulation and self-supervised features, followed by an encoder-decoder network organized in a multitask paradigm. We evaluate the complementarity between the tasks posed by examining independent task-specific and joint models, and explore the relative strengths of different feature sets. We also introduce a simple score fusion mechanism to leverage the complementarity of different feature sets for this task. We find that robust data preprocessing in conjunction with score fusion over spectro-temporal receptive field and HuBERT models achieved our best ExVo-MultiTask test score of 0.412.

Submitted 25 June, 2022; originally announced June 2022.

Journal ref: Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022

arXiv:2206.08826 [pdf, other]

doi 10.1093/jamia/ocac168

Multimodal Attention-based Deep Learning for Alzheimer's Disease Diagnosis

Authors: Michal Golovanevsky, Carsten Eickhoff, Ritambhara Singh

Abstract: Alzheimer's Disease (AD) is the most common neurodegenerative disorder with one of the most complex pathogeneses, making effective and clinically actionable decision support difficult. The objective of this study was to develop a novel multimodal deep learning framework to aid medical professionals in AD diagnosis. We present a Multimodal Alzheimer's Disease Diagnosis framework (MADDi) to accurately detect the presence of AD and mild cognitive impairment (MCI) from imaging, genetic, and clinical data. MADDi is novel in that we use cross-modal attention, which captures interactions between modalities - a method not previously explored in this domain. We perform multi-class classification, a challenging task considering the strong similarities between MCI and AD. We compare with previous state-of-the-art models, evaluate the importance of attention, and examine the contribution of each modality to the model's performance. MADDi classifies MCI, AD, and controls with 96.88% accuracy on a held-out test set. When examining the contribution of different attention schemes, we found that the combination of cross-modal attention with self-attention performed the best, and no attention layers in the model performed the worst, with a 7.9% difference in F1-Scores. Our experiments underlined the importance of structured clinical data to help machine learning models contextualize and interpret the remaining modalities.
Extensive ablation studies showed that any multimodal mixture of input features without access to structured clinical information suffered marked performance losses. This study demonstrates the merit of combining multiple input modalities via cross-modal attention to deliver highly accurate AD diagnostic decision support.

Submitted 23 September, 2022; v1 submitted 17 June, 2022; originally announced June 2022.

Comments: 11 pages, 5 figures

Journal ref: Journal of the American Medical Informatics Association, 2022; ocac168

arXiv:2205.09677 [pdf, other]

Reconstructing complex field through opaque scattering layer with structured light illumination

Authors: Aditya Chandra Mandal, Manisha, Abhijeet Phatak, Zeev Zalevsky, Rakesh Kumar Singh

Abstract: The wavefront is scrambled when coherent light propagates through a random scattering medium, which makes direct use of conventional optical methods ineffective. In this paper, we propose and demonstrate structured light illumination for imaging through an opaque scattering layer. The proposed technique is reference-free and capable of recovering the complex field from the intensities of the speckle patterns. This is realized by making use of phase-shifting in the structured light illumination and applying spatial averaging of the speckle pattern in the intensity correlation measurement. An experimental design is presented, and simulated results based on the experimental design are shown to demonstrate imaging of different complex-valued objects through a scattering layer.

Submitted 19 May, 2022; originally announced May 2022.

Comments: 23 pages, 7 figures

arXiv:2204.04802 [pdf, other]

On the pragmatism of using binary classifiers over data intensive neural network classifiers for detection of COVID-19 from voice

Authors: Ankit Shah, Hira Dhamyal, Yang Gao, Daniel Arancibia, Mario Arancibia, Bhiksha Raj, Rita Singh

Abstract: Lately, there has been a global effort by multiple research groups to detect COVID-19 from voice. Different researchers use different kinds of information from the voice signal to achieve this. Various types of phonated sounds and the sound of cough and breath have all been used with varying degrees of success in automated voice-based COVID-19 detection apps. In this paper, we show that detecting COVID-19 from voice does not require custom-made non-standard features or complicated neural network classifiers; rather, it can be successfully done with just standard features and simple binary classifiers. In fact, we show that the latter are not only more accurate and interpretable but also more computationally efficient in that they can be run locally on small devices. We demonstrate this on a human-curated dataset of over 1000 subjects, collected and calibrated in clinical settings.

Submitted 25 October, 2022; v1 submitted 10 April, 2022; originally announced April 2022.

Comments: Submitted to ICASSP 2022

arXiv:2203.11725 [pdf, other]

Unsupervised Anomaly Detection in Medical Images with a Memory-augmented Multi-level Cross-attentional Masked Autoencoder

Authors: Yu Tian, Guansong Pang, Yuyuan Liu, Chong Wang, Yuanhong Chen, Fengbei Liu, Rajvinder Singh, Johan W Verjans, Mengyu Wang, Gustavo Carneiro

Abstract: Unsupervised anomaly detection (UAD) aims to find anomalous images by optimising a detector using a training set that contains only normal images. UAD approaches can be based on reconstruction methods, self-supervised approaches, and Imagenet pre-trained models. Reconstruction methods, which detect anomalies from image reconstruction errors, are advantageous because they do not rely on the design of problem-specific pretext tasks needed by self-supervised approaches, and on the unreliable translation of models pre-trained from non-medical datasets. However, reconstruction methods may fail because they can have low reconstruction errors even for anomalous images. In this paper, we introduce a new reconstruction-based UAD approach that addresses this low-reconstruction error issue for anomalous images. Our UAD approach, the memory-augmented multi-level cross-attentional masked autoencoder (MemMC-MAE), is a transformer-based approach, consisting of a novel memory-augmented self-attention operator for the encoder and a new multi-level cross-attention operator for the decoder. MemMC-MAE masks large parts of the input image during its reconstruction, reducing the risk that it will produce low reconstruction errors because anomalies are likely to be masked and cannot be reconstructed. However, when the anomaly is not masked, then the normal patterns stored in the encoder's memory combined with the decoder's multi-level cross attention will constrain the accurate reconstruction of the anomaly. We show that our method achieves SOTA anomaly detection and localisation on colonoscopy, pneumonia, and covid-19 chest x-ray datasets.

Submitted 21 August, 2023; v1 submitted 22 March, 2022; originally announced March 2022.

Comments: Accepted to MICCAI MLMI2023

arXiv:2201.10542 [pdf, other]

Augmented RBMLE-UCB Approach for Adaptive Control of Linear Quadratic Systems

Authors: Akshay Mete, Rahul Singh, P. R. Kumar

Abstract: We consider the problem of controlling an unknown stochastic linear system with quadratic costs - called the adaptive LQ control problem. We re-examine an approach called "Reward Biased Maximum Likelihood Estimate" (RBMLE) that was proposed more than forty years ago, and which predates the "Upper Confidence Bound" (UCB) method as well as the definition of "regret" for bandit problems. It simply added a term favoring parameters with larger rewards to the criterion for parameter estimation. We show how the RBMLE and UCB methods can be reconciled, and thereby propose an Augmented RBMLE-UCB algorithm that combines the penalty of the RBMLE method with the constraints of the UCB method, uniting the two approaches to optimism in the face of uncertainty. We establish that theoretically, this method retains $\tilde{\mathcal{O}}(\sqrt{T})$ regret, the best-known so far. We further compare the empirical performance of the proposed Augmented RBMLE-UCB and the standard RBMLE (without the augmentation) with UCB, Thompson Sampling, Input Perturbation, Randomized Certainty Equivalence and StabL on many real-world examples including flight control of Boeing 747 and Unmanned Aerial Vehicle. We perform extensive simulation studies showing that the Augmented RBMLE consistently outperforms UCB, Thompson Sampling and StabL by a huge margin, while it is marginally better than Input Perturbation and moderately better than Randomized Certainty Equivalence.

Submitted 24 March, 2023; v1 submitted 25 January, 2022; originally announced January 2022.

Comments: 36th Conference on Neural Information Processing Systems (NeurIPS 2022). https://openreview.net/forum?id=7pNV4PCjbQy

arXiv:2112.07102 [pdf, other]

COVID-19 Pneumonia and Influenza Pneumonia Detection Using Convolutional Neural Networks

Authors: Julianna Antonchuk, Benjamin Prescott, Philip Melanchthon, Robin Singh

Abstract: In the research, we developed a computer vision solution to support diagnostic radiology in differentiating between COVID-19 pneumonia, influenza virus pneumonia, and normal biomarkers. The chest radiograph appearance of COVID-19 pneumonia is thought to be nonspecific, having presented a challenge to identify an optimal architecture of a convolutional neural network (CNN) that would classify with a high sensitivity among the pulmonary inflammation features of COVID-19 and non-COVID-19 types of pneumonia. Rahman (2021) states that COVID-19 radiography images suffer from unavailability and quality issues, impacting the diagnostic process and affecting the accuracy of the deep learning detection models. A significant scarcity of COVID-19 radiography images introduced an imbalance in data, motivating us to use over-sampling techniques. In the study, we include an extensive set of X-ray imaging of human lungs (CXR) with COVID-19 pneumonia, influenza virus pneumonia, and normal biomarkers to achieve an extensible and accurate CNN model. In the experimentation phase of the research, we evaluated a variety of convolutional network architectures, selecting a sequential convolutional network with two traditional convolutional layers and two pooling layers with maximum function. In its classification performance, the best performing model demonstrated a validation accuracy of 93% and an F1 score of 0.95. We chose the Azure Machine Learning service to perform network experimentation and solution deployment. The auto-scaling compute clusters offered a significant time reduction in network training. We would like to see scientists across fields of artificial intelligence and human biology collaborating and expanding on the proposed solution to provide rapid and comprehensive diagnostics, effectively mitigating the spread of the virus.

Submitted 13 December, 2021; originally announced December 2021.

Comments: for associated Azure ML notebook code, see https://github.com/bcprescott/MSDS/tree/main/Capstone_COVID19/code/AML

arXiv:2110.08820 [pdf, other]

On-board Fault Diagnosis of a Laboratory Mini SR-30 Gas Turbine Engine

Authors: Richa Singh

Abstract: Inspired by recent progress in machine learning, a data-driven fault diagnosis and isolation (FDI) scheme is explicitly developed for failure in the fuel supply system and sensor measurements of the laboratory gas turbine system. A passive approach of fault diagnosis is implemented where a model is trained using machine learning classifiers to detect a given set of fault scenarios in real-time on which it is trained. Towards the end, a comparative study is presented for well-known classification techniques, namely Support vector classifier, linear discriminant analysis, K-neighbor, and decision trees. Several simulation studies were carried out to demonstrate and illustrate the proposed fault diagnosis scheme's advantages, capabilities, and performance.

Submitted 19 October, 2021; v1 submitted 17 October, 2021; originally announced October 2021.

arXiv:2110.04800 [pdf, other]

Self-Supervised 3D Face Reconstruction via Conditional Estimation

Authors: Yandong Wen, Weiyang Liu, Bhiksha Raj, Rita Singh

Abstract: We present a conditional estimation (CEST) framework to learn 3D facial parameters from 2D single-view images by self-supervised training from videos. CEST is based on the process of analysis by synthesis, where the 3D facial parameters (shape, reflectance, viewpoint, and illumination) are estimated from the face image, and then recombined to reconstruct the 2D face image. In order to learn semantically meaningful 3D facial parameters without explicit access to their labels, CEST couples the estimation of different 3D facial parameters by taking their statistical dependency into account. Specifically, the estimation of any 3D facial parameter is not only conditioned on the given image, but also on the facial parameters that have already been derived. Moreover, the reflectance symmetry and consistency among the video frames are adopted to improve the disentanglement of facial parameters. Together with a novel strategy for incorporating the reflectance symmetry and consistency, CEST can be efficiently trained with in-the-wild video clips. Both qualitative and quantitative experiments demonstrate the effectiveness of CEST.

Submitted 10 October, 2021; originally announced October 2021.

Comments: ICCV 2021 (15 pages)

arXiv:2110.04678 [pdf, other]

An Overview of Techniques for Biomarker Discovery in Voice Signal

Authors: Rita Singh, Ankit Shah, Hira Dhamyal

Abstract: This paper reflects on the effect of several categories of medical conditions on human voice, focusing on those that may be hypothesized to have effects on voice, but for which the changes themselves may be subtle enough to have eluded observation in standard analytical examinations of the voice signal. It presents three categories of techniques that can potentially uncover such elusive biomarkers and allow them to be measured and used for predictive and diagnostic purposes. These approaches include proxy techniques, model-based analytical techniques and data-driven AI techniques.

Submitted 9 October, 2021; originally announced October 2021.

Comments: Last two authors contributed equally to the paper

arXiv:2109.05580 [pdf, other]

doi 10.1007/978-3-031-08999-2_30

A Joint Graph and Image Convolution Network for Automatic Brain Tumor Segmentation

Authors: Camillo Saueressig, Adam Berkley, Reshma Munbodh, Ritambhara Singh

Abstract: We present a joint graph convolution-image convolution neural network as our submission to the Brain Tumor Segmentation (BraTS) 2021 challenge. We model each brain as a graph composed of distinct image regions, which is initially segmented by a graph neural network (GNN). Subsequently, the tumorous volume identified by the GNN is further refined by a simple (voxel) convolutional neural network (CNN), which produces the final segmentation. This approach captures both global brain feature interactions via the graphical representation and local image details through the use of convolutional filters. We find that the GNN component by itself can effectively identify and segment the brain tumors. The addition of the CNN further improves the median performance of the model by 2 percent across all metrics evaluated. On the validation set, our joint GNN-CNN model achieves mean Dice scores of 0.89, 0.81, 0.73 and mean Hausdorff distances (95th percentile) of 6.8, 12.6, 28.2 mm on the whole tumor, core tumor, and enhancing tumor, respectively.

Submitted 30 July, 2022; v1 submitted 12 September, 2021; originally announced September 2021.

Comments: 9 pages, 3 figures, submitted to BrainLes Workshop (MICCAI 2021) as part of BraTS2021 challenge
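
For readers unfamiliar with the Dice score quoted in the abstract above, the following is a minimal sketch of how the metric is commonly computed for binary segmentation masks. It is not taken from the authors' code: the function name `dice_score`, the flat 0/1 list representation, and the convention for two empty masks are all assumptions made for illustration.

```python
def dice_score(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks given as 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # Overlap: positions where both masks are 1.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

A score of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they share no foreground voxels.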

arXiv:2109.01303 [pdf, other]

Self-supervised Pseudo Multi-class Pre-training for Unsupervised Anomaly Detection and Segmentation in Medical Images

Authors: Yu Tian, Fengbei Liu, Guansong Pang, Yuanhong Chen, Yuyuan Liu, Johan W. Verjans, Rajvinder Singh, Gustavo Carneiro

Abstract: Unsupervised anomaly detection (UAD) methods are trained with normal (or healthy) images only, but during testing, they are able to classify normal and abnormal (or disease) images. UAD is an important medical image analysis (MIA) method to be applied in disease screening problems because the training sets available for those problems usually contain only normal images. However, the exclusive reliance on normal images may result in the learning of ineffective low-dimensional image representations that are not sensitive enough to detect and segment unseen abnormal lesions of varying size, appearance, and shape. Pre-training UAD methods with self-supervised learning, based on computer vision techniques, can mitigate this challenge, but they are sub-optimal because they do not explore domain knowledge for designing the pretext tasks, and their contrastive learning losses do not try to cluster the normal training images, which may result in a sparse distribution of normal images that is ineffective for anomaly detection. In this paper, we propose a new self-supervised pre-training method for MIA UAD applications, named Pseudo Multi-class Strong Augmentation via Contrastive Learning (PMSACL). PMSACL consists of a novel optimisation method that contrasts a normal image class from multiple pseudo classes of synthesised abnormal images, with each class enforced to form a dense cluster in the feature space. In the experiments, we show that our PMSACL pre-training improves the accuracy of SOTA UAD methods on many MIA benchmarks using colonoscopy, fundus screening and Covid-19 Chest X-ray datasets. The code is made publicly available via https://github.com/tianyu0207/PMSACL.

Submitted 14 August, 2023; v1 submitted 3 September, 2021; originally announced September 2021.

Comments: Accepted to Medical Image Analysis

arXiv:2108.10579 [pdf, other]

Lossy Medical Image Compression using Residual Learning-based Dual Autoencoder Model

Authors: Dipti Mishra, Satish Kumar Singh, Rajat Kumar Singh

Abstract: In this work, we propose a two-stage autoencoder-based compressor-decompressor framework for compressing malaria RBC cell image patches. Medical images used for disease diagnosis can be multiple gigabytes in size. The proposed residual-based dual autoencoder network is trained to extract the unique features, which are then used to reconstruct the original image through the decompressor module. The two latent space representations (first for the original image and second for the residual image) are used to rebuild the final original image. Color-SSIM has been exclusively used to check the quality of the chrominance part of the cell images after decompression. The empirical results indicate that the proposed work outperformed other neural network related compression techniques for medical images by approximately 35%, 10% and 5% in PSNR, Color-SSIM and MS-SSIM respectively. The algorithm exhibits a significant improvement in bit savings of 76%, 78%, 75% and 74% over JPEG-LS, JP2K-LM, CALIC and a recent neural network approach respectively, making it a good compression-decompression technique.

Submitted 24 August, 2021; originally announced August 2021.

arXiv:2107.11662 [pdf, other]

Inference of collective Gaussian hidden Markov models

Authors: Rahul Singh, Yongxin Chen

Abstract: We consider inference problems for a class of continuous state collective hidden Markov models, where the data is recorded in aggregate (collective) form generated by a large population of individuals following the same dynamics. We propose an aggregate inference algorithm called the collective Gaussian forward-backward algorithm, extending the recently proposed Sinkhorn belief propagation algorithm to models characterized by Gaussian densities. Our algorithm enjoys a convergence guarantee. In addition, it reduces to the standard Kalman filter when the observations are generated by a single individual. The efficacy of the proposed algorithm is demonstrated through multiple experiments.

Submitted 24 July, 2021; originally announced July 2021.

arXiv:2107.07988 [pdf, other]

Controlled AutoEncoders to Generate Faces from Voices

Authors: Hao Liang, Lulan Yu, Guikang Xu, Bhiksha Raj, Rita Singh

Abstract: Multiple studies in the past have shown that there is a strong correlation between human vocal characteristics and facial features. However, existing approaches generate faces simply from voice, without exploring the set of features that contribute to these observed correlations. A computational methodology to explore this can be devised by rephrasing the question to: "how much would a target face have to change in order to be perceived as the originator of a source voice?" With this in perspective, we propose a framework to morph a target face in response to a given voice in a way that facial features are implicitly guided by learned voice-face correlation. Our framework includes a guided autoencoder that converts one face to another, controlled by a unique model-conditioning component called a gating controller, which modifies the reconstructed face based on input voice recordings. We evaluate the framework on the VoxCeleb and VGGFace datasets through human subjects and face retrieval. Various experiments demonstrate the effectiveness of our proposed model.

Submitted 16 July, 2021; originally announced July 2021.

arXiv:2106.06858 [pdf, other]

Improving weakly supervised sound event detection with self-supervised auxiliary tasks

Authors: Soham Deshmukh, Bhiksha Raj, Rita Singh

Abstract: While multitask and transfer learning have been shown to improve the performance of neural networks in limited data settings, they require pretraining of the model on large datasets beforehand. In this paper, we focus on improving the performance of weakly supervised sound event detection in low data and noisy settings simultaneously without requiring any pretraining task. To that end, we propose a shared encoder architecture with sound event detection as a primary task and an additional secondary decoder for a self-supervised auxiliary task. We empirically evaluate the proposed framework for weakly supervised sound event detection on a remix dataset of the DCASE 2019 Task 1 acoustic scene data with DCASE 2018 Task 2 sound event data under 0, 10 and 20 dB SNR. To ensure we retain the localisation information of multiple sound events, we propose a two-step attention pooling mechanism that provides a time-frequency localisation of multiple audio events in the clip. The proposed framework with two-step attention outperforms existing benchmark models by 22.3%, 12.8%, 5.9% on 0, 10 and 20 dB SNR respectively. We carry out an ablation study to determine the contribution of the auxiliary task and two-step attention pooling to the SED performance improvement.

Submitted 12 June, 2021; originally announced June 2021.

Comments: Accepted at INTERSPEECH 21

Showing 1–50 of 95 results for author: Singh, R