Search | arXiv e-print repository

Robust Pareto Design of GaN HEMTs for Millimeter-Wave Applications

Authors: Rafael Perez Martinez, Stephen Boyd, Srabanti Chowdhury

Abstract: This paper introduces a robust Pareto design approach for selecting Gallium Nitride (GaN) High Electron Mobility Transistors (HEMTs), particularly for power amplifier (PA) and low-noise amplifier (LNA) designs in 5G applications. We consider five key design variables and two settings (PAs and LNAs) where we have multiple objectives. We assess designs based on three critical objectives, evaluating each by its worst-case performance across a range of Gate-Source Voltages ($V_{\text{GS}}$). We conduct simulations across a range of $V_{\text{GS}}$ values to ensure a thorough and robust analysis. For PAs, the optimization goals are to maximize the worst-case modulated average output power ($P_{\text{out,avg}}$) and power-added efficiency ($PAE_{\text{avg}}$) while minimizing the worst-case average junction temperature ($T_{\text{j,avg}}$) under a modulated 64-QAM signal stimulus. In contrast, for LNAs, the focus is on maximizing the worst-case maximum oscillation frequency ($f_{\text{max}}$) and Gain, and minimizing the worst-case minimum noise figure ($NF_{\text{min}}$). We utilize a derivative-free optimization method to effectively identify robust Pareto optimal device designs. This approach enhances our comprehension of the trade-off space, facilitating more informed decision-making. Furthermore, this method is general across different applications. Although it does not guarantee a globally optimal design, we demonstrate its effectiveness in GaN device sizing. The primary advantage of this method is that it enables the attainment of near-optimal or even optimal designs with just a fraction of the simulations required for an exhaustive full-grid search.

Submitted 25 June, 2024; originally announced June 2024.
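
The worst-case evaluation and Pareto filtering described in the abstract above can be illustrated with a minimal sketch. It assumes the PA setting, where each candidate design has been simulated at several $V_{\text{GS}}$ points. The design names, the numbers, and the helper functions are hypothetical and not taken from the paper. The sketch only filters a given set of candidates; it does not reproduce the derivative-free search the authors use to propose them.

```python
from typing import Dict, List, Tuple

# One simulated operating point: (P_out_avg, PAE_avg, T_j_avg).
# The first two objectives are maximized, the third is minimized.
Point = Tuple[float, float, float]


def worst_case(samples: List[Point]) -> Point:
    """Collapse the per-V_GS samples of one design into its worst-case objectives."""
    p_out = min(s[0] for s in samples)  # lowest output power seen
    pae = min(s[1] for s in samples)    # lowest efficiency seen
    t_j = max(s[2] for s in samples)    # highest junction temperature seen
    return (p_out, pae, t_j)


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b in every objective and strictly better in one."""
    no_worse = a[0] >= b[0] and a[1] >= b[1] and a[2] <= b[2]
    better = a[0] > b[0] or a[1] > b[1] or a[2] < b[2]
    return no_worse and better


def robust_pareto(designs: Dict[str, List[Point]]) -> List[str]:
    """Return the names of designs whose worst-case objectives are non-dominated."""
    wc = {name: worst_case(samples) for name, samples in designs.items()}
    return sorted(
        name
        for name, value in wc.items()
        if not any(dominates(other, value) for o, other in wc.items() if o != name)
    )


# Hypothetical candidates, each simulated at two V_GS points.
EXAMPLE_DESIGNS: Dict[str, List[Point]] = {
    "A": [(30.0, 40.0, 120.0), (28.0, 42.0, 125.0)],
    "B": [(27.0, 39.0, 130.0), (26.0, 38.0, 131.0)],
    "C": [(25.0, 50.0, 110.0), (24.0, 48.0, 112.0)],
}
```

With these made-up numbers, `robust_pareto(EXAMPLE_DESIGNS)` keeps designs A and C: B is dominated by A in all three worst-case objectives, while A and C trade output power against efficiency and temperature.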

arXiv:2406.16355 [pdf, other]

Compact Model Parameter Extraction via Derivative-Free Optimization

Authors: Rafael Perez Martinez, Masaya Iwamoto, Kelly Woo, Zhengliang Bian, Roberto Tinti, Stephen Boyd, Srabanti Chowdhury

Abstract: In this paper, we address the problem of compact model parameter extraction to simultaneously extract tens of parameters via derivative-free optimization. Traditionally, parameter extraction is performed manually by dividing the complete set of parameters into smaller subsets, each targeting different operational regions of the device, a process that can take several days or even weeks. Our approach streamlines this process by employing derivative-free optimization to identify a good parameter set that best fits the compact model without performing an exhaustive number of simulations. We further enhance the optimization process to address critical issues in device modeling by carefully choosing a loss function that evaluates model performance consistently across varying magnitudes by focusing on relative errors (as opposed to absolute errors), prioritizing accuracy in key operational regions of the device above a certain threshold, and reducing sensitivity to outliers. Furthermore, we utilize the concept of train-test split to assess the model fit and avoid overfitting. This is done by fitting 80% of the data and testing the model efficacy with the remaining 20%. We demonstrate the effectiveness of our methodology by successfully modeling two semiconductor devices: a diamond Schottky diode and a GaN-on-SiC HEMT, with the latter involving the ASM-HEMT DC model, which requires simultaneously extracting 35 model parameters to fit the model to the measured data. These examples demonstrate the effectiveness of our approach and showcase the practical benefits of derivative-free optimization in device modeling.

Submitted 24 June, 2024; originally announced June 2024.
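
The abstract above names three properties of the loss (relative errors, a threshold on the region of interest, reduced sensitivity to outliers) and an 80/20 train-test split, but not the exact formulas. The sketch below is one possible reading, not the authors' implementation: it assumes a Huber penalty for outlier robustness and simply skips points whose measured magnitude is at or below the threshold. All function names and default values are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def huber(r: float, delta: float = 1.0) -> float:
    """Quadratic for small residuals, linear for large ones (limits outlier influence)."""
    a = abs(r)
    return 0.5 * r * r if a <= delta else delta * (a - 0.5 * delta)


def relative_loss(
    measured: Sequence[float],
    modeled: Sequence[float],
    threshold: float = 1e-6,
    delta: float = 1.0,
) -> float:
    """Mean Huber penalty on relative errors, using only points with |measured| > threshold."""
    terms = [
        huber((m_hat - m) / abs(m), delta)
        for m, m_hat in zip(measured, modeled)
        if abs(m) > threshold
    ]
    if not terms:
        raise ValueError("no measured point exceeds the threshold")
    return sum(terms) / len(terms)


def train_test_split(n: int, train_frac: float = 0.8, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle the indices 0..n-1 and split them into fit and hold-out sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_frac * n))
    return idx[:cut], idx[cut:]
```

In this reading, a derivative-free optimizer would minimize `relative_loss` over the training indices only, and the same loss evaluated on the held-out indices would indicate whether the extracted parameters overfit.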

arXiv:2406.16099 [pdf, other]

Speech Representation Analysis based on Inter- and Intra-Model Similarities

Authors: Yassine El Kheir, Ahmed Ali, Shammur Absar Chowdhury

Abstract: Self-supervised models have revolutionized speech processing, achieving new levels of performance in a wide variety of tasks with limited resources. However, the inner workings of these models are still opaque. In this paper, we aim to analyze the encoded contextual representation of these foundation models based on their inter- and intra-model similarity, independent of any external annotation and task-specific constraint. We examine different SSL models varying their training paradigm -- Contrastive (Wav2Vec2.0) and Predictive models (HuBERT); and model sizes (base and large). We explore these models on different levels of localization/distributivity of information including (i) individual neurons; (ii) layer representation; (iii) attention weights and (iv) compare the representations with their finetuned counterparts. Our results highlight that these models converge to similar representation subspaces but not to similar neuron-localized concepts (a concept represents a coherent fragment of knowledge, such as "a class containing certain objects as elements, where the objects have certain properties"). To facilitate further research, we have publicly released our code.

Submitted 23 June, 2024; originally announced June 2024.

Comments: 5 pages, Accepted to appear in ICASSP XAI-SA Workshop

arXiv:2406.14850 [pdf, other]

DExter: Learning and Controlling Performance Expression with Diffusion Models

Authors: Huan Zhang, Shreyan Chowdhury, Carlos Eduardo Cancino-Chacón, Jinhua Liang, Simon Dixon, Gerhard Widmer

Abstract: In the pursuit of developing expressive music performance models using artificial intelligence, this paper introduces DExter, a new approach leveraging diffusion probabilistic models to render Western classical piano performances. In this approach, performance parameters are represented in a continuous expression space and a diffusion model is trained to predict these continuous parameters while being conditioned on the musical score. Furthermore, DExter also enables the generation of interpretations (expressive variations of a performance) guided by perceptually meaningful features by conditioning jointly on score and perceptual feature representations. Consequently, we find that our model is useful for learning expressive performance, generating perceptually steered performances, and transferring performance styles. We assess the model through quantitative and qualitative analyses, focusing on specific performance metrics regarding dimensions like asynchrony and articulation, as well as through listening tests comparing generated performances with different human interpretations. Results show that DExter is able to capture the time-varying correlation of the expressive parameters, and compares well to existing rendering models in subjectively evaluated ratings. The perceptual-feature-conditioned generation and transferring capabilities of DExter are verified by a proxy model predicting perceptual characteristics of differently steered performances.

Submitted 20 June, 2024; originally announced June 2024.

Comments: in submission to appsci special session

arXiv:2406.13431 [pdf, other]

Children's Speech Recognition through Discrete Token Enhancement

Authors: Vrunda N. Sukhadia, Shammur Absar Chowdhury

Abstract: Children's speech recognition is considered a low-resource task mainly due to the lack of publicly available data. There are several reasons for such data scarcity, including expensive data collection and annotation processes, and data privacy, among others. Transforming speech signals into discrete tokens that do not carry sensitive information but capture both linguistic and acoustic information could be a solution for privacy concerns. In this study, we investigate the integration of discrete speech tokens into children's speech recognition systems as input without significantly degrading the ASR performance. Additionally, we explored single-view and multi-view strategies for creating these discrete labels. Furthermore, we tested the models for generalization capabilities with unseen domain and nativity datasets. Results reveal that the discrete token ASR for children achieves nearly equivalent performance with an approximate 83% reduction in parameters.

Submitted 24 June, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

Comments: Accepted at Interspeech 2024

arXiv:2406.04673 [pdf, other]

MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models

Authors: Sanjoy Chowdhury, Sayan Nag, K J Joseph, Balaji Vasan Srinivasan, Dinesh Manocha

Abstract: Music is a universal language that can communicate emotions and feelings. It forms an essential part of the whole spectrum of creative media, ranging from movies to social media posts. Machine learning models that can synthesize music are predominantly conditioned on textual descriptions of it. Inspired by how musicians compose music not just from a movie script, but also through visualizations, we propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music. MeLFusion is a text-to-music diffusion model with a novel "visual synapse", which effectively infuses the semantics from the visual modality into the generated music. To facilitate research in this area, we introduce a new dataset MeLBench, and propose a new evaluation metric IMSM. Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music, measured both objectively and subjectively, with a relative gain of up to 67.98% on the FAD score. We hope that our work will gather attention to this pragmatic, yet relatively under-explored research area.

Submitted 7 June, 2024; originally announced June 2024.

Comments: Accepted at CVPR 2024 as Highlight paper. Webpage: https://schowdhury671.github.io/melfusion_cvpr2024/

arXiv:2405.10435 [pdf, other]

Two-Stage Stochastic Optimal Power Flow for Microgrids With Uncertain Wildfire Effects

Authors: Sifat Chowdhury, Yu Zhang

Abstract: Large-scale power outages caused by extreme weather events are one of the major factors weakening grid resilience. In order to prevent the critical infrastructure from cascading failure, power lines are often proactively de-energized under the threat of a progressing wildfire. In this context, the potential of a microgrid (MG) functioning in islanded mode can be exploited to enhance the resiliency of the power grid. However, there are numerous uncertainties originating from these types of events and an accurate modeling of the MG is required to harness its full potential. In this paper, we consider the uncertainty in line outages depending on fire propagation and reduced solar power generation due to the particulate matter in wildfire smoke. We formulate a two-stage stochastic MG optimal power flow problem by utilizing a second-order cone relaxation of the DistFlow model. Leveraging an effective approximation of the resistive heat gain, we separate the complicating constraints of dynamic line rating from the resulting optimization problem. Extensive simulation results corroborate the merits of our proposed framework, which is tested on a modified IEEE 22-bus system.

Submitted 16 May, 2024; originally announced May 2024.

arXiv:2403.07125 [pdf, other]

Learning-Aided Control of Robotic Tether-Net with Maneuverable Nodes to Capture Large Space Debris

Authors: Achira Boonrath, Feng Liu, Elenora M. Botta, Souma Chowdhury

Abstract: Maneuverable tether-net systems launched from an unmanned spacecraft offer a promising solution for the active removal of large space debris. Guaranteeing the successful capture of such space debris is dependent on the ability to reliably maneuver the tether-net system -- a flexible, many-DoF (thus complex) system -- for a wide range of launch scenarios. Here, scenarios are defined by the relative location of the debris with respect to the chaser spacecraft. This paper represents and solves this problem as a hierarchically decentralized implementation of robotic trajectory planning and control and demonstrates the effectiveness of the approach when applied to two different tether-net systems, with 4 and 8 maneuverable units (MUs), respectively. Reinforcement learning (policy gradient) is used to design the centralized trajectory planner that, based on the relative location of the target debris at the launch of the net, computes the final aiming positions of each MU, from which their trajectory can be derived. Each MU then seeks to follow its assigned trajectory by using a decentralized PID controller that outputs the MU's thrust vector and is informed by noisy sensor feedback (for realism) of its relative location. System performance is assessed in terms of capture success and overall fuel consumption by the MUs. Reward shaping and surrogate models are used to respectively guide and speed up the RL process. Simulation-based experiments show that this approach allows the successful capture of debris at fuel costs that are notably lower than nominal baselines, including in scenarios where the debris is significantly off-centered compared to the approaching chaser spacecraft.

Submitted 11 March, 2024; originally announced March 2024.

Comments: This paper was accepted for presentation in proceedings of IEEE International Conference on Robotics and Automation 2024

arXiv:2403.01789 [pdf, other]

DECOR: Enhancing Logic Locking Against Machine Learning-Based Attacks

Authors: Yinghua Hu, Kaixin Yang, Subhajit Dutta Chowdhury, Pierluigi Nuzzo

Abstract: Logic locking (LL) has gained attention as a promising intellectual property protection measure for integrated circuits. However, recent attacks, facilitated by machine learning (ML), have shown the potential to predict the correct key in multiple LL schemes by exploiting the correlation of the correct key value with the circuit structure. This paper presents a generic LL enhancement method based on a randomized algorithm that can significantly decrease the correlation between locked circuit netlist and correct key values in an LL scheme. Numerical results show that the proposed method can efficiently degrade the accuracy of state-of-the-art ML-based attacks down to around 50%, resulting in negligible advantage versus random guessing.

Submitted 4 March, 2024; originally announced March 2024.

Comments: 8 pages. Accepted at the International Symposium on Quality Electronic Design (ISQED), 2024

arXiv:2401.14826 [pdf, other]

Expressivity-aware Music Performance Retrieval using Mid-level Perceptual Features and Emotion Word Embeddings

Authors: Shreyan Chowdhury, Gerhard Widmer

Abstract: This paper explores a specific sub-task of cross-modal music retrieval. We consider the delicate task of retrieving a performance or rendition of a musical piece based on a description of its style, expressive character, or emotion from a set of different performances of the same piece. We observe that a general purpose cross-modal system trained to learn a common text-audio embedding space does not yield optimal results for this task. By introducing two changes -- one each to the text encoder and the audio encoder -- we demonstrate improved performance on a dataset of piano performances and associated free-text descriptions. On the text side, we use emotion-enriched word embeddings (EWE) and on the audio side, we extract mid-level perceptual features instead of generic audio embeddings. Our results highlight the effectiveness of mid-level perceptual features learnt from music and emotion-enriched word embeddings learnt from emotion-labelled text in capturing musical expression in a cross-modal setting. Additionally, our interpretable mid-level features provide a route for introducing explainability in the retrieval and downstream recommendation processes.

Submitted 26 January, 2024; originally announced January 2024.

Comments: Presented at FIRE 2023 (Forum for Information Retrieval Evaluation) conference, Goa, India

arXiv:2401.04864 [pdf]

Microgravity Mass Gauging with Capacitance Sensing: Sensor Design and Experiment

Authors: M. A. Charleston, S. M. Chowdhury, Q. M. Marashdeh, B. J. Straiton, F. L. Teixeira

Abstract: The use of capacitance sensors for fuel mass gauging has been in consideration since the early days of manned space flight. However, certain difficulties arise when considering tanks in microgravity environments. Surface tension effects lead to fluid wetting of the interior surface of the tank, leaving large interior voids, while thrust/settling effects can lead to dispersed two-phase mixtures. With the exception of Electrical Capacitance Volume Tomography (ECVT), few sensing technologies are well suited for measuring annular, stratified, and dispersed fluid configurations as well as handling the additional complications of mechanical installation inside a spherical tank. To optimize the design of future ECVT-based spherical tank mass gauging sensors, different electrode plate layouts are considered, and their effect on the performance of the sensor as a fuel mass gauge is analyzed through the use of imaging and averaging techniques.

Submitted 9 January, 2024; originally announced January 2024.

Comments: 19 pages, 26 figures, 5 tables

arXiv:2310.13974 [pdf, other]

Automatic Pronunciation Assessment -- A Review

Authors: Yassine El Kheir, Ahmed Ali, Shammur Absar Chowdhury

Abstract: Pronunciation assessment and its application in computer-aided pronunciation training (CAPT) have seen impressive progress in recent years. With the rapid growth in language processing and deep learning over the past few years, there is a need for an updated review. In this paper, we review methods employed in pronunciation assessment for both phonemic and prosodic aspects. We categorize the main challenges observed in prominent research trends, and highlight existing limitations and available resources. This is followed by a discussion of the remaining challenges and possible directions for future work.

Submitted 21 October, 2023; originally announced October 2023.

Comments: 9 pages, accepted to EMNLP Findings

arXiv:2309.15674 [pdf, other]

Speech collage: code-switched audio generation by collaging monolingual corpora

Authors: Amir Hussein, Dorsa Zeinali, Ondřej Klejch, Matthew Wiesner, Brian Yan, Shammur Chowdhury, Ahmed Ali, Shinji Watanabe, Sanjeev Khudanpur

Abstract: Designing effective automatic speech recognition (ASR) systems for Code-Switching (CS) often depends on the availability of transcribed CS resources. To address data scarcity, this paper introduces Speech Collage, a method that synthesizes CS data from monolingual corpora by splicing audio segments. We further improve the smoothness quality of audio generation using an overlap-add approach. We investigate the impact of generated data on speech recognition in two scenarios: using in-domain CS text and a zero-shot approach with synthesized CS text. Empirical results highlight up to 34.4% and 16.2% relative reductions in Mixed-Error Rate and Word-Error Rate for in-domain and zero-shot scenarios, respectively. Lastly, we demonstrate that CS augmentation bolsters the model's code-switching inclination and reduces its monolingual bias.

Submitted 27 September, 2023; originally announced September 2023.
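
The abstract above mentions splicing audio segments and smoothing the joins with an overlap-add approach, without giving details. A minimal sketch of the general idea follows, assuming a linear cross-fade over a fixed number of overlapping samples. The function name, the window shape, and the sample values are assumptions for illustration and may differ from what Speech Collage actually does.

```python
from typing import List, Sequence


def splice_overlap_add(a: Sequence[float], b: Sequence[float], overlap: int) -> List[float]:
    """Join segment b onto segment a, cross-fading the last/first `overlap` samples."""
    if overlap < 0 or overlap > min(len(a), len(b)):
        raise ValueError("overlap must be between 0 and the shorter segment length")
    if overlap == 0:
        return list(a) + list(b)  # plain concatenation, no smoothing

    start = len(a) - overlap
    head = list(a[:start])  # part of a that is left untouched
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # fade-in weight for b; a gets 1 - w
        mixed.append((1.0 - w) * a[start + i] + w * b[i])
    tail = list(b[overlap:])  # part of b that is left untouched
    return head + mixed + tail
```

Because the two weights always sum to one, splicing two constant segments of equal level leaves the level unchanged across the join, while the output is `overlap` samples shorter than plain concatenation.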

arXiv:2309.07739 [pdf, other]

The complementary roles of non-verbal cues for Robust Pronunciation Assessment

Authors: Yassine El Kheir, Shammur Absar Chowdhury, Ahmed Ali

Abstract: Research on pronunciation assessment systems focuses on utilizing phonetic and phonological aspects of non-native (L2) speech, often neglecting the rich layer of information hidden within the non-verbal cues. In this study, we propose a novel pronunciation assessment framework, IntraVerbalPA. The framework innovatively incorporates both fine-grained frame- and abstract utterance-level non-verbal cues, alongside the conventional speech and phoneme representations. Additionally, we introduce the "Goodness of phonemic-duration" metric to effectively model duration distribution within the framework. Our results validate the effectiveness of the proposed IntraVerbalPA framework and its individual components, yielding performance that either matches or outperforms existing research works.

Submitted 14 September, 2023; originally announced September 2023.

Comments: 5 pages, submitted to ICASSP 2024

arXiv:2309.07719 [pdf, other]

L1-aware Multilingual Mispronunciation Detection Framework

Authors: Yassine El Kheir, Shammur Absar Chowdhury, Ahmed Ali

Abstract: The phonological discrepancies between a speaker's native (L1) and the non-native language (L2) serve as a major factor for mispronunciation. This paper introduces a novel multilingual MDD architecture, L1-MultiMDD, enriched with L1-aware speech representation. An end-to-end speech encoder is trained on the input signal and its corresponding reference phoneme sequence. First, an attention mechanism is deployed to align the input audio with the reference phoneme sequence. Afterwards, the L1-L2-speech embeddings are extracted from an auxiliary model, pretrained in a multi-task setup identifying L1 and L2 language, and are infused with the primary network. Finally, L1-MultiMDD is optimized for a unified multilingual phoneme recognition task using connectionist temporal classification (CTC) loss for the target languages: English, Arabic, and Mandarin. Our experiments demonstrate the effectiveness of the proposed L1-MultiMDD framework on both seen -- L2-ARTIC, LATIC, and AraVoiceL2v2; and unseen -- EpaDB and Speechocean762 datasets. The consistent gains in PER and false rejection rate (FRR) across all target languages confirm our approach's robustness, efficacy, and generalizability.

Submitted 21 September, 2023; v1 submitted 14 September, 2023; originally announced September 2023.

Comments: 5 pages, submitted to ICASSP 2024

arXiv:2308.12370 [pdf, other]

AdVerb: Visually Guided Audio Dereverberation

Authors: Sanjoy Chowdhury, Sreyan Ghosh, Subhrajyoti Dasgupta, Anton Ratnarajah, Utkarsh Tyagi, Dinesh Manocha

Abstract: We present AdVerb, a novel audio-visual dereverberation framework that uses visual cues in addition to the reverberant sound to estimate clean audio. Although audio-only dereverberation is a well-studied problem, our approach incorporates the complementary visual modality to perform audio dereverberation. Given an image of the environment where the reverberated sound signal has been recorded, AdVerb employs a novel geometry-aware cross-modal transformer architecture that captures scene geometry and audio-visual cross-modal relationship to generate a complex ideal ratio mask, which, when applied to the reverberant audio, predicts the clean sound. The effectiveness of our method is demonstrated through extensive quantitative and qualitative evaluations. Our approach significantly outperforms traditional audio-only and audio-visual baselines on three downstream tasks: speech enhancement, speech recognition, and speaker verification, with relative improvements in the range of 18% - 82% on the LibriSpeech test-clean set. We also achieve highly satisfactory RT60 error scores on the AVSpeech dataset.

Submitted 23 August, 2023; originally announced August 2023.

Comments: Accepted at ICCV 2023. For project page, see https://gamma.umd.edu/researchdirections/speech/adverb

arXiv:2308.02503 [pdf, other]

MyVoice: Arabic Speech Resource Collaboration Platform

Authors: Yousseif Elshahawy, Yassine El Kheir, Shammur Absar Chowdhury, Ahmed Ali

Abstract: We introduce MyVoice, a crowdsourcing platform designed to collect Arabic speech to enhance dialectal speech technologies. This platform offers an opportunity to design large dialectal speech datasets and make them publicly available. MyVoice allows contributors to select city/country-level fine-grained dialect and record the displayed utterances. Users can switch roles between contributors and annotators. The platform incorporates a quality assurance system that filters out low-quality and spurious recordings before sending them for validation. During the validation phase, contributors can assess the quality of recordings, annotate them, and provide feedback which is then reviewed by administrators. Furthermore, the platform offers flexibility to admin roles to add new data or tasks beyond dialectal speech and word collection, which are displayed to contributors. This enables collaborative efforts in gathering diverse and large Arabic speech data.

Submitted 23 July, 2023; originally announced August 2023.

Comments: 2 pages, accepted at InterSpeech23 Show and Tell Session

arXiv:2307.03061 [pdf, other]

Learning Constrained Corner Node Trajectories of a Tether Net System for Space Debris Capture

Authors: Feng Liu, Achira Boonrath, Prajit KrisshnaKumar, Elenora M. Botta, Souma Chowdhury

Abstract: The earth's orbit is becoming increasingly crowded with debris that poses significant safety risks to the operation of existing and new spacecraft and satellites. The active tether-net system, which consists of a flexible net with maneuverable corner nodes launched from a small autonomous spacecraft, is a promising solution for capturing and disposing of such space debris. The requirement of autonomous operation and the need to generalize over debris scenarios with different rotational rates make the capture process significantly challenging. The space debris could rotate about multiple axes, which, along with sensing/estimation and actuation uncertainties, calls for a robust, generalizable approach to guiding the net launch and flight - one that can guarantee robust capture. This paper proposes a decentralized actuation system combined with reinforcement learning for planning and controlling this tether-net system. In this new system, four microsatellites with cold gas type thrusters act as the corner nodes of the net and can thus help control or correct the flight of the net after launch. The microsatellites pull the net to complete the task of approaching and capturing the space debris. The proposed method uses an RL framework that integrates proximal policy optimization to find the optimal solution based on the dynamics simulation of the net and the microsatellites performed in Vortex Studio. The RL framework finds the optimal trajectory that is both fuel-efficient and ensures a desired level of capture quality.

Submitted 6 July, 2023; originally announced July 2023.

Comments: This paper was presented at AIAA Aviation 2023 Forum

arXiv:2306.01845 [pdf, other]

Multi-View Multi-Task Representation Learning for Mispronunciation Detection

Authors: Yassine El Kheir, Shammur Absar Chowdhury, Ahmed Ali

Abstract: The disparity in phonology between learner's native (L1) and target (L2) language poses a significant challenge for mispronunciation detection and diagnosis (MDD) systems. This challenge is further intensified by the lack of annotated L2 data. This paper proposes a novel MDD architecture that exploits multiple `views' of the same input data assisted by auxiliary tasks to learn more distinctive phonetic representation in a low-resource setting. Using the mono- and multilingual encoders, the model learns multiple views of the input, and captures the sound properties across diverse languages and accents. These encoded representations are further enriched by learning articulatory features in a multi-task setup. Our reported results using the L2-ARCTIC data outperformed the SOTA models, with a phoneme error rate reduction of 11.13% and 8.60% and an absolute F1 score increase of 5.89% and 2.49% compared to the single-view mono- and multilingual systems, with a limited L2 dataset.

Submitted 7 August, 2023; v1 submitted 2 June, 2023; originally announced June 2023.

Comments: 5 pages, Accepted SLaTE23

arXiv:2305.09688 [pdf]

OOD-Speech: A Large Bengali Speech Recognition Dataset for Out-of-Distribution Benchmarking

Authors: Fazle Rabbi Rakib, Souhardya Saha Dip, Samiul Alam, Nazia Tasnim, Md. Istiak Hossain Shihab, Md. Nazmuddoha Ansary, Syed Mobassir Hossen, Marsia Haque Meghla, Mamunur Mamun, Farig Sadeque, Sayma Sultana Chowdhury, Tahsin Reasat, Asif Sushmit, Ahmed Imtiaz Humayun

Abstract: We present OOD-Speech, the first out-of-distribution (OOD) benchmarking dataset for Bengali automatic speech recognition (ASR). Being one of the most spoken languages globally, Bengali portrays large diversity in dialects and prosodic features, which demands ASR frameworks to be robust towards distribution shifts. For example, Islamic religious sermons in Bengali are delivered with a tonality that is significantly different from regular speech. Our training dataset is collected via massively online crowdsourcing campaigns which resulted in 1177.94 hours collected and curated from 22,645 native Bengali speakers from South Asia. Our test dataset comprises 23.03 hours of speech collected and manually annotated from 17 different sources, e.g., Bengali TV drama, Audiobook, Talk show, Online class, and Islamic sermons to name a few. OOD-Speech is jointly the largest publicly available speech dataset, as well as the first out-of-distribution ASR benchmarking dataset for Bengali.

Submitted 15 May, 2023; originally announced May 2023.

arXiv:2305.07790 [pdf]

Automated Grain Boundary (GB) Segmentation and Microstructural Analysis in 347H Stainless Steel Using Deep Learning and Multimodal Microscopy

Authors: Shoieb Ahmed Chowdhury, M. F. N. Taufique, Jing Wang, Marissa Masden, Madison Wenzlick, Ram Devanathan, Alan L Schemer-Kohrn, Keerti S Kappagantula

Abstract: Austenitic 347H stainless steel offers superior mechanical properties and corrosion resistance required for extreme operating conditions such as high temperature. The change in microstructure due to composition and process variations is expected to impact material properties. Identifying microstructural features such as grain boundaries thus becomes an important task in the process-microstructure-properties loop. Applying convolutional neural network (CNN) based deep-learning models is a powerful technique to detect features from material micrographs in an automated manner. Manual labeling of the images for the segmentation task poses a major bottleneck for generating training data and labels in a reliable and reproducible way within a reasonable timeframe. In this study, we attempt to overcome such limitations by utilizing multi-modal microscopy to generate labels directly instead of manual labeling. We combine scanning electron microscopy (SEM) images of 347H stainless steel as training data and electron backscatter diffraction (EBSD) micrographs as pixel-wise labels for grain boundary detection as a semantic segmentation task. We demonstrate that despite producing instrumentation drift during data collection between two modes of microscopy, this method performs comparably to similar segmentation tasks that used manual labeling. Additionally, we find that naïve pixel-wise segmentation results in small gaps and missing boundaries in the predicted grain boundary map. By incorporating topological information during model training, the connectivity of the grain boundary network and segmentation performance is improved. Finally, our approach is validated by accurate computation on downstream tasks of predicting the underlying grain morphology distributions which are the ultimate quantities of interest for microstructural characterization.

Submitted 12 May, 2023; originally announced May 2023.

arXiv:2305.07445 [pdf, other]

QVoice: Arabic Speech Pronunciation Learning Application

Authors: Yassine El Kheir, Fouad Khnaisser, Shammur Absar Chowdhury, Hamdy Mubarak, Shazia Afzal, Ahmed Ali

Abstract: This paper introduces a novel Arabic pronunciation learning application QVoice, powered with end-to-end mispronunciation detection and feedback generator module. The application is designed to support non-native Arabic speakers in enhancing their pronunciation skills, while also helping native speakers mitigate any potential influence from regional dialects on their Modern Standard Arabic (MSA) pronunciation. QVoice employs various learning cues to aid learners in comprehending meaning, drawing connections with their existing knowledge of English language, and offers detailed feedback for pronunciation correction, along with contextual examples showcasing word usage. The learning cues featured in QVoice encompass a wide range of meaningful information, such as visualizations of phrases/words and their translations, as well as phonetic transcriptions and transliterations. QVoice provides pronunciation feedback at the character level and assesses performance at the word level.

Submitted 9 May, 2023; originally announced May 2023.

Comments: 2 pages, Accepted InterSpeech23 Show & Tell Demo Session

Journal ref: InterSpeech 2023

arXiv:2304.00649 [pdf, other]

Multilingual Word Error Rate Estimation: e-WER3

Authors: Shammur Absar Chowdhury, Ahmed Ali

Abstract: The success of the multilingual automatic speech recognition systems empowered many voice-driven applications. However, measuring the performance of such systems remains a major challenge, due to its dependency on manually transcribed speech data in both mono- and multilingual scenarios. In this paper, we propose a novel multilingual framework -- eWER3 -- jointly trained on acoustic and lexical representation to estimate word error rate. We demonstrate the effectiveness of eWER3 to (i) predict WER without using any internal states from the ASR and (ii) use the multilingual shared latent space to push the performance of the close-related languages. We show our proposed multilingual model outperforms the previous monolingual word error rate estimation method (eWER2) by an absolute 9% increase in Pearson correlation coefficient (PCC), with better overall estimation between the predicted and reference WER.

Submitted 2 April, 2023; originally announced April 2023.

Comments: Accepted in ICASSP, Multilingual WER estimation, End-to-End systems, multilingual model, automatic word error rate estimation

arXiv:2303.01875 [pdf, other]

Decoding and Visualising Intended Emotion in an Expressive Piano Performance

Authors: Shreyan Chowdhury, Gerhard Widmer

Abstract: Expert musicians can mould a musical piece to convey specific emotions that they intend to communicate. In this paper, we place a mid-level features based music emotion model in this performer-to-listener communication scenario, and demonstrate, via a small visualisation, music emotion decoding in real time. We also extend the existing set of mid-level features using analogues of perceptual speed and perceived dynamics.

Submitted 3 March, 2023; originally announced March 2023.

Comments: Extended version of Late-Breaking Demo Session paper accepted at ISMIR 2022 (23rd Int. Society for Music Information Retrieval Conf., Bengaluru, India, 2022)

arXiv:2302.13921 [pdf, other]

Autonomous Polycrystalline Material Decomposition for Hyperspectral Neutron Tomography

Authors: Mohammad Samin Nur Chowdhury, Diyu Yang, Shimin Tang, Singanallur V. Venkatakrishnan, Hassina Z. Bilheux, Gregery T. Buzzard, Charles A. Bouman

Abstract: Hyperspectral neutron tomography is an effective method for analyzing crystalline material samples with complex compositions in a non-destructive manner. Since the counts in the hyperspectral neutron radiographs directly depend on the neutron cross-sections, materials may exhibit contrasting neutron responses across wavelengths. Therefore, it is possible to extract the unique signatures associated with each material and use them to separate the crystalline phases simultaneously. We introduce an autonomous material decomposition (AMD) algorithm to automatically characterize and localize polycrystalline structures using Bragg edges with contrasting neutron responses from hyperspectral data. The algorithm estimates the linear attenuation coefficient spectra from the measured radiographs and then uses these spectra to perform polycrystalline material decomposition and reconstructs 3D material volumes to localize materials in the spatial domain. Our results demonstrate that the method can accurately estimate both the linear attenuation coefficient spectra and associated reconstructions on both simulated and experimental neutron data.

Submitted 21 August, 2023; v1 submitted 27 February, 2023; originally announced February 2023.

arXiv:2212.00647 [pdf, other]

An Edge Alignment-based Orientation Selection Method for Neutron Tomography

Authors: Diyu Yang, Shimin Tang, Singanallur V. Venkatakrishnan, Mohammad S. N. Chowdhury, Yuxuan Zhang, Hassina Z. Bilheux, Gregery T. Buzzard, Charles A. Bouman

Abstract: Neutron computed tomography (nCT) is a 3D characterization technique used to image the internal morphology or chemical composition of samples in biology and materials sciences. A typical workflow involves placing the sample in the path of a neutron beam, acquiring projection data at a predefined set of orientations, and processing the resulting data using an analytic reconstruction algorithm. Typical nCT scans require hours to days to complete and are then processed using conventional filtered back-projection (FBP), which performs poorly with sparse views or noisy data. Hence, the main methods for reducing overall acquisition time are the use of an improved sampling strategy combined with the use of advanced reconstruction methods such as model-based iterative reconstruction (MBIR). In this paper, we propose an adaptive orientation selection method in which an MBIR reconstruction on previously-acquired measurements is used to define an objective function on orientations that balances a data-fitting term promoting edge alignment and a regularization term promoting orientation diversity. Using simulated and experimental data, we demonstrate that our method produces high-quality reconstructions using significantly fewer total measurements than the conventional approach.

Submitted 8 March, 2023; v1 submitted 1 December, 2022; originally announced December 2022.

arXiv:2211.16319 [pdf, other]

Benchmarking Evaluation Metrics for Code-Switching Automatic Speech Recognition

Authors: Injy Hamed, Amir Hussein, Oumnia Chellah, Shammur Chowdhury, Hamdy Mubarak, Sunayana Sitaram, Nizar Habash, Ahmed Ali

Abstract: Code-switching poses a number of challenges and opportunities for multilingual automatic speech recognition. In this paper, we focus on the question of robust and fair evaluation metrics. To that end, we develop a reference benchmark data set of code-switching speech recognition hypotheses with human judgments. We define clear guidelines for minimal editing of automatic hypotheses. We validate the guidelines using 4-way inter-annotator agreement. We evaluate a large number of metrics in terms of correlation with human judgments. The metrics we consider vary in terms of representation (orthographic, phonological, semantic), directness (intrinsic vs extrinsic), granularity (e.g. word, character), and similarity computation method. The highest correlation to human judgment is achieved using transliteration followed by text normalization. We release the first corpus for human acceptance of code-switching speech recognition results in dialectal Arabic/English conversation speech.

Submitted 22 November, 2022; originally announced November 2022.

Comments: Accepted to SLT 2022

arXiv:2211.00923 [pdf, other]

SpeechBlender: Speech Augmentation Framework for Mispronunciation Data Generation

Authors: Yassine El Kheir, Shammur Absar Chowdhury, Ahmed Ali, Hamdy Mubarak, Shazia Afzal

Abstract: The lack of labeled second language (L2) speech data is a major challenge in designing mispronunciation detection models. We introduce SpeechBlender - a fine-grained data augmentation pipeline for generating mispronunciation errors to overcome such data scarcity. The SpeechBlender utilizes varieties of masks to target different regions of phonetic units, and uses the mixing factors to linearly interpolate raw speech signals while augmenting pronunciation. The masks facilitate smooth blending of the signals, generating more effective samples than the `Cut/Paste' method. Our proposed technique achieves state-of-the-art results, with Speechocean762, on ASR dependent mispronunciation detection models at phoneme level, with a 2.0% gain in Pearson Correlation Coefficient (PCC) compared to the previous state-of-the-art [1]. Additionally, we demonstrate a 5.0% improvement at the phoneme level compared to our baseline. We also observed a 4.6% increase in F1-score with Arabic AraVoiceL2 testset.

Submitted 12 July, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

Comments: 5 pages

arXiv:2206.08835 [pdf, other]

What can Speech and Language Tell us About the Working Alliance in Psychotherapy

Authors: Sebastian P. Bayerl, Gabriel Roccabruna, Shammur Absar Chowdhury, Tommaso Ciulli, Morena Danieli, Korbinian Riedhammer, Giuseppe Riccardi

Abstract: We are interested in the problem of conversational analysis and its application to the health domain. Cognitive Behavioral Therapy is a structured approach in psychotherapy, allowing the therapist to help the patient to identify and modify the malicious thoughts, behavior, or actions. This cooperative effort can be evaluated using the Working Alliance Inventory Observer-rated Shortened - a 12-item inventory covering task, goal, and relationship - which has a relevant influence on therapeutic outcomes. In this work, we investigate the relation between this alliance inventory and the spoken conversations (sessions) between the patient and the psychotherapist. We have delivered eight weeks of e-therapy, collected their audio and video call sessions, and manually transcribed them. The spoken conversations have been annotated and evaluated with WAI ratings by professional therapists. We have investigated speech and language features and their association with WAI items. The feature types include turn dynamics, lexical entrainment, and conversational descriptors extracted from the speech and language signals. Our findings provide strong evidence that a subset of these features are strong indicators of working alliance. To the best of our knowledge, this is the first study to exploit speech and language for characterising working alliance.

Submitted 27 June, 2022; v1 submitted 17 June, 2022; originally announced June 2022.

Comments: Accepted at Interspeech 2022

arXiv:2205.11992 [pdf, other]

Co-optimization of Battery Routing and Load Restoration for Microgrids with Mobile Energy Storage Systems

Authors: Shourya Bose, Sifat Chowdhury, Yu Zhang

Abstract: Mobile energy storage systems (MESS) offer great operational flexibility to enhance the resiliency of distribution systems in an emergency condition. The optimal placement and sizing of those units are pivotal for quickly restoring the curtailed loads. In this paper, we propose a model for load restoration in a microgrid while concurrently optimizing the MESS routes required for the same. The model is formulated as a mixed integer second order cone program by considering the state of charge and evolution of the lower and upper bounds of battery capacities. Simulation results tested on the IEEE 123-bus benchmark system demonstrate the efficacy of the proposed model.

Submitted 24 May, 2022; originally announced May 2022.

Comments: PRES GM 2022 Conference

arXiv:2201.02550 [pdf, other]

Textual Data Augmentation for Arabic-English Code-Switching Speech Recognition

Authors: Amir Hussein, Shammur Absar Chowdhury, Ahmed Abdelali, Najim Dehak, Ahmed Ali, Sanjeev Khudanpur

Abstract: The pervasiveness of intra-utterance code-switching (CS) in spoken content requires that speech recognition (ASR) systems handle mixed language. Designing a CS-ASR system has many challenges, mainly due to data scarcity, grammatical structure complexity, and domain mismatch. The most common method for addressing CS is to train an ASR system with the available transcribed CS speech, along with monolingual data. In this work, we propose a zero-shot learning methodology for CS-ASR by augmenting the monolingual data with artificially generated CS text. We based our approach on random lexical replacements and Equivalence Constraint (EC) while exploiting aligned translation pairs to generate random and grammatically valid CS content. Our empirical results show a 65.5% relative reduction in language model perplexity, and 7.7% in ASR WER on two ecologically valid CS test sets. The human evaluation of the generated text using EC suggests that more than 80% is of adequate quality.

Submitted 11 January, 2023; v1 submitted 7 January, 2022; originally announced January 2022.

arXiv:2110.00728 [pdf]

Implementation of MPPT Technique of Solar Module with Supervised Machine Learning

Authors: Ruhi Sharmin, Sayeed Shafayet Chowdhury, Farihal Abedin, Kazi Mujibur Rahman

Abstract: In this paper, we proposed a method using supervised ML in solar PV system for MPPT analysis. For this purpose, an overall schematic diagram of a PV system is designed and simulated to create a dataset in MATLAB/Simulink. Thus, by analyzing the output characteristics of a solar cell, an improved MPPT algorithm on the basis of neural network (NN) method is put forward to track the maximum power point (MPP) of solar cell modules. To perform the task, Bayesian Regularization method was chosen as the training algorithm as it works best even for smaller data supporting the wide range of the train data set. The theoretical results show that the improved NN MPPT algorithm has higher efficiency compared with the Perturb and Observe method in the same environment, and the PV system can keep working at MPP without oscillation and probability of any kind of misjudgment. So it can not only reduce misjudgment, but also avoid power loss around the MPP. Moreover, we implemented the algorithm in a hardware set-up and verified the theoretical result comparing it with the empirical data.

Submitted 2 October, 2021; originally announced October 2021.

Comments: 11 pages, 11 Figures, 5 Tables

arXiv:2109.14979 [pdf, other]

doi 10.1109/ICCVW54120.2021.00103

Moving Object Detection for Event-based vision using Graph Spectral Clustering

Authors: Anindya Mondal, Shashant R, Jhony H. Giraldo, Thierry Bouwmans, Ananda S. Chowdhury

Abstract: Moving object detection has been a central topic of discussion in computer vision for its wide range of applications like in self-driving cars, video surveillance, security, and enforcement. Neuromorphic Vision Sensors (NVS) are bio-inspired sensors that mimic the working of the human eye. Unlike conventional frame-based cameras, these sensors capture a stream of asynchronous 'events' that pose multiple advantages over the former, like high dynamic range, low latency, low power consumption, and reduced motion blur. However, these advantages come at a high cost, as the event camera data typically contains more noise and has low resolution. Moreover, as event-based cameras can only capture the relative changes in brightness of a scene, event data do not contain usual visual information (like texture and color) as available in video data from normal cameras. So, moving object detection in event-based cameras becomes an extremely challenging task. In this paper, we present an unsupervised Graph Spectral Clustering technique for Moving Object Detection in Event-based data (GSCEventMOD). We additionally show how the optimum number of moving objects can be automatically determined. Experimental comparisons on publicly available datasets show that the proposed GSCEventMOD algorithm outperforms a number of state-of-the-art techniques by a maximum margin of 30%.

Submitted 2 December, 2021; v1 submitted 30 September, 2021; originally announced September 2021.

Comments: Ten pages, five figures, Published in 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, QC, Canada

arXiv:2108.08952 [pdf, other]

Mitigating Greenhouse Gas Emissions Through Generative Adversarial Networks Based Wildfire Prediction

Authors: Sifat Chowdhury, Kai Zhu, Yu Zhang

Abstract: Over the past decade, the number of wildfires has increased significantly around the world, especially in the State of California. The high-level concentration of greenhouse gas (GHG) emitted by wildfires aggravates global warming that further increases the risk of more fires. Therefore, an accurate prediction of wildfire occurrence greatly helps in preventing large-scale and long-lasting wildfires and reducing the consequent GHG emissions. Various methods have been explored for wildfire risk prediction. However, the complex correlations among a lot of natural and human factors and wildfire ignition make the prediction task very challenging. In this paper, we develop a deep learning based data augmentation approach for wildfire risk prediction. We build a dataset consisting of diverse features responsible for fire ignition and utilize a conditional tabular generative adversarial network to explore the underlying patterns between the target value of risk levels and all involved features. For fair and comprehensive comparisons, we compare our proposed scheme with five other baseline methods where the former outperformed most of them. To corroborate the robustness, we have also tested the performance of our method with another dataset that also resulted in better efficiency. By adopting the proposed method, we can take preventive strategies of wildfire mitigation to reduce global GHG emissions.

Submitted 19 August, 2021; originally announced August 2021.

arXiv:2107.13231 [pdf, other]

On Perceived Emotion in Expressive Piano Performance: Further Experimental Evidence for the Relevance of Mid-level Perceptual Features

Authors: Shreyan Chowdhury, Gerhard Widmer

Abstract: Despite recent advances in audio content-based music emotion recognition, a question that remains to be explored is whether an algorithm can reliably discern emotional or expressive qualities between different performances of the same piece. In the present work, we analyze several sets of features on their effectiveness in predicting arousal and valence of six different performances (by six famous pianists) of Bach's Well-Tempered Clavier Book 1. These features include low-level acoustic features, score-based features, features extracted using a pre-trained emotion model, and Mid-level perceptual features. We compare their predictive power by evaluating them on several experiments designed to test performance-wise or piece-wise variations of emotion. We find that Mid-level features show significant contribution in performance-wise variation of both arousal and valence -- even better than the pre-trained emotion model. Our findings add to the evidence of Mid-level perceptual features being an important representation of musical attributes for several tasks -- specifically, in this case, for capturing the expressive aspects of music that manifest as perceived emotion of a musical performance.

Submitted 28 July, 2021; originally announced July 2021.

Comments: In Proceedings of the 22nd International Society for Music Information Retrieval (ISMIR) Conference, Online, 2021

arXiv:2107.01573 [pdf, other]

Arabic Code-Switching Speech Recognition using Monolingual Data

Authors: Ahmed Ali, Shammur Chowdhury, Amir Hussein, Yasser Hifny

Abstract: Code-switching in automatic speech recognition (ASR) is an important challenge due to globalization. Recent research in multilingual ASR shows potential improvement over monolingual systems. We study key issues related to multilingual modeling for ASR through a series of large-scale ASR experiments. Our innovative framework deploys a multi-graph approach in the weighted finite state transducers (WFST) framework. We compare our WFST decoding strategies with a transformer sequence to sequence system trained on the same data. Given a code-switching scenario between Arabic and English, our results show that the WFST decoding approaches were more suitable for the intersentential code-switching datasets. In addition, the transformer system performed better for the intrasentential code-switching task. With this study, we release artificially generated development and test sets, along with an ecological code-switching test set, to benchmark the ASR performance.

Submitted 4 July, 2021; originally announced July 2021.

Comments: Accepted in Interspeech 2021, speech recognition, code-switching, ASR, transformer, WFST, graph approach

arXiv:2107.00439 [pdf, other]

What do End-to-End Speech Models Learn about Speaker, Language and Channel Information? A Layer-wise and Neuron-level Analysis

Authors: Shammur Absar Chowdhury, Nadir Durrani, Ahmed Ali

Abstract: Deep neural networks are inherently opaque and challenging to interpret. Unlike hand-crafted feature-based models, we struggle to comprehend the concepts learned and how they interact within these models. This understanding is crucial not only for debugging purposes but also for ensuring fairness in ethical decision-making. In our study, we conduct a post-hoc functional interpretability analysis of pretrained speech models using the probing framework [1]. Specifically, we analyze utterance-level representations of speech models trained for various tasks such as speaker recognition and dialect identification. We conduct layer and neuron-wise analyses, probing for speaker, language, and channel properties. Our study aims to answer the following questions: i) what information is captured within the representations? ii) how is it represented and distributed? and iii) can we identify a minimal subset of the network that possesses this information? Our results reveal several novel findings, including: i) channel and gender information are distributed across the network, ii) the information is redundantly available in neurons with respect to a task, iii) complex properties such as dialectal information are encoded only in the task-oriented pretrained network, iv) and are localised in the upper layers, v) we can extract a minimal subset of neurons encoding the pre-defined property, vi) salient neurons are sometimes shared between properties, vii) our analysis highlights the presence of biases (for example gender) in the network. Our cross-architectural comparison indicates that: i) the pretrained models capture speaker-invariant information, and ii) CNN models are competitive with Transformer models in encoding various understudied properties.

Submitted 10 July, 2023; v1 submitted 1 July, 2021; originally announced July 2021.

Comments: Accepted in CSL journal. Keywords: Speech, Neuron Analysis, Interpretability, Diagnostic Classifier, AI explainability, End-to-End Architecture

arXiv:2106.13000 [pdf, other]

QASR: QCRI Aljazeera Speech Resource -- A Large Scale Annotated Arabic Speech Corpus

Authors: Hamdy Mubarak, Amir Hussein, Shammur Absar Chowdhury, Ahmed Ali

Abstract: We introduce the largest transcribed Arabic speech corpus, QASR, collected from the broadcast domain. This multi-dialect speech dataset contains 2,000 hours of speech sampled at 16 kHz crawled from the Aljazeera news channel. The dataset is released with lightly supervised transcriptions, aligned with the audio segments. Unlike previous datasets, QASR contains linguistically motivated segmentation, punctuation, and speaker information, among others. QASR is suitable for training and evaluating speech recognition systems, acoustics- and/or linguistics-based Arabic dialect identification, punctuation restoration, speaker identification, speaker linking, and potentially other NLP modules for spoken data. In addition to QASR transcription, we release a dataset of 130M words to aid in designing and training a better language model. We show that end-to-end automatic speech recognition trained on QASR reports a competitive word error rate compared to the previous MGB-2 corpus. We report baseline results for downstream natural language processing tasks such as named entity recognition using speech transcript. We also report the first baseline for Arabic punctuation restoration. We make the corpus available for the research community.

Submitted 24 June, 2021; originally announced June 2021.

Comments: Speech Corpus, Spoken Conversation, ASR, Dialect Identification, Punctuation Restoration, Speaker Verification, NER, Named Entity, Arabic, Speaker gender, Turn-taking. Accepted in ACL 2021

arXiv:2106.07787 [pdf, other]

Tracing Back Music Emotion Predictions to Sound Sources and Intuitive Perceptual Qualities

Authors: Shreyan Chowdhury, Verena Praher, Gerhard Widmer

Abstract: Music emotion recognition is an important task in MIR (Music Information Retrieval) research. Owing to factors like the subjective nature of the task and the variation of emotional cues between musical genres, there are still significant challenges in developing reliable and generalizable models. One important step towards better models would be to understand what a model is actually learning from the data and how the prediction for a particular input is made. In previous work, we have shown how to derive explanations of model predictions in terms of spectrogram image segments that connect to the high-level emotion prediction via a layer of easily interpretable perceptual features. However, that scheme lacks intuitive musical comprehensibility at the spectrogram level. In the present work, we bridge this gap by merging audioLIME -- a source-separation based explainer -- with mid-level perceptual features, thus forming an intuitive connection chain between the input audio and the output emotion predictions. We demonstrate the usefulness of this method by applying it to debug a biased emotion prediction model.

Submitted 16 June, 2021; v1 submitted 14 June, 2021; originally announced June 2021.

Comments: In Proceedings of the 18th Sound and Music Computing Conference (SMC 2021)

arXiv:2106.05885 [pdf, other]

Balanced End-to-End Monolingual pre-training for Low-Resourced Indic Languages Code-Switching Speech Recognition

Authors: Amir Hussein, Shammur Chowdhury, Najim Dehak, Ahmed Ali

Abstract: The success in designing Code-Switching (CS) ASR often depends on the availability of the transcribed CS resources. Such dependency harms the development of ASR in low-resourced languages such as Bengali and Hindi. In this paper, we exploit the transfer learning approach to design End-to-End (E2E) CS ASR systems for the two low-resourced language pairs using different monolingual speech data and a small set of noisy CS data. We trained the CS-ASR in two steps: (i) building a robust bilingual ASR system using a convolution-augmented transformer (Conformer) based acoustic model and n-gram language model, and (ii) fine-tuning the entire E2E ASR with limited noisy CS data. We tested the proposed method using noisy CS data released for Hindi-English and Bengali-English pairs in the Multilingual and Code-Switching ASR Challenges for Low Resource Indian Languages (MUCS 2021) and achieved 3rd place in the CS track. Unlike the leading two systems, which benefited from crawling YouTube and learning transliteration pairs, our proposed transfer learning approach focused on using only the limited CS data with no data-cleaning or data re-segmentation. Our approach achieved 14.1% relative gain in word error rate (WER) in Hindi-English and 27.1% in Bengali-English. We provide detailed guidelines on the steps to finetune the self-attention based model for limited data for ASR. Moreover, we release the code and recipe used in this paper.

Submitted 15 February, 2022; v1 submitted 10 June, 2021; originally announced June 2021.

arXiv:2105.14779 [pdf, other]

Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR

Authors: Shammur Absar Chowdhury, Amir Hussein, Ahmed Abdelali, Ahmed Ali

Abstract: With the advent of globalization, there is an increasing demand for multilingual automatic speech recognition (ASR), handling language and dialectal variation of spoken content. Recent studies show its efficacy over monolingual systems. In this study, we design a large multilingual end-to-end ASR using a self-attention based conformer architecture. We trained the system using Arabic (Ar), English (En) and French (Fr) languages. We evaluate the system performance handling: (i) monolingual (Ar, En and Fr); (ii) multi-dialectal (Modern Standard Arabic, along with dialectal variation such as Egyptian and Moroccan); (iii) code-switching -- cross-lingual (Ar-En/Fr) and dialectal (MSA-Egyptian dialect) test cases, and compare with current state-of-the-art systems. Furthermore, we investigate the influence of different embedding/character representations including character vs word-piece; shared vs distinct input symbol per language. Our findings demonstrate the strength of such a model by outperforming state-of-the-art monolingual dialectal Arabic and code-switching Arabic ASR.

Submitted 5 July, 2021; v1 submitted 31 May, 2021; originally announced May 2021.

Comments: Accepted in INTERSPEECH 2021, Multilingual ASR, Multi-dialectal ASR, Code-Switching ASR, Arabic ASR, Conformer, Transformer, E2E ASR, Speech Recognition, ASR, Arabic, English, French

arXiv:2104.12528 [pdf, other]

Spatio-Temporal Pruning and Quantization for Low-latency Spiking Neural Networks

Authors: Sayeed Shafayet Chowdhury, Isha Garg, Kaushik Roy

Abstract: Spiking Neural Networks (SNNs) are a promising alternative to traditional deep learning methods since they perform event-driven information processing. However, a major drawback of SNNs is high inference latency. The efficiency of SNNs could be enhanced using compression methods such as pruning and quantization. Notably, SNNs, unlike their non-spiking counterparts, consist of a temporal dimension, the compression of which can lead to latency reduction. In this paper, we propose spatial and temporal pruning of SNNs. First, structured spatial pruning is performed by determining the layer-wise significant dimensions using principal component analysis of the average accumulated membrane potential of the neurons. This step leads to 10-14X model compression. Additionally, it enables inference with lower latency and decreases the spike count per inference. To further reduce latency, temporal pruning is performed by gradually reducing the timesteps while training. The networks are trained using surrogate gradient descent based backpropagation and we validate the results on CIFAR10 and CIFAR100, using VGG architectures. The spatiotemporally pruned SNNs achieve 89.04% and 66.4% accuracy on CIFAR10 and CIFAR100, respectively, while performing inference with 3-30X reduced latency compared to state-of-the-art SNNs. Moreover, they require 8-14X less compute energy compared to their unpruned standard deep learning counterparts. The energy numbers are obtained by multiplying the number of operations by the energy per operation. These SNNs also provide 1-4% higher robustness against Gaussian noise corrupted inputs. Furthermore, we perform weight quantization and find that performance remains reasonably stable up to 5-bit quantization.

Submitted 28 April, 2021; v1 submitted 26 April, 2021; originally announced April 2021.

arXiv:2102.13479 [pdf, other]

Towards Explaining Expressive Qualities in Piano Recordings: Transfer of Explanatory Features via Acoustic Domain Adaptation

Authors: Shreyan Chowdhury, Gerhard Widmer

Abstract: Emotion and expressivity in music have been topics of considerable interest in the field of music information retrieval. In recent years, mid-level perceptual features have been suggested as means to explain computational predictions of musical emotion. We find that the diversity of musical styles and genres in the available dataset for learning these features is not sufficient for models to generalise well to specialised acoustic domains such as solo piano music. In this work, we show that by utilising unsupervised domain adaptation together with receptive-field regularised deep neural networks, it is possible to significantly improve generalisation to this domain. Additionally, we demonstrate that our domain-adapted models can better predict and explain expressive qualities in classical piano performances, as perceived and described by human listeners.

Submitted 26 February, 2021; originally announced February 2021.

Comments: 5 pages, 3 figures; accepted for IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021)

arXiv:2101.00691 [pdf, other]

doi 10.1109/TII.2020.3048391

CovTANet: A Hybrid Tri-level Attention Based Network for Lesion Segmentation, Diagnosis, and Severity Prediction of COVID-19 Chest CT Scans

Authors: Tanvir Mahmud, Md. Jahin Alam, Sakib Chowdhury, Shams Nafisa Ali, Md Maisoon Rahman, Shaikh Anowarul Fattah, Mohammad Saquib

Abstract: Rapid and precise diagnosis of COVID-19 is one of the major challenges faced by the global community to control the spread of this overgrowing pandemic. In this paper, a hybrid neural network is proposed, named CovTANet, to provide an end-to-end clinical diagnostic tool for early diagnosis, lesion segmentation, and severity prediction of COVID-19 utilizing chest computed tomography (CT) scans. A multi-phase optimization strategy is introduced for solving the challenges of complicated diagnosis at a very early stage of infection, where an efficient lesion segmentation network is optimized initially and later integrated into a joint optimization framework for the diagnosis and severity prediction tasks, providing feature enhancement of the infected regions. Moreover, for overcoming the challenges with diffused, blurred, and varying shaped edges of COVID lesions with novel and diverse characteristics, a novel segmentation network is introduced, namely Tri-level Attention-based Segmentation Network (TA-SegNet). This network has significantly reduced semantic gaps in subsequent encoding decoding stages, with immense parallelization of multi-scale features for faster convergence, providing considerable performance improvement over traditional networks. Furthermore, a novel tri-level attention mechanism has been introduced, which is repeatedly utilized over the network, combining channel, spatial, and pixel attention schemes for faster and more efficient generalization of contextual information embedded in the feature map through feature re-calibration and enhancement operations. Outstanding performances have been achieved in all three tasks through extensive experimentation on a large publicly available dataset containing 1110 chest CT-volumes, which signifies the effectiveness of the proposed scheme at the current stage of the pandemic.

Submitted 3 January, 2021; originally announced January 2021.

Comments: 10 Pages, 8 figures. This article has been published in IEEE Transactions on Industrial Informatics

arXiv:2008.02194 [pdf, other]

On the Characterization of Expressive Performance in Classical Music: First Results of the Con Espressione Game

Authors: Carlos Cancino-Chacón, Silvan Peter, Shreyan Chowdhury, Anna Aljanaki, Gerhard Widmer

Abstract: A piece of music can be expressively performed, or interpreted, in a variety of ways. With the help of an online questionnaire, the Con Espressione Game, we collected some 1,500 descriptions of expressive character relating to 45 performances of 9 excerpts from classical piano pieces, played by different famous pianists. More specifically, listeners were asked to describe, using freely chosen words (preferably: adjectives), how they perceive the expressive character of the different performances. In this paper, we offer a first account of this new data resource for expressive performance research, and provide an exploratory analysis, addressing three main questions: (1) how similarly do different listeners describe a performance of a piece? (2) what are the main dimensions (or axes) for expressive character emerging from this? and (3) how do measurable parameters of a performance (e.g., tempo, dynamics) and mid- and high-level features that can be predicted by machine learning models (e.g., articulation, arousal) relate to these expressive dimensions? The dataset that we publish along with this paper was enriched by adding hand-corrected score-to-performance alignments, as well as descriptive audio features such as tempo and dynamics curves.

Submitted 5 August, 2020; originally announced August 2020.

Comments: 8 pages, 2 figures, accepted for the 21st International Society for Music Information Retrieval Conference (ISMIR 2020)

arXiv:2007.13503 [pdf, other]

Receptive-Field Regularized CNNs for Music Classification and Tagging

Authors: Khaled Koutini, Hamid Eghbal-Zadeh, Verena Haunschmid, Paul Primus, Shreyan Chowdhury, Gerhard Widmer

Abstract: Convolutional Neural Networks (CNNs) have been successfully used in various Music Information Retrieval (MIR) tasks, both as end-to-end models and as feature extractors for more complex systems. However, the MIR field is still dominated by the classical VGG-based CNN architecture variants, often in combination with more complex modules such as attention, and/or techniques such as pre-training on large datasets. Deeper models such as ResNet -- which surpassed VGG by a large margin in other domains -- are rarely used in MIR. One of the main reasons for this, as we will show, is the lack of generalization of deeper CNNs in the music domain. In this paper, we present a principled way to make deep architectures like ResNet competitive for music-related tasks, based on well-designed regularization strategies. In particular, we analyze the recently introduced Receptive-Field Regularization and Shake-Shake, and show that they significantly improve the generalization of deep CNNs on music-related tasks, and that the resulting deep CNNs can outperform current more complex models such as CNNs augmented with pre-training and attention. We demonstrate this on two different MIR tasks and two corresponding datasets, thus offering our deep regularized CNNs as a new baseline for these datasets, which can also be used as a feature-extracting module in future, more complex approaches.

Submitted 27 July, 2020; originally announced July 2020.

arXiv:2007.10812 [pdf]

Anomaly Detection in Unsupervised Surveillance Setting Using Ensemble of Multimodal Data with Adversarial Defense

Authors: Sayeed Shafayet Chowdhury, Kaji Mejbaul Islam, Rouhan Noor

Abstract: Autonomous aerial surveillance using drone feed is an interesting and challenging research domain. To ensure safety from intruders and potential objects posing threats to the zone being protected, it is crucial to be able to distinguish between normal and abnormal states in real-time. Additionally, we also need to consider any device malfunction. However, the inherent uncertainty embedded within the type and level of abnormality makes supervised techniques less suitable since the adversary may present a unique anomaly for intrusion. As a result, an unsupervised method for anomaly detection is preferable, taking the unpredictable nature of attacks into account. Again in our case, the autonomous drone provides heterogeneous data streams consisting of images and other analog or digital sensor data, all of which can play a role in anomaly detection if they are ensembled synergistically. To that end, an ensemble detection mechanism is proposed here which estimates the degree of abnormality by analyzing the real-time image and IMU (Inertial Measurement Unit) sensor data in an unsupervised manner. First, we have implemented a Convolutional Neural Network (CNN) regression block, named AngleNet, to estimate the angle between a reference image and the current test image, which provides us with a measure of the anomaly of the device. Moreover, the IMU data are used in autoencoders to predict abnormality. Finally, the results from these two pipelines are ensembled to estimate the final degree of abnormality. Furthermore, we have applied an adversarial attack to test the robustness and security of the proposed approach and integrated a defense mechanism. The proposed method performs satisfactorily on the IEEE SP Cup-2020 dataset with an accuracy of 97.8%. Additionally, we have also tested this approach on an in-house dataset to validate its robustness.

Submitted 17 July, 2020; originally announced July 2020.

Comments: arXiv admin note: substantial text overlap with arXiv:2006.03733

arXiv:2007.08607 [pdf]

Optimization of Surface Plasmon Resonance Biosensor for Analysis of Lipid Molecules

Authors: Ehsan Kabir, Syed Mohammad Ashab Uddin, Sayeed Shafayet Chowdhury

Abstract: Surface Plasmon Resonance (SPR) is an important bio-sensing technique for real-time label-free detection. However, it is pivotal to optimize various parameters of the sensor configuration for efficient and highly sensitive sensing. To that effect, we focus on optimizing two different SPR structures -- the basic Kretschmann configuration and narrow groove grating. Our analysis aims to detect two different types of lipids, known as phospholipid and eggyolk, which are used as analyte (sensing layer), while two different types of proteins, namely tryptophan and bovine serum albumin (BSA), are used as ligand (binding site). For both configurations, we investigate all possible lipid-protein combinations to understand the effect of various parameters on sensitivity, minimum reflectivity and full width half maximum (FWHM). Lipids are the structural building blocks of cell membranes, and mutation of these layers by viruses and bacteria is one of the prime reasons for many diseases in our body. Hence, improving the performance of an SPR sensor to detect very small changes in lipid holds immense significance. We use the finite-difference time-domain (FDTD) technique to perform quantitative analysis to get an optimized structure. We find that sensitivity increases when lipid concentration is increased and it is the highest (21.95 degree/RIU) for the phospholipid and tryptophan combination when metal and lipid layer thickness are 45 nm and 30 nm respectively. However, metal layer thickness does not cause any significant variation in sensitivity, but as it increases to 50 nm, minimum reflectivity and full width half maximum (FWHM) decrease to their lowest values. In the case of the narrow groove grating structure, a broad range of wavelengths can generate SPR and the sensitivity is highest (900 nm/RIU) for a configuration of 10 nm groove width and 70 nm groove height at a resonance wavelength of 1411 nm.

Submitted 1 July, 2020; originally announced July 2020.

arXiv:2006.03733 [pdf]

Unsupervised Abnormality Detection Using Heterogeneous Autonomous Systems

Authors: Sayeed Shafayet Chowdhury, Kazi Mejbaul Islam, Rouhan Noor

Abstract: Anomaly detection (AD) in a surveillance scenario is an emerging and challenging field of research. For autonomous vehicles like drones or cars, it is immensely important to distinguish between normal and abnormal states in real-time. Additionally, we also need to detect any device malfunction. But the nature and degree of abnormality may vary depending upon the actual environment and adversary. As a result, it is impractical to model all cases a priori and use supervised methods to classify. Also, an autonomous vehicle provides various data types like images and other analog or digital sensor data, all of which can be useful in anomaly detection if leveraged fruitfully. To that effect, in this paper, a heterogeneous system is proposed which estimates the degree of abnormality of an unmanned surveillance drone, analyzing real-time image and IMU (Inertial Measurement Unit) sensor data in an unsupervised manner. Here, we have demonstrated a Convolutional Neural Network (CNN) architecture, named AngleNet, to estimate the angle between a normal image and another image under consideration, which provides us with a measure of anomaly of the device. Moreover, the IMU data are used in an autoencoder to predict abnormality. Finally, the results from these two algorithms are ensembled to estimate the final degree of abnormality. The proposed method performs satisfactorily on the IEEE SP Cup-2020 dataset with an accuracy of 97.3%. Additionally, we have also tested this approach on an in-house dataset to validate its robustness.

Submitted 14 July, 2020; v1 submitted 5 June, 2020; originally announced June 2020.

arXiv:2005.02704

doi 10.1007/978-3-030-34869-4_45

Fast Geometric Surface based Segmentation of Point Cloud from Lidar Data

Authors: Aritra Mukherjee, Sourya Dipta Das, Jasorsi Ghosh, Ananda S. Chowdhury, Sanjoy Kumar Saha

Abstract: Mapping the environment has been an important task for robot navigation and Simultaneous Localization And Mapping (SLAM). LIDAR provides a fast and accurate 3D point cloud map of the environment which helps in map building. However, processing millions of points in the point cloud becomes a computationally expensive task. In this paper, a methodology is presented to generate the segmented surfaces in real time and these can be used in modeling the 3D objects. At first an algorithm is proposed for efficient map building from single shot data of spinning Lidar. It is based on fast meshing and sub-sampling. It exploits the physical design and the working principle of the spinning Lidar sensor. The generated mesh surfaces are then segmented by estimating the normal and considering their homogeneity. The segmented surfaces can be used as proposals for predicting geometrically accurate model of objects in the robots activity environment. The proposed methodology is compared with some popular point cloud segmentation methods to highlight the efficacy in terms of accuracy and speed.

Submitted 6 May, 2020; originally announced May 2020.

Comments: Accepted to PReMI 2019 (Pattern Recognition and Machine Intelligence 2019). International Conference on Pattern Recognition and Machine Intelligence. Springer, Cham, 2019
