-
Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection
Authors:
Duc-Tuan Truong,
Ruijie Tao,
Tuan Nguyen,
Hieu-Thi Luong,
Kong Aik Lee,
Eng Siong Chng
Abstract:
Recent synthetic speech detectors leveraging the Transformer model have superior performance compared to the convolutional neural network counterparts. This improvement could be due to the powerful modeling ability of the multi-head self-attention (MHSA) in the Transformer model, which learns the temporal relationship of each input token. However, artifacts of synthetic speech can be located in sp…
▽ More
Recent synthetic speech detectors leveraging the Transformer model have superior performance compared to the convolutional neural network counterparts. This improvement could be due to the powerful modeling ability of the multi-head self-attention (MHSA) in the Transformer model, which learns the temporal relationship of each input token. However, artifacts of synthetic speech can be located in specific regions of both frequency channels and temporal segments, while MHSA neglects this temporal-channel dependency of the input sequence. In this work, we proposed a Temporal-Channel Modeling (TCM) module to enhance MHSA's capability for capturing temporal-channel dependencies. Experimental results on the ASVspoof 2021 show that with only 0.03M additional parameters, the TCM module can outperform the state-of-the-art system by 9.25% in EER. Further ablation study reveals that utilizing both temporal and channel information yields the most improvement for detecting synthetic speech.
△ Less
Submitted 25 June, 2024;
originally announced June 2024.
-
Speech Emotion Recognition under Resource Constraints with Data Distillation
Authors:
Yi Chang,
Zhao Ren,
Zhonghao Zhao,
Thanh Tam Nguyen,
Kun Qian,
Tanja Schultz,
Björn W. Schuller
Abstract:
Speech emotion recognition (SER) plays a crucial role in human-computer interaction. The emergence of edge devices in the Internet of Things (IoT) presents challenges in constructing intricate deep learning models due to constraints in memory and computational resources. Moreover, emotional speech data often contains private information, raising concerns about privacy leakage during the deployment…
▽ More
Speech emotion recognition (SER) plays a crucial role in human-computer interaction. The emergence of edge devices in the Internet of Things (IoT) presents challenges in constructing intricate deep learning models due to constraints in memory and computational resources. Moreover, emotional speech data often contains private information, raising concerns about privacy leakage during the deployment of SER models. To address these challenges, we propose a data distillation framework to facilitate efficient development of SER models in IoT applications using a synthesised, smaller, and distilled dataset. Our experiments demonstrate that the distilled dataset can be effectively utilised to train SER models with fixed initialisation, achieving performances comparable to those developed using the original full emotional speech dataset.
△ Less
Submitted 21 June, 2024;
originally announced June 2024.
-
PhoWhisper: Automatic Speech Recognition for Vietnamese
Authors:
Thanh-Thien Le,
Linh The Nguyen,
Dat Quoc Nguyen
Abstract:
We introduce PhoWhisper in five versions for Vietnamese automatic speech recognition. PhoWhisper's robustness is achieved through fine-tuning the Whisper model on an 844-hour dataset that encompasses diverse Vietnamese accents. Our experimental study demonstrates state-of-the-art performances of PhoWhisper on benchmark Vietnamese ASR datasets. We have open-sourced PhoWhisper at: https://github.com…
▽ More
We introduce PhoWhisper in five versions for Vietnamese automatic speech recognition. PhoWhisper's robustness is achieved through fine-tuning the Whisper model on an 844-hour dataset that encompasses diverse Vietnamese accents. Our experimental study demonstrates state-of-the-art performances of PhoWhisper on benchmark Vietnamese ASR datasets. We have open-sourced PhoWhisper at: https://github.com/VinAIResearch/PhoWhisper
△ Less
Submitted 27 March, 2024;
originally announced June 2024.
-
SysCaps: Language Interfaces for Simulation Surrogates of Complex Systems
Authors:
Patrick Emami,
Zhaonan Li,
Saumya Sinha,
Truc Nguyen
Abstract:
Data-driven simulation surrogates help computational scientists study complex systems. They can also help inform impactful policy decisions. We introduce a learning framework for surrogate modeling where language is used to interface with the underlying system being simulated. We call a language description of a system a "system caption", or SysCap. To address the lack of datasets of paired natura…
▽ More
Data-driven simulation surrogates help computational scientists study complex systems. They can also help inform impactful policy decisions. We introduce a learning framework for surrogate modeling where language is used to interface with the underlying system being simulated. We call a language description of a system a "system caption", or SysCap. To address the lack of datasets of paired natural language SysCaps and simulation runs, we use large language models (LLMs) to synthesize high-quality captions. Using our framework, we train multimodal text and timeseries regression models for two real-world simulators of complex energy systems. Our experiments demonstrate the feasibility of designing language interfaces for real-world surrogate models at comparable accuracy to standard baselines. We qualitatively and quantitatively show that SysCaps unlock text-prompt-style surrogate modeling and new generalization abilities beyond what was previously possible. We will release the generated SysCaps datasets and our code to support follow-on studies.
△ Less
Submitted 29 May, 2024;
originally announced May 2024.
-
Deep learning improved autofocus for motion artifact reduction and its application in quantitative susceptibility map**
Authors:
Chao Li,
**wei Zhang,
Hang Zhang,
Jiahao Li,
Pascal Spincemaille,
Thanh D. Nguyen,
Yi Wang
Abstract:
Purpose: To develop a pipeline for motion artifact correction in mGRE and quantitative susceptibility map** (QSM). Methods: Deep learning is integrated with autofocus to improve motion artifact suppression, which is applied QSM of patients with Parkinson's disease (PD). The estimation of affine motion parameters in the autofocus method depends on signal-to-noise ratio and lacks accuracy when dat…
▽ More
Purpose: To develop a pipeline for motion artifact correction in mGRE and quantitative susceptibility map** (QSM). Methods: Deep learning is integrated with autofocus to improve motion artifact suppression, which is applied QSM of patients with Parkinson's disease (PD). The estimation of affine motion parameters in the autofocus method depends on signal-to-noise ratio and lacks accuracy when data sampling occurs outside the k-space center. A deep learning strategy is employed to remove the residual motion artifacts in autofocus. Results: Results obtained in simulated brain data (n =15) with reference truth show that the proposed autofocus deep learning method significantly improves the image quality of mGRE and QSM (p = 0.001 for SSIM, p < 0.0001 for PSNR and RMSE). Results from 10 PD patients with real motion artifacts in QSM have also been corrected using the proposed method and sent to an experienced radiologist for image quality evaluation, and the average image quality score has increased (p=0.0039). Conclusions: The proposed method enables substantial suppression of motion artifacts in mGRE and QSM.
△ Less
Submitted 26 May, 2024;
originally announced May 2024.
-
Remote Sensing Data Assimilation with a Chained Hydrologic-hydraulic Model for Flood Forecasting
Authors:
Thanh Huy Nguyen,
Andrea Piacentini,
Sophie Ricci,
Ludovic Cassan,
Simon Munier,
Quentin Bonassies,
Raquel Rodriguez-Suquet
Abstract:
A chained hydrologic-hydraulic model is implemented using predicted runoff from a large-scale hydrologic model (namely ISBA-CTRIP) as inputs to local hydrodynamic models (TELEMAC-2D) to issue forecasts of water level and flood extent. The uncertainties in the hydrological forcing and in friction parameters are reduced by an Ensemble Kalman Filter that jointly assimilates in-situ water levels and f…
▽ More
A chained hydrologic-hydraulic model is implemented using predicted runoff from a large-scale hydrologic model (namely ISBA-CTRIP) as inputs to local hydrodynamic models (TELEMAC-2D) to issue forecasts of water level and flood extent. The uncertainties in the hydrological forcing and in friction parameters are reduced by an Ensemble Kalman Filter that jointly assimilates in-situ water levels and flood extent maps derived from remote sensing observations. The data assimilation framework is cycled in a real-time forecasting configuration. A cycle consists of a reanalysis and a forecast phase. Over the analysis, observations up to the present are assimilated. An ensemble is then initialized from the last analyzed states and issued forecasts for next 36 hr. Three strategies of forcing data for this forecast are investigated: (i) using CTRIP runoff for reanalysis and forecast, (ii) using observed discharge for analysis, then CTRIP runoff for forecast and (iii) using observed discharge for reanalysis and keep a persistent discharge value for forecast. It was shown that the data assimilation strategy provides a reliable reanalysis in hindcast mode. The combination of observed discharge and CTRIP runoff provides the most accurate results. For all strategies, the quality of the forecast decreases as the lead time increases. When the errors in CTRIP forcing are non-stationary, the forecast capability may be reduced. This work demonstrates that the forcing provided by a hydrologic model, while imperfect, can be efficiently used as input to a hydraulic model to issue reanalysis and forecasts, thanks to the assimilation of in-situ and remote sensing observations.
△ Less
Submitted 1 May, 2024;
originally announced May 2024.
-
Multi-target and multi-stage liver lesion segmentation and detection in multi-phase computed tomography scans
Authors:
Abdullah F. Al-Battal,
Soan T. M. Duong,
Van Ha Tang,
Quang Duc Tran,
Steven Q. H. Truong,
Chien Phan,
Truong Q. Nguyen,
Cheolhong An
Abstract:
Multi-phase computed tomography (CT) scans use contrast agents to highlight different anatomical structures within the body to improve the probability of identifying and detecting anatomical structures of interest and abnormalities such as liver lesions. Yet, detecting these lesions remains a challenging task as these lesions vary significantly in their size, shape, texture, and contrast with resp…
▽ More
Multi-phase computed tomography (CT) scans use contrast agents to highlight different anatomical structures within the body to improve the probability of identifying and detecting anatomical structures of interest and abnormalities such as liver lesions. Yet, detecting these lesions remains a challenging task as these lesions vary significantly in their size, shape, texture, and contrast with respect to surrounding tissue. Therefore, radiologists need to have an extensive experience to be able to identify and detect these lesions. Segmentation-based neural networks can assist radiologists with this task. Current state-of-the-art lesion segmentation networks use the encoder-decoder design paradigm based on the UNet architecture where the multi-phase CT scan volume is fed to the network as a multi-channel input. Although this approach utilizes information from all the phases and outperform single-phase segmentation networks, we demonstrate that their performance is not optimal and can be further improved by incorporating the learning from models trained on each single-phase individually. Our approach comprises three stages. The first stage identifies the regions within the liver where there might be lesions at three different scales (4, 8, and 16 mm). The second stage includes the main segmentation model trained using all the phases as well as a segmentation model trained on each of the phases individually. The third stage uses the multi-phase CT volumes together with the predictions from each of the segmentation models to generate the final segmentation map. Overall, our approach improves relative liver lesion segmentation performance by 1.6% while reducing performance variability across subjects by 8% when compared to the current state-of-the-art models.
△ Less
Submitted 17 April, 2024;
originally announced April 2024.
-
AAM-VDT: Vehicle Digital Twin for Tele-Operations in Advanced Air Mobility
Authors:
Tuan Anh Nguyen,
Taeho Kwag,
Vinh Pham,
Viet Nghia Nguyen,
Jeongseok Hyun,
Minseok Jang,
Jae-Woo Lee
Abstract:
This study advanced tele-operations in Advanced Air Mobility (AAM) through the creation of a Vehicle Digital Twin (VDT) system for eVTOL aircraft, tailored to enhance remote control safety and efficiency, especially for Beyond Visual Line of Sight (BVLOS) operations. By synergizing digital twin technology with immersive Virtual Reality (VR) interfaces, we notably elevate situational awareness and…
▽ More
This study advanced tele-operations in Advanced Air Mobility (AAM) through the creation of a Vehicle Digital Twin (VDT) system for eVTOL aircraft, tailored to enhance remote control safety and efficiency, especially for Beyond Visual Line of Sight (BVLOS) operations. By synergizing digital twin technology with immersive Virtual Reality (VR) interfaces, we notably elevate situational awareness and control precision for remote operators. Our VDT framework integrates immersive tele-operation with a high-fidelity aerodynamic database, essential for authentically simulating flight dynamics and control tactics. At the heart of our methodology lies an eVTOL's high-fidelity digital replica, placed within a simulated reality that accurately reflects physical laws, enabling operators to manage the aircraft via a master-slave dynamic, substantially outperforming traditional 2D interfaces. The architecture of the designed system ensures seamless interaction between the operator, the digital twin, and the actual aircraft, facilitating exact, instantaneous feedback. Experimental assessments, involving propulsion data gathering, simulation database fidelity verification, and tele-operation testing, verify the system's capability in precise control command transmission and maintaining the digital-physical eVTOL synchronization. Our findings underscore the VDT system's potential in augmenting AAM efficiency and safety, paving the way for broader digital twin application in autonomous aerial vehicles.
△ Less
Submitted 15 April, 2024;
originally announced April 2024.
-
Exploring Pathological Speech Quality Assessment with ASR-Powered Wav2Vec2 in Data-Scarce Context
Authors:
Tuan Nguyen,
Corinne Fredouille,
Alain Ghio,
Mathieu Balaguer,
Virginie Woisard
Abstract:
Automatic speech quality assessment has raised more attention as an alternative or support to traditional perceptual clinical evaluation. However, most research so far only gains good results on simple tasks such as binary classification, largely due to data scarcity. To deal with this challenge, current works tend to segment patients' audio files into many samples to augment the datasets. Neverth…
▽ More
Automatic speech quality assessment has raised more attention as an alternative or support to traditional perceptual clinical evaluation. However, most research so far only gains good results on simple tasks such as binary classification, largely due to data scarcity. To deal with this challenge, current works tend to segment patients' audio files into many samples to augment the datasets. Nevertheless, this approach has limitations, as it indirectly relates overall audio scores to individual segments. This paper introduces a novel approach where the system learns at the audio level instead of segments despite data scarcity. This paper proposes to use the pre-trained Wav2Vec2 architecture for both SSL, and ASR as feature extractor in speech assessment. Carried out on the HNC dataset, our ASR-driven approach established a new baseline compared with other approaches, obtaining average $MSE=0.73$ and $MSE=1.15$ for the prediction of intelligibility and severity scores respectively, using only 95 training samples. It shows that the ASR based Wav2Vec2 model brings the best results and may indicate a strong correlation between ASR and speech quality assessment. We also measure its ability on variable segment durations and speech content, exploring factors influencing its decision.
△ Less
Submitted 29 March, 2024;
originally announced March 2024.
-
Spatially temporally distributed informative path planning for multi-robot systems
Authors:
Binh Nguyen,
Linh Nguyen,
Truong X. Nghiem,
Hung La,
Jose Baca,
Pablo Rangel,
Miguel Cid Montoya,
Thang Nguyen
Abstract:
This paper investigates the problem of informative path planning for a mobile robotic sensor network in spatially temporally distributed map**. The robots are able to gather noisy measurements from an area of interest during their movements to build a Gaussian Process (GP) model of a spatio-temporal field. The model is then utilized to predict the spatio-temporal phenomenon at different points o…
▽ More
This paper investigates the problem of informative path planning for a mobile robotic sensor network in spatially temporally distributed map**. The robots are able to gather noisy measurements from an area of interest during their movements to build a Gaussian Process (GP) model of a spatio-temporal field. The model is then utilized to predict the spatio-temporal phenomenon at different points of interest. To spatially and temporally navigate the group of robots so that they can optimally acquire maximal information gains while their connectivity is preserved, we propose a novel multistep prediction informative path planning optimization strategy employing our newly defined local cost functions. By using the dual decomposition method, it is feasible and practical to effectively solve the optimization problem in a distributed manner. The proposed method was validated through synthetic experiments utilizing real-world data sets.
△ Less
Submitted 25 March, 2024;
originally announced March 2024.
-
Forecasting the load of Parcel Pickup Points using a Markov Jump Process
Authors:
Thi-Thu-Tam Nguyen,
Adnane Cabani,
Iyadh Cabani,
Koen De Turck,
Michel Kieffer
Abstract:
The growth of e-commerce has resulted in a surge in parcel deliveries, increasing transportation costs and pollution issues. Alternatives to home delivery have emerged, such as the delivery to so-called parcel pick-up points (PUPs), which eliminates delivery failure due to customers not being at home. Nevertheless, parcels reaching overloaded PUPs may need to be redirected to alternative PUPs, som…
▽ More
The growth of e-commerce has resulted in a surge in parcel deliveries, increasing transportation costs and pollution issues. Alternatives to home delivery have emerged, such as the delivery to so-called parcel pick-up points (PUPs), which eliminates delivery failure due to customers not being at home. Nevertheless, parcels reaching overloaded PUPs may need to be redirected to alternative PUPs, sometimes far from the chosen ones, which may generate customer dissatisfaction. Consequently, predicting the PUP load is critical for a PUP management company to infer the availability of PUPs for future orders and better balance parcel flows between PUPs.
This paper proposes a new approach to forecasting the PUP load evolution using a Markov jump process that models the parcel life cycle. The latest known status of each parcel is considered to estimate its contribution to the future load of its target PUP. This approach can account for the variability of activity, the various parcel preparation delays by sellers, and the diversity of parcel carriers that may result in different delivery delays. Here, results are provided for predicting the load associated with parcels ordered from online retailers by customers (Business-to-Customer, B2C). The proposed approach is generic and can also be applied to other parcel flows to PUPs, such as second-hand products (Customer-to-Customer, C2C) sent via a PUP network.
△ Less
Submitted 22 March, 2024;
originally announced March 2024.
-
Early Flood Warning Using Satellite-Derived Convective System and Precipitation Data -- A Retrospective Case Study of Central Vietnam
Authors:
Tran-Vu La,
Thanh Huy Nguyen,
Patrick Matgen,
Marco Chini
Abstract:
This paper addresses the challenges of an early flood warning caused by complex convective systems (CSs), by using Low-Earth Orbit and Geostationary satellite data. We focus on a sequence of extreme events that took place in central Vietnam during October 2020, with a specific emphasis on the events leading up to the floods, i.e., those occurring before October 10th, 2020. In this critical phase,…
▽ More
This paper addresses the challenges of an early flood warning caused by complex convective systems (CSs), by using Low-Earth Orbit and Geostationary satellite data. We focus on a sequence of extreme events that took place in central Vietnam during October 2020, with a specific emphasis on the events leading up to the floods, i.e., those occurring before October 10th, 2020. In this critical phase, several hydrometeorological indicators could be identified thanks to an increasingly advanced and dense observation network composed of Earth Observation satellites, in particular those enabling the characterization and monitoring of a CS, in terms of low-temperature clouds and heavy rainfall. Himawari-8 images, both individually and in time-series, allow identifying and tracking convective clouds. This is complemented by the observation of heavy/violent rainfall through GPM IMERG data, as well as the detection of strong winds using radiometers and scatterometers. Collectively, these datasets, along with the estimated intensity and duration of the event from each source, form a comprehensive dataset detailing the intricate behaviors of CSs. All of these factors are significant contributors to the magnitude of flooding and the short-term dynamics anticipated in the studied region.
△ Less
Submitted 21 March, 2024;
originally announced March 2024.
-
Assimilation of SWOT Altimetry and Sentinel-1 Flood Extent Observations for Flood Reanalysis -- A Proof-of-Concept
Authors:
Thanh Huy Nguyen,
Sophie Ricci,
Andrea Piacentini,
Charlotte Emery,
Raquel Rodriguez Suquet,
Santiago Peña Luque
Abstract:
In spite of astonishing advances and developments in remote sensing technologies, meeting the spatio-temporal requirements for flood hydrodynamic modeling remains a great challenge for Earth Observation. The assimilation of multi-source remote sensing data in 2D hydrodynamic models participates to overcome such a challenge. The recently launched Surface Water and Ocean Topography (SWOT) wide-swath…
▽ More
In spite of astonishing advances and developments in remote sensing technologies, meeting the spatio-temporal requirements for flood hydrodynamic modeling remains a great challenge for Earth Observation. The assimilation of multi-source remote sensing data in 2D hydrodynamic models participates to overcome such a challenge. The recently launched Surface Water and Ocean Topography (SWOT) wide-swath altimetry satellite provides a global coverage of water surface elevation at a high resolution. SWOT provides complementary observation to radar and optical images, increasing the opportunity to observe and monitor flood events. This research work focuses on the assimilation of 2D flood extent maps derived from Sentinel-1 C-SAR imagery data, and water surface elevation from SWOT as well as in-situ water level measurements. An Ensemble Kalman Filter (EnKF) with a joint state-parameter analysis is implemented on top of a 2D hydrodynamic TELEMAC-2D model to account for errors in roughness, input forcing and water depth in floodplain subdomains. The proposed strategy is carried out in an Observing System Simulation Experiment based on the 2021 flood event over the Garonne Marmandaise catchment. This work makes the most of the large volume of heterogeneous data from space for flood prediction in hindcast mode paves the way for nowcasting.
△ Less
Submitted 21 March, 2024;
originally announced March 2024.
-
Multi-IRS-aided Terahertz Networks: Channel Modelling and User Association With Imperfect CSI
Authors:
Muddasir Rahim,
Thanh Luan Nguyen,
Georges Kaddoum,
Tri Nhu Do
Abstract:
Terahertz (THz) communication is envisioned as one of the candidate technologies for future wireless communications to enable achievable data rates of up to several terabits per second (Tbps). However, the high pathloss and molecular absorption in THz band communications often limit the transmission range. To overcome these limitations, this paper proposes intelligent reconfigurable surface (IRS)-…
▽ More
Terahertz (THz) communication is envisioned as one of the candidate technologies for future wireless communications to enable achievable data rates of up to several terabits per second (Tbps). However, the high pathloss and molecular absorption in THz band communications often limit the transmission range. To overcome these limitations, this paper proposes intelligent reconfigurable surface (IRS)-aided THz networks with imperfect channel state information (CSI). Specifically, we present an angle-based trigonometric channel model to facilitate the performance evaluation of IRS-aided THz networks. In addition, to maximize the sum rate, we formulate the transmitter (Tx)-IRS-receiver (Rx) matching problem, which is a mixed-integer nonlinear programming (MINLP) problem. To address this non-deterministic polynomial-time hard (NP-hard) problem, we propose a Gale-Shapley algorithm-based solutions to obtain stable matching between transmitters and IRSs, and receivers and IRSs, in the first and second sub-problems, respectively. The impact of the transmission power, the number of IRS elements, and the network area on the sum rate are investigated. Furthermore, the proposed algorithm is compared to an exhaustive search, nearest association, greedy search, and random allocation to validate the proposed solution. The complexity and convergence analysis demonstrate that the computational complexity of our algorithm is lower than that of the ES method.
△ Less
Submitted 5 February, 2024;
originally announced March 2024.
-
Fighting Game Adaptive Background Music for Improved Gameplay
Authors:
Ibrahim Khan,
Thai Van Nguyen,
Chollakorn Nimpattanavong,
Ruck Thawonmas
Abstract:
This paper presents our work to enhance the background music (BGM) in DareFightingICE by adding adaptive features. The adaptive BGM consists of three different categories of instruments playing the BGM of the winner sound design from the 2022 DareFightingICE Competition. The BGM adapts by changing the volume of each category of instruments. Each category is connected to a different element of the…
▽ More
This paper presents our work to enhance the background music (BGM) in DareFightingICE by adding adaptive features. The adaptive BGM consists of three different categories of instruments playing the BGM of the winner sound design from the 2022 DareFightingICE Competition. The BGM adapts by changing the volume of each category of instruments. Each category is connected to a different element of the game. We then run experiments to evaluate the adaptive BGM by using a deep reinforcement learning AI agent that only uses audio as input (Blind DL AI). The results show that the performance of the Blind DL AI improves while playing with the adaptive BGM as compared to playing without the adaptive BGM.
△ Less
Submitted 5 March, 2024;
originally announced March 2024.
-
Enhanced DareFightingICE Competitions: Sound Design and AI Competitions
Authors:
Ibrahim Khan,
Chollakorn Nimpattanavong,
Thai Van Nguyen,
Kantinan Plupattanakit,
Ruck Thawonmas
Abstract:
This paper presents a new and improved DareFightingICE platform, a fighting game platform with a focus on visually impaired players (VIPs), in the Unity game engine. It also introduces the separation of the DareFightingICE Competition into two standalone competitions called DareFightingICE Sound Design Competition and DareFightingICE AI Competition--at the 2024 IEEE Conference on Games (CoG)--in w…
▽ More
This paper presents a new and improved DareFightingICE platform, a fighting game platform with a focus on visually impaired players (VIPs), in the Unity game engine. It also introduces the separation of the DareFightingICE Competition into two standalone competitions called DareFightingICE Sound Design Competition and DareFightingICE AI Competition--at the 2024 IEEE Conference on Games (CoG)--in which a new platform will be used. This new platform is an enhanced version of the old DareFightingICE platform, having a better audio system to convey 3D sound and a better way to send audio data to AI agents. With this enhancement and by utilizing Unity, the new DareFightingICE platform is more accessible in terms of adding new features for VIPs and future audio research. This paper also improves the evaluation method for evaluating sound designs in the Sound Design Competition which will ensure a better sound design for VIPs as this competition continues to run at future CoG. To the best of our knowledge, both of our competitions are first of their kind, and the connection between the competitions to mutually improve the entries' quality with time makes these competitions an important part of representing an often overlooked segment within the broader gaming community, VIPs.
△ Less
Submitted 27 April, 2024; v1 submitted 5 March, 2024;
originally announced March 2024.
-
Real-time hybrid controls of energy storage and load shedding for integrated power and energy systems of ships
Authors:
Linh Vu,
Thai-Thanh Nguyen,
Bang Le-Huy Nguyen,
Md Isfakul Anam,
Tuyen Vu
Abstract:
This paper presents an original energy management methodology to enhance the resilience of ship power systems. The integration of various energy storage systems (ESS), including battery energy storage systems (BESS) and super-capacitor energy storage systems (SCESS), in modern ship power systems poses challenges in designing an efficient energy management system (EMS). The EMS proposed in this pap…
▽ More
This paper presents an original energy management methodology to enhance the resilience of ship power systems. The integration of various energy storage systems (ESS), including battery energy storage systems (BESS) and super-capacitor energy storage systems (SCESS), in modern ship power systems poses challenges in designing an efficient energy management system (EMS). The EMS proposed in this paper aims to achieve multiple objectives. The primary objective is to minimize shed loads, while the secondary objective is to effectively manage different types of ESS. Considering the diverse ramp-rate characteristics of generators, SCESS, and BESS, the proposed EMS exploits these differences to determine an optimal long-term schedule for minimizing shed loads. Furthermore, the proposed EMS balances the state-of-charge (SoC) of ESS and prioritizes the SCESS's SoC levels to ensure the efficient operation of BESS and SCESS. For better computational efficiency, we introduce the receding horizon optimization method, enabling real-time EMS implementation. A comparison with the fixed horizon optimization (FHO) validates its effectiveness. Simulation studies and results demonstrate that the proposed EMS efficiently manages generators, BESS, and SCESS, ensuring system resilience under generation shortages. Additionally, the proposed methodology significantly reduces the computational burden compared to the FHO technique while maintaining acceptable resilience performance.
△ Less
Submitted 2 March, 2024;
originally announced March 2024.
-
The Impact of Frequency Bands on Acoustic Anomaly Detection of Machines using Deep Learning Based Model
Authors:
Tin Nguyen,
Lam Pham,
Phat Lam,
Dat Ngo,
Hieu Tang,
Alexander Schindler
Abstract:
In this paper, we propose a deep learning based model for Acoustic Anomaly Detection of Machines, the task for detecting abnormal machines by analysing the machine sound. By conducting extensive experiments, we indicate that multiple techniques of pseudo audios, audio segment, data augmentation, Mahalanobis distance, and narrow frequency bands, which mainly focus on feature engineering, are effect…
▽ More
In this paper, we propose a deep learning based model for Acoustic Anomaly Detection of Machines, the task for detecting abnormal machines by analysing the machine sound. By conducting extensive experiments, we indicate that multiple techniques of pseudo audios, audio segment, data augmentation, Mahalanobis distance, and narrow frequency bands, which mainly focus on feature engineering, are effective to enhance the system performance. Among the evaluating techniques, the narrow frequency bands presents a significant impact. Indeed, our proposed model, which focuses on the narrow frequency bands, outperforms the DCASE baseline on the benchmark dataset of DCASE 2022 Task 2 Development set. The important role of the narrow frequency bands indicated in this paper inspires the research community on the task of Acoustic Anomaly Detection of Machines to further investigate and propose novel network architectures focusing on the frequency bands.
△ Less
Submitted 1 March, 2024;
originally announced March 2024.
-
PIDformer: Transformer Meets Control Theory
Authors:
Tam Nguyen,
César A. Uribe,
Tan M. Nguyen,
Richard G. Baraniuk
Abstract:
In this work, we address two main shortcomings of transformer architectures: input corruption and rank collapse in their output representation. We unveil self-attention as an autonomous state-space model that inherently promotes smoothness in its solutions, leading to lower-rank outputs and diminished representation capacity. Moreover, the steady-state solution of the model is sensitive to input p…
▽ More
In this work, we address two main shortcomings of transformer architectures: input corruption and rank collapse in their output representation. We unveil self-attention as an autonomous state-space model that inherently promotes smoothness in its solutions, leading to lower-rank outputs and diminished representation capacity. Moreover, the steady-state solution of the model is sensitive to input perturbations. We incorporate a Proportional-Integral-Derivative (PID) closed-loop feedback control system with a reference point into the model to improve robustness and representation capacity. This integration aims to preserve high-frequency details while bolstering model stability, rendering it more noise-resilient. The resulting controlled state-space model is theoretically proven robust and adept at addressing the rank collapse. Motivated by this control framework, we derive a novel class of transformers, PID-controlled Transformer (PIDformer), aimed at improving robustness and mitigating the rank-collapse issue inherent in softmax transformers. We empirically evaluate the model for advantages and robustness against baseline transformers across various practical tasks, including object classification, image segmentation, and language modeling.
△ Less
Submitted 25 February, 2024;
originally announced February 2024.
-
Secrecy Performance Analysis of Space-to-Ground Optical Satellite Communications
Authors:
Thang V. Nguyen,
Thanh V. Pham,
Anh T. Pham,
Dang T. Ngoc
Abstract:
Free-space optics (FSO)-based satellite communication systems have recently received considerable attention due to their enhanced capacity compared to their radio frequency (RF) counterparts. This paper analyzes the performance of physical layer security of space-to-ground intensity modulation/direct detection FSO satellite links under the effect of atmospheric loss, misalignment, cloud attenuatio…
▽ More
Free-space optics (FSO)-based satellite communication systems have recently received considerable attention due to their enhanced capacity compared to their radio frequency (RF) counterparts. This paper analyzes the performance of physical layer security of space-to-ground intensity modulation/direct detection FSO satellite links under the effect of atmospheric loss, misalignment, cloud attenuation, and atmospheric turbulence-induced fading. Specifically, a wiretap channel consisting of a legitimate transmitter Alice (i.e., the satellite), a legitimate user Bob, and an eavesdropper Eve over turbulence channels modeled by the Fisher-Snedecor $\mathcal{F}$ distribution is considered. The secrecy performance in terms of the average secrecy capacity, secrecy outage probability, and strictly positive secrecy capacity are derived in closed-form. Simulation results reveal significant impacts of satellite altitude, zenith angle, and turbulence strength on the secrecy performance.
△ Less
Submitted 21 February, 2024;
originally announced February 2024.
-
Q-learning-based Joint Design of Adaptive Modulation and Precoding for Physical Layer Security in Visible Light Communications
Authors:
Duc M. T. Hoang,
Thanh V. Pham,
Anh T. Pham,
Chuyen T Nguyen
Abstract:
There has been an increasing interest in physical layer security (PLS), which, compared with conventional cryptography, offers a unique approach to guaranteeing information confidentiality against eavesdroppers. In this paper, we study a joint design of adaptive $M$-ary pulse amplitude modulation (PAM) and precoding, which aims to optimize wiretap visible-light channels' secrecy capacity and bit e…
▽ More
There has been an increasing interest in physical layer security (PLS), which, compared with conventional cryptography, offers a unique approach to guaranteeing information confidentiality against eavesdroppers. In this paper, we study a joint design of adaptive $M$-ary pulse amplitude modulation (PAM) and precoding, which aims to optimize wiretap visible-light channels' secrecy capacity and bit error rate (BER) performances. The proposed design is motivated by higher-order modulation, which results in better secrecy capacity at the expense of a higher BER. On the other hand, a proper precoding design, which can manipulate the received signal quality at the legitimate user and the eavesdropper, can also enhance secrecy performance and influence the BER. A reward function that considers the secrecy capacity and the BERs of the legitimate user's (Bob) and the eavesdropper's (Eve) channels is introduced and maximized. Due to the non-linearity and complexity of the reward function, it is challenging to solve the optical design using classical optimization techniques. Therefore, reinforcement learning-based designs using Q-learning and Deep Q-learning are proposed to maximize the reward function. Simulation results verify that compared with the baseline designs, the proposed joint designs achieve better reward values while maintaining the BER of Bob's channel (Eve's channel) well below (above) the pre-FEC (forward error correction) BER threshold.
△ Less
Submitted 21 February, 2024;
originally announced February 2024.
-
Integrating LLMs for Explainable Fault Diagnosis in Complex Systems
Authors:
Akshay J. Dave,
Tat Nghia Nguyen,
Richard B. Vilim
Abstract:
This paper introduces an integrated system designed to enhance the explainability of fault diagnostics in complex systems, such as nuclear power plants, where operator understanding is critical for informed decision-making. By combining a physics-based diagnostic tool with a Large Language Model, we offer a novel solution that not only identifies faults but also provides clear, understandable expl…
▽ More
This paper introduces an integrated system designed to enhance the explainability of fault diagnostics in complex systems, such as nuclear power plants, where operator understanding is critical for informed decision-making. By combining a physics-based diagnostic tool with a Large Language Model, we offer a novel solution that not only identifies faults but also provides clear, understandable explanations of their causes and implications. The system's efficacy is demonstrated through application to a molten salt facility, showcasing its ability to elucidate the connections between diagnosed faults and sensor data, answer operator queries, and evaluate historical sensor anomalies. Our approach underscores the importance of merging model-based diagnostics with advanced AI to improve the reliability and transparency of autonomous systems.
△ Less
Submitted 8 February, 2024;
originally announced February 2024.
-
SpiRit-LM: Interleaved Spoken and Written Language Model
Authors:
Tu Anh Nguyen,
Benjamin Muller,
Bokai Yu,
Marta R. Costa-jussa,
Maha Elbayad,
Sravya Popuri,
Paul-Ambroise Duquenne,
Robin Algayres,
Ruslan Mavlyutov,
Itai Gat,
Gabriel Synnaeve,
Juan Pino,
Benoit Sagot,
Emmanuel Dupoux
Abstract:
We introduce SPIRIT-LM, a foundation multimodal language model that freely mixes text and speech. Our model is based on a pretrained text language model that we extend to the speech modality by continuously training it on text and speech units. Speech and text sequences are concatenated as a single set of tokens, and trained with a word-level interleaving method using a small automatically-curated…
▽ More
We introduce SPIRIT-LM, a foundation multimodal language model that freely mixes text and speech. Our model is based on a pretrained text language model that we extend to the speech modality by continuously training it on text and speech units. Speech and text sequences are concatenated as a single set of tokens, and trained with a word-level interleaving method using a small automatically-curated speech-text parallel corpus. SPIRIT-LM comes in two versions: a BASE version that uses speech semantic units and an EXPRESSIVE version that models expressivity using pitch and style units in addition to the semantic units. For both versions, the text is encoded with subword BPE tokens. The resulting model displays both the semantic abilities of text models and the expressive abilities of speech models. Additionally, we demonstrate that SPIRIT-LM is able to learn new tasks in a few-shot fashion across modalities (i.e. ASR, TTS, Speech Classification).
△ Less
Submitted 8 February, 2024;
originally announced February 2024.
-
Multilinear Kernel Regression and Imputation via Manifold Learning
Authors:
Duc Thien Nguyen,
Konstantinos Slavakis
Abstract:
This paper introduces a novel nonparametric framework for data imputation, coined multilinear kernel regression and imputation via the manifold assumption (MultiL-KRIM). Motivated by manifold learning, MultiL-KRIM models data features as a point cloud located in or close to a user-unknown smooth manifold embedded in a reproducing kernel Hilbert space. Unlike typical manifold-learning routes, which…
▽ More
This paper introduces a novel nonparametric framework for data imputation, coined multilinear kernel regression and imputation via the manifold assumption (MultiL-KRIM). Motivated by manifold learning, MultiL-KRIM models data features as a point cloud located in or close to a user-unknown smooth manifold embedded in a reproducing kernel Hilbert space. Unlike typical manifold-learning routes, which seek low-dimensional patterns via regularizers based on graph-Laplacian matrices, MultiL-KRIM builds instead on the intuitive concept of tangent spaces to manifolds and incorporates collaboration among point-cloud neighbors (regressors) directly into the data-modeling term of the loss function. Multiple kernel functions are allowed to offer robustness and rich approximation properties, while multiple matrix factors offer low-rank modeling, integrate dimensionality reduction, and streamline computations with no need of training data. Two important application domains showcase the functionality of MultiL-KRIM: time-varying-graph-signal (TVGS) recovery, and reconstruction of highly accelerated dynamic-magnetic-resonance-imaging (dMRI) data. Extensive numerical tests on real and synthetic data demonstrate MultiL-KRIM's remarkable speedups over its predecessors, and outperformance over prevalent "shallow" data-imputation techniques, with a more intuitive and explainable pipeline than deep-image-prior methods.
△ Less
Submitted 5 February, 2024;
originally announced February 2024.
-
Physical Layer Location Privacy in SIMO Communication Using Fake Paths Injection
Authors:
Trong Duy Tran,
Maxime Ferreira Da Costa,
Linh Trung Nguyen
Abstract:
Fake path injection is an emerging paradigm for inducing privacy over wireless networks. In this paper, fake paths are injected by the transmitter into a SIMO multipath communication channel to preserve her physical location from an eavesdropper. A novel statistical privacy metric is defined as the ratio between the largest (resp. smallest) eigenvalues of Bob's (resp. Eve's) Cramér-Rao lower bound…
▽ More
Fake path injection is an emerging paradigm for inducing privacy over wireless networks. In this paper, fake paths are injected by the transmitter into a SIMO multipath communication channel to preserve her physical location from an eavesdropper. A novel statistical privacy metric is defined as the ratio between the largest (resp. smallest) eigenvalues of Bob's (resp. Eve's) Cramér-Rao lower bound on the SIMO multipath channel parameters to assess the privacy enhancements. Leveraging the spectral properties of generalized Vandermonde matrices, bounds on the privacy margin of the proposed scheme are derived. Specifically, it is shown that the privacy margin increases quadratically in the inverse of the separation between the true and the fake paths under Eve's perspective. Numerical simulations further showcase the approach's benefit.
△ Less
Submitted 2 February, 2024;
originally announced February 2024.
-
Channel Characterization of UAV-RIS-aided Systems with Adaptive Phase-shift Configuration
Authors:
Thanh Luan Nguyen,
Georges Kaddoum,
Tri Nhu Do,
Zygmunt J. Haas
Abstract:
This letter considers a UAV aiding communication between a ground transmitter and a ground receiver in the presence of co-channel interference. A discrete-time Markov process is adopted to model the complex nature of the Air-to-Ground (A2G) channel, including the occurrence of Line-of-Sight, Non-Line-of-Sight, and blockage events. Moreover, an adaptive phase-shift-enabled Reconfigurable Intelligen…
▽ More
This letter considers a UAV aiding communication between a ground transmitter and a ground receiver in the presence of co-channel interference. A discrete-time Markov process is adopted to model the complex nature of the Air-to-Ground (A2G) channel, including the occurrence of Line-of-Sight, Non-Line-of-Sight, and blockage events. Moreover, an adaptive phase-shift-enabled Reconfigurable Intelligent Surface (RIS) is deployed to combat A2G blockage events. Novel frameworks based on the shadowed Rician distribution are proposed to derive closed-form expressions for Ground-to-Air/A2G SINR' distributions. Numerical results show that RISs with large numbers of elements, e.g., 256 RIS elements, improve end-to-end Outage Probability (OP) and reduce blockages.
△ Less
Submitted 30 January, 2024;
originally announced January 2024.
-
User Association Optimization for IRS-aided Terahertz Networks: A Matching Theory Approach
Authors:
Muddasir Rahim,
Thanh Luan Nguyen,
Georges Kaddoum,
Tri Nhu Do
Abstract:
Terahertz (THz) communication is a promising technology for future wireless communications, offering data rates of up to several terabits-per-second (Tbps). However, the range of THz band communications is often limited by high pathloss and molecular absorption. To overcome these challenges, this paper proposes intelligent reconfigurable surfaces (IRSs) to enhance THz communication systems. Specif…
▽ More
Terahertz (THz) communication is a promising technology for future wireless communications, offering data rates of up to several terabits-per-second (Tbps). However, the range of THz band communications is often limited by high pathloss and molecular absorption. To overcome these challenges, this paper proposes intelligent reconfigurable surfaces (IRSs) to enhance THz communication systems. Specifically, we introduce an angle-based trigonometric channel model to evaluate the effectiveness of IRS-aided THz networks. Additionally, to maximize the sum rate, we formulate the source-IRS-destination matching problem, which is a mixed-integer nonlinear programming (MINLP) problem. To solve this non-deterministic polynomial-time hard (NP-hard) problem, the paper proposes a Gale-Shapley-based solution that obtains stable matches between sources and IRSs, as well as between destinations and IRSs in the first and second sub-problems, respectively.
△ Less
Submitted 26 January, 2024;
originally announced January 2024.
-
Statistical Characterization of RIS-assisted UAV Communications in Terrestrial and Non-Terrestrial Networks Under Channel Aging
Authors:
Thanh Luan Nguyen,
Georges Kaddoum,
Tri Nhu Do,
Zygmunt J. Haas
Abstract:
This paper studies the statistical characterization of ground-to-air (G2A) and reconfigurable intelligent surface (RIS)-assisted air-to-ground (A2G) communications with unmanned aerial vehicles (UAVs) in terrestrial and non-terrestrial networks under the impact of channel aging.
We first model the G2A and A2G signal-to-noise ratios (SNRs) as non-central complex Gaussian quadratic random variable…
▽ More
This paper studies the statistical characterization of ground-to-air (G2A) and reconfigurable intelligent surface (RIS)-assisted air-to-ground (A2G) communications with unmanned aerial vehicles (UAVs) in terrestrial and non-terrestrial networks under the impact of channel aging.
We first model the G2A and A2G signal-to-noise ratios (SNRs) as non-central complex Gaussian quadratic random variables (RVs) and derive their exact probability density functions, offering a unique characterization for the A2G SNR as the product of two scaled non-central chi-square RVs. Moreover, we also find that, for a large number of RIS elements, the RIS-assisted A2G channel can be characterized as a single Rician fading channel.
Our results reveal the presence of channel hardening in A2G communication under low UAV speeds, where we derive the maximum target spectral efficiency (SE) for a system to maintain a consistent required outage level. Meanwhile, high UAV speeds, exceeding 50 m/s, lead to a significant performance degradation, which cannot be mitigated by increasing the number of RIS elements.
△ Less
Submitted 30 January, 2024; v1 submitted 25 January, 2024;
originally announced January 2024.
-
FreGrad: Lightweight and Fast Frequency-aware Diffusion Vocoder
Authors:
Tan Dat Nguyen,
Ji-Hoon Kim,
Youngjoon Jang,
Jaehun Kim,
Joon Son Chung
Abstract:
The goal of this paper is to generate realistic audio with a lightweight and fast diffusion-based vocoder, named FreGrad. Our framework consists of the following three key components: (1) We employ discrete wavelet transform that decomposes a complicated waveform into sub-band wavelets, which helps FreGrad to operate on a simple and concise feature space, (2) We design a frequency-aware dilated co…
▽ More
The goal of this paper is to generate realistic audio with a lightweight and fast diffusion-based vocoder, named FreGrad. Our framework consists of the following three key components: (1) We employ discrete wavelet transform that decomposes a complicated waveform into sub-band wavelets, which helps FreGrad to operate on a simple and concise feature space, (2) We design a frequency-aware dilated convolution that elevates frequency awareness, resulting in generating speech with accurate frequency information, and (3) We introduce a bag of tricks that boosts the generation quality of the proposed model. In our experiments, FreGrad achieves 3.7 times faster training time and 2.2 times faster inference speed compared to our baseline while reducing the model size by 0.6 times (only 1.78M parameters) without sacrificing the output quality. Audio samples are available at: https://mm.kaist.ac.kr/projects/FreGrad.
△ Less
Submitted 18 January, 2024;
originally announced January 2024.
-
Beyond Traditional Approaches: Multi-Task Network for Breast Ultrasound Diagnosis
Authors:
Dat T. Chung,
Minh-Anh Dang,
Mai-Anh Vu,
Minh T. Nguyen,
Thanh-Huy Nguyen,
Vinh Q. Dinh
Abstract:
Breast Ultrasound plays a vital role in cancer diagnosis as a non-invasive approach with cost-effective. In recent years, with the development of deep learning, many CNN-based approaches have been widely researched in both tumor localization and cancer classification tasks. Even though previous single models achieved great performance in both tasks, these methods have some limitations in inference…
▽ More
Breast Ultrasound plays a vital role in cancer diagnosis as a non-invasive approach with cost-effective. In recent years, with the development of deep learning, many CNN-based approaches have been widely researched in both tumor localization and cancer classification tasks. Even though previous single models achieved great performance in both tasks, these methods have some limitations in inference time, GPU requirement, and separate fine-tuning for each model. In this study, we aim to redesign and build end-to-end multi-task architecture to conduct both segmentation and classification. With our proposed approach, we achieved outstanding performance and time efficiency, with 79.8% and 86.4% in DeepLabV3+ architecture in the segmentation task.
△ Less
Submitted 14 January, 2024;
originally announced January 2024.
-
The Smooth Trajectory Estimator for LMB Filters
Authors:
Hoa Van Nguyen,
Tran Thien Dat Nguyen,
Changbeom Shim,
Marzhar Anuar
Abstract:
This paper proposes a smooth-trajectory estimator for the labelled multi-Bernoulli (LMB) filter by exploiting the special structure of the generalised labelled multi-Bernoulli (GLMB) filter. We devise a simple and intuitive approach to store the best association map when approximating the GLMB random finite set (RFS) to the LMB RFS. In particular, we construct a smooth-trajectory estimator (i.e.,…
▽ More
This paper proposes a smooth-trajectory estimator for the labelled multi-Bernoulli (LMB) filter by exploiting the special structure of the generalised labelled multi-Bernoulli (GLMB) filter. We devise a simple and intuitive approach to store the best association map when approximating the GLMB random finite set (RFS) to the LMB RFS. In particular, we construct a smooth-trajectory estimator (i.e., an estimator over the entire trajectories of labelled estimates) for the LMB filter based on the history of the best association map and all of the measurements up to the current time. Experimental results under two challenging scenarios demonstrate significant tracking accuracy improvements with negligible additional computational time compared to the conventional LMB filter. The source code is publicly available at https://tinyurl.com/ste-lmb, aimed at promoting advancements in MOT algorithms.
△ Less
Submitted 1 January, 2024;
originally announced January 2024.
-
RimSet: Quantitatively Identifying and Characterizing Chronic Active Multiple Sclerosis Lesion on Quantitative Susceptibility Maps
Authors:
Hang Zhang,
Thanh D. Nguyen,
**wei Zhang,
Renjiu Hu,
Susan A. Gauthier,
Yi Wang
Abstract:
Background: Rim+ lesions in multiple sclerosis (MS), detectable via Quantitative Susceptibility Map** (QSM), correlate with increased disability. Existing literature lacks quantitative analysis of these lesions. We introduce RimSet for quantitative identification and characterization of rim+ lesions on QSM. Methods: RimSet combines RimSeg, an unsupervised segmentation method using level-set meth…
▽ More
Background: Rim+ lesions in multiple sclerosis (MS), detectable via Quantitative Susceptibility Map** (QSM), correlate with increased disability. Existing literature lacks quantitative analysis of these lesions. We introduce RimSet for quantitative identification and characterization of rim+ lesions on QSM. Methods: RimSet combines RimSeg, an unsupervised segmentation method using level-set methodology, and radiomic measurements with Local Binary Pattern texture descriptors. We validated RimSet using simulated QSM images and an in vivo dataset of 172 MS subjects with 177 rim+ and 3986 rim-lesions. Results: RimSeg achieved a 78.7% Dice score against the ground truth, with challenges in partial rim lesions. RimSet detected rim+ lesions with a partial ROC AUC of 0.808 and PR AUC of 0.737, surpassing existing methods. QSMRim-Net showed the lowest mean square error (0.85) and high correlation (0.91; 95% CI: 0.88, 0.93) with expert annotations at the subject level.
△ Less
Submitted 28 December, 2023;
originally announced December 2023.
-
MossFormer2: Combining Transformer and RNN-Free Recurrent Network for Enhanced Time-Domain Monaural Speech Separation
Authors:
Shengkui Zhao,
Yukun Ma,
Chongjia Ni,
Chong Zhang,
Hao Wang,
Trung Hieu Nguyen,
Kun Zhou,
Jiaqi Yip,
Dianwen Ng,
Bin Ma
Abstract:
Our previously proposed MossFormer has achieved promising performance in monaural speech separation. However, it predominantly adopts a self-attention-based MossFormer module, which tends to emphasize longer-range, coarser-scale dependencies, with a deficiency in effectively modelling finer-scale recurrent patterns. In this paper, we introduce a novel hybrid model that provides the capabilities to…
▽ More
Our previously proposed MossFormer has achieved promising performance in monaural speech separation. However, it predominantly adopts a self-attention-based MossFormer module, which tends to emphasize longer-range, coarser-scale dependencies, with a deficiency in effectively modelling finer-scale recurrent patterns. In this paper, we introduce a novel hybrid model that provides the capabilities to model both long-range, coarse-scale dependencies and fine-scale recurrent patterns by integrating a recurrent module into the MossFormer framework. Instead of applying the recurrent neural networks (RNNs) that use traditional recurrent connections, we present a recurrent module based on a feedforward sequential memory network (FSMN), which is considered "RNN-free" recurrent network due to the ability to capture recurrent patterns without using recurrent connections. Our recurrent module mainly comprises an enhanced dilated FSMN block by using gated convolutional units (GCU) and dense connections. In addition, a bottleneck layer and an output layer are also added for controlling information flow. The recurrent module relies on linear projections and convolutions for seamless, parallel processing of the entire sequence. The integrated MossFormer2 hybrid model demonstrates remarkable enhancements over MossFormer and surpasses other state-of-the-art methods in WSJ0-2/3mix, Libri2Mix, and WHAM!/WHAMR! benchmarks.
△ Less
Submitted 18 December, 2023;
originally announced December 2023.
-
Study of cognitive component of auditory attention to natural speech events
Authors:
Nhan D. T. Nguyen,
Kaare Mikkelsen,
Preben Kidmose
Abstract:
Event-related potentials (ERP) have been used to address a wide range of research questions in neuroscience and cognitive psychology including selective auditory attention. The recent progress in auditory attention decoding (AAD) methods is based on algorithms that find a relation between the audio envelope and the neurophysiological response. The most popular approach is based on the reconstructi…
▽ More
Event-related potentials (ERP) have been used to address a wide range of research questions in neuroscience and cognitive psychology including selective auditory attention. The recent progress in auditory attention decoding (AAD) methods is based on algorithms that find a relation between the audio envelope and the neurophysiological response. The most popular approach is based on the reconstruction of the audio envelope based on EEG signals. However, these methods are mainly based on the neurophysiological entrainment to physical attributes of the sensory stimulus and are generally limited by a long detection window. This study proposes a novel approach to auditory attention decoding by looking at higher-level cognitive responses to natural speech. To investigate if natural speech events elicit cognitive ERP components and how these components are affected by attention mechanisms, we designed a series of four experimental paradigms with increasing complexity: a word category oddball paradigm, a word category oddball paradigm with competing speakers, and competing speech streams with and without specific targets. We recorded the electroencephalogram (EEG) from 32 scalp electrodes and 12 in-ear electrodes (ear-EEG) from 24 participants. A cognitive ERP component, which we believe is related to the well-known P3b component, was observed at parietal electrode sites with a latency of approximately 620 ms. The component is statistically most significant for the simplest paradigm and gradually decreases in strength with increasing complexity of the paradigm. We also show that the component can be observed in the in-ear EEG signals by using spatial filtering. The cognitive component elicited by auditory attention may contribute to decoding auditory attention from electrophysiological recordings and its presence in the ear-EEG signals is promising for future applications within hearing aids.
△ Less
Submitted 19 December, 2023; v1 submitted 16 December, 2023;
originally announced December 2023.
-
IncepSE: Leveraging InceptionTime's performance with Squeeze and Excitation mechanism in ECG analysis
Authors:
Tue Minh Cao,
Nhat Hong Tran,
Le Phi Nguyen,
Hieu Huy Pham,
Hung Thanh Nguyen
Abstract:
Our study focuses on the potential for modifications of Inception-like architecture within the electrocardiogram (ECG) domain. To this end, we introduce IncepSE, a novel network characterized by strategic architectural incorporation that leverages the strengths of both InceptionTime and channel attention mechanisms. Furthermore, we propose a training setup that employs stabilization techniques tha…
▽ More
Our study focuses on the potential for modifications of Inception-like architecture within the electrocardiogram (ECG) domain. To this end, we introduce IncepSE, a novel network characterized by strategic architectural incorporation that leverages the strengths of both InceptionTime and channel attention mechanisms. Furthermore, we propose a training setup that employs stabilization techniques that are aimed at tackling the formidable challenges of severe imbalance dataset PTB-XL and gradient corruption. By this means, we manage to set a new height for deep learning model in a supervised learning manner across the majority of tasks. Our model consistently surpasses InceptionTime by substantial margins compared to other state-of-the-arts in this domain, noticeably 0.013 AUROC score improvement in the "all" task, while also mitigating the inherent dataset fluctuations during training.
△ Less
Submitted 16 November, 2023;
originally announced December 2023.
-
Pattern retrieval of traffic congestion using graph-based associations of traffic domain-specific features
Authors:
Tin T. Nguyen,
Simeon C. Calvert,
Guopeng Li,
Hans van Lint
Abstract:
The fast-growing amount of traffic data brings many opportunities for revealing more insightful information about traffic dynamics. However, it also demands an effective database management system in which information retrieval is arguably an important feature. The ability to locate similar patterns in big datasets potentially paves the way for further valuable analyses in traffic management. This…
▽ More
The fast-growing amount of traffic data brings many opportunities for revealing more insightful information about traffic dynamics. However, it also demands an effective database management system in which information retrieval is arguably an important feature. The ability to locate similar patterns in big datasets potentially paves the way for further valuable analyses in traffic management. This paper proposes a content-based retrieval system for spatiotemporal patterns of highway traffic congestion. There are two main components in our framework, namely pattern representation and similarity measurement. To effectively interpret retrieval outcomes, the paper proposes a graph-based approach (relation-graph) for the former component, in which fundamental traffic phenomena are encoded as nodes and their spatiotemporal relationships as edges. In the latter component, the similarities between congestion patterns are customizable with various aspects according to user expectations. We evaluated the proposed framework by applying it to a dataset of hundreds of patterns with various complexities (temporally and spatially). The example queries indicate the effectiveness of the proposed method, i.e. the obtained patterns present similar traffic phenomena as in the given examples. In addition, the success of the proposed approach directly derives a new opportunity for semantic retrieval, in which expected patterns are described by adopting the relation-graph notion to associate fundamental traffic phenomena.
△ Less
Submitted 28 November, 2023;
originally announced November 2023.
-
On the Out of Distribution Robustness of Foundation Models in Medical Image Segmentation
Authors:
Duy Minh Ho Nguyen,
Tan Ngoc Pham,
Nghiem Tuong Diep,
Nghi Quoc Phan,
Quang Pham,
Vinh Tong,
Binh T. Nguyen,
Ngan Hoang Le,
Nhat Ho,
Pengtao Xie,
Daniel Sonntag,
Mathias Niepert
Abstract:
Constructing a robust model that can effectively generalize to test samples under distribution shifts remains a significant challenge in the field of medical imaging. The foundational models for vision and language, pre-trained on extensive sets of natural image and text data, have emerged as a promising approach. It showcases impressive learning abilities across different tasks with the need for…
▽ More
Constructing a robust model that can effectively generalize to test samples under distribution shifts remains a significant challenge in the field of medical imaging. The foundational models for vision and language, pre-trained on extensive sets of natural image and text data, have emerged as a promising approach. It showcases impressive learning abilities across different tasks with the need for only a limited amount of annotated samples. While numerous techniques have focused on develo** better fine-tuning strategies to adapt these models for specific domains, we instead examine their robustness to domain shifts in the medical image segmentation task. To this end, we compare the generalization performance to unseen domains of various pre-trained models after being fine-tuned on the same in-distribution dataset and show that foundation-based models enjoy better robustness than other architectures. From here, we further developed a new Bayesian uncertainty estimation for frozen models and used them as an indicator to characterize the model's performance on out-of-distribution (OOD) data, proving particularly beneficial for real-world applications. Our experiments not only reveal the limitations of current indicators like accuracy on the line or agreement on the line commonly used in natural image applications but also emphasize the promise of the introduced Bayesian uncertainty. Specifically, lower uncertainty predictions usually tend to higher out-of-distribution (OOD) performance.
△ Less
Submitted 18 November, 2023;
originally announced November 2023.
-
Modeling Power Systems Dynamics with Symbolic Physics-Informed Neural Networks
Authors:
Huynh T. T. Tran,
Hieu T. Nguyen
Abstract:
In recent years, scientific machine learning, particularly physic-informed neural networks (PINNs), has introduced new innovative methods to understanding the differential equations that describe power system dynamics, providing a more efficient alternative to traditional methods. However, using a single neural network to capture patterns of all variables requires a large enough size of networks,…
▽ More
In recent years, scientific machine learning, particularly physic-informed neural networks (PINNs), has introduced new innovative methods to understanding the differential equations that describe power system dynamics, providing a more efficient alternative to traditional methods. However, using a single neural network to capture patterns of all variables requires a large enough size of networks, leading to a long time of training and still high computational costs. In this paper, we utilize the interfacing of PINNs with symbolic techniques to construct multiple single-output neural networks by taking the loss function apart and integrating it over the relevant domain. Also, we reweigh the factors of the components in the loss function to improve the performance of the network for instability systems. Our results show that the symbolic PINNs provide higher accuracy with significantly fewer parameters and faster training time. By using the adaptive weight method, the symbolic PINNs can avoid the vanishing gradient problem and numerical instability.
△ Less
Submitted 11 November, 2023;
originally announced November 2023.
-
Label Space Partition Selection for Multi-Object Tracking Using Two-Layer Partitioning
Authors:
Ji Youn Lee,
Changbeom Shim,
Hoa Van Nguyen,
Tran Thien Dat Nguyen,
Hyun** Choi,
Youngho Kim
Abstract:
Estimating the trajectories of multi-objects poses a significant challenge due to data association ambiguity, which leads to a substantial increase in computational requirements. To address such problems, a divide-and-conquer manner has been employed with parallel computation. In this strategy, distinguished objects that have unique labels are grouped based on their statistical dependencies, the i…
▽ More
Estimating the trajectories of multi-objects poses a significant challenge due to data association ambiguity, which leads to a substantial increase in computational requirements. To address such problems, a divide-and-conquer manner has been employed with parallel computation. In this strategy, distinguished objects that have unique labels are grouped based on their statistical dependencies, the intersection of predicted measurements. Several geometry approaches have been used for label grou** since finding all intersected label pairs is clearly infeasible for large-scale tracking problems. This paper proposes an efficient implementation of label grou** for label-partitioned generalized labeled multi-Bernoulli filter framework using a secondary partitioning technique. This allows for parallel computation in the label graph indexing step, avoiding generating and eliminating duplicate comparisons. Additionally, we compare the performance of the proposed technique with several efficient spatial searching algorithms. The results demonstrate the superior performance of the proposed approach on large-scale data sets, enabling scalable trajectory estimation.
△ Less
Submitted 22 October, 2023;
originally announced October 2023.
-
A reproducible 3D convolutional neural network with dual attention module (3D-DAM) for Alzheimer's disease classification
Authors:
Thanh Phuong Vu,
Tien Nhat Nguyen,
N. Minh Nhat Hoang,
Gia Minh Hoang
Abstract:
Alzheimer's disease is one of the most common types of neurodegenerative disease, characterized by the accumulation of amyloid-beta plaque and tau tangles. Recently, deep learning approaches have shown promise in Alzheimer's disease diagnosis. In this study, we propose a reproducible model that utilizes a 3D convolutional neural network with a dual attention module for Alzheimer's disease classifi…
▽ More
Alzheimer's disease is one of the most common types of neurodegenerative disease, characterized by the accumulation of amyloid-beta plaque and tau tangles. Recently, deep learning approaches have shown promise in Alzheimer's disease diagnosis. In this study, we propose a reproducible model that utilizes a 3D convolutional neural network with a dual attention module for Alzheimer's disease classification. We trained the model in the ADNI database and verified the generalizability of our method in two independent datasets (AIBL and OASIS1). Our method achieved state-of-the-art classification performance, with an accuracy of 91.94% for MCI progression classification and 96.30% for Alzheimer's disease classification on the ADNI dataset. Furthermore, the model demonstrated good generalizability, achieving an accuracy of 86.37% on the AIBL dataset and 83.42% on the OASIS1 dataset. These results indicate that our proposed approach has competitive performance and generalizability when compared to recent studies in the field.
△ Less
Submitted 4 March, 2024; v1 submitted 19 October, 2023;
originally announced October 2023.
-
Multi-stage Large Language Model Correction for Speech Recognition
Authors:
Jie Pu,
Thai-Son Nguyen,
Sebastian Stüker
Abstract:
In this paper, we investigate the usage of large language models (LLMs) to improve the performance of competitive speech recognition systems. Different from previous LLM-based ASR error correction methods, we propose a novel multi-stage approach that utilizes uncertainty estimation of ASR outputs and reasoning capability of LLMs. Specifically, the proposed approach has two stages: the first stage…
▽ More
In this paper, we investigate the usage of large language models (LLMs) to improve the performance of competitive speech recognition systems. Different from previous LLM-based ASR error correction methods, we propose a novel multi-stage approach that utilizes uncertainty estimation of ASR outputs and reasoning capability of LLMs. Specifically, the proposed approach has two stages: the first stage is about ASR uncertainty estimation and exploits N-best list hypotheses to identify less reliable transcriptions; The second stage works on these identified transcriptions and performs LLM-based corrections. This correction task is formulated as a multi-step rule-based LLM reasoning process, which uses explicitly written rules in prompts to decompose the task into concrete reasoning steps. Our experimental results demonstrate the effectiveness of the proposed method by showing 10% ~ 20% relative improvement in WER over competitive ASR systems -- across multiple test domains and in zero-shot settings.
△ Less
Submitted 17 June, 2024; v1 submitted 17 October, 2023;
originally announced October 2023.
-
Vision and Language Navigation in the Real World via Online Visual Language Map**
Authors:
Chengguang Xu,
Hieu T. Nguyen,
Christopher Amato,
Lawson L. S. Wong
Abstract:
Navigating in unseen environments is crucial for mobile robots. Enhancing them with the ability to follow instructions in natural language will further improve navigation efficiency in unseen cases. However, state-of-the-art (SOTA) vision-and-language navigation (VLN) methods are mainly evaluated in simulation, neglecting the complex and noisy real world. Directly transferring SOTA navigation poli…
▽ More
Navigating in unseen environments is crucial for mobile robots. Enhancing them with the ability to follow instructions in natural language will further improve navigation efficiency in unseen cases. However, state-of-the-art (SOTA) vision-and-language navigation (VLN) methods are mainly evaluated in simulation, neglecting the complex and noisy real world. Directly transferring SOTA navigation policies trained in simulation to the real world is challenging due to the visual domain gap and the absence of prior knowledge about unseen environments. In this work, we propose a novel navigation framework to address the VLN task in the real world. Utilizing the powerful foundation models, the proposed framework includes four key components: (1) an LLMs-based instruction parser that converts the language instruction into a sequence of pre-defined macro-action descriptions, (2) an online visual-language mapper that builds a real-time visual-language map to maintain a spatial and semantic understanding of the unseen environment, (3) a language indexing-based localizer that grounds each macro-action description into a waypoint location on the map, and (4) a DD-PPO-based local controller that predicts the action. We evaluate the proposed pipeline on an Interbotix LoCoBot WX250 in an unseen lab environment. Without any fine-tuning, our pipeline significantly outperforms the SOTA VLN baseline in the real world.
△ Less
Submitted 16 October, 2023;
originally announced October 2023.
-
Conditional Diffusion Model for Target Speaker Extraction
Authors:
Theodor Nguyen,
Guangzhi Sun,
Xianrui Zheng,
Chao Zhang,
Philip C Woodland
Abstract:
We propose DiffSpEx, a generative target speaker extraction method based on score-based generative modelling through stochastic differential equations. DiffSpEx deploys a continuous-time stochastic diffusion process in the complex short-time Fourier transform domain, starting from the target speaker source and converging to a Gaussian distribution centred on the mixture of sources. For the reverse…
▽ More
We propose DiffSpEx, a generative target speaker extraction method based on score-based generative modelling through stochastic differential equations. DiffSpEx deploys a continuous-time stochastic diffusion process in the complex short-time Fourier transform domain, starting from the target speaker source and converging to a Gaussian distribution centred on the mixture of sources. For the reverse-time process, a parametrised score function is conditioned on a target speaker embedding to extract the target speaker from the mixture of sources. We utilise ECAPA-TDNN target speaker embeddings and condition the score function alternately on the SDE time embedding and the target speaker embedding. The potential of DiffSpEx is demonstrated with the WSJ0-2mix dataset, achieving an SI-SDR of 12.9 dB and a NISQA score of 3.56. Moreover, we show that fine-tuning a pre-trained DiffSpEx model to a specific speaker further improves performance, enabling personalisation in target speaker extraction.
△ Less
Submitted 7 October, 2023;
originally announced October 2023.
-
A multi-institutional pediatric dataset of clinical radiology MRIs by the Children's Brain Tumor Network
Authors:
Ariana M. Familiar,
Anahita Fathi Kazerooni,
Hannah Anderson,
Aliaksandr Lubneuski,
Karthik Viswanathan,
Rocky Breslow,
Nastaran Khalili,
Sina Bagheri,
Debanjan Haldar,
Meen Chul Kim,
Sherjeel Arif,
Rachel Madhogarhia,
Thinh Q. Nguyen,
Elizabeth A. Frenkel,
Zeinab Helili,
Jessica Harrison,
Keyvan Farahani,
Marius George Linguraru,
Ulas Bagci,
Yury Velichko,
Jeffrey Stevens,
Sarah Leary,
Robert M. Lober,
Stephani Campion,
Amy A. Smith
, et al. (15 additional authors not shown)
Abstract:
Pediatric brain and spinal cancers remain the leading cause of cancer-related death in children. Advancements in clinical decision-support in pediatric neuro-oncology utilizing the wealth of radiology imaging data collected through standard care, however, has significantly lagged other domains. Such data is ripe for use with predictive analytics such as artificial intelligence (AI) methods, which…
▽ More
Pediatric brain and spinal cancers remain the leading cause of cancer-related death in children. Advancements in clinical decision-support in pediatric neuro-oncology utilizing the wealth of radiology imaging data collected through standard care, however, has significantly lagged other domains. Such data is ripe for use with predictive analytics such as artificial intelligence (AI) methods, which require large datasets. To address this unmet need, we provide a multi-institutional, large-scale pediatric dataset of 23,101 multi-parametric MRI exams acquired through routine care for 1,526 brain tumor patients, as part of the Children's Brain Tumor Network. This includes longitudinal MRIs across various cancer diagnoses, with associated patient-level clinical information, digital pathology slides, as well as tissue genotype and omics data. To facilitate downstream analysis, treatment-naïve images for 370 subjects were processed and released through the NCI Childhood Cancer Data Initiative via the Cancer Data Service. Through ongoing efforts to continuously build these imaging repositories, our aim is to accelerate discovery and translational AI models with real-world data, to ultimately empower precision medicine for children.
△ Less
Submitted 2 October, 2023;
originally announced October 2023.
-
MVC: A Multi-Task Vision Transformer Network for COVID-19 Diagnosis from Chest X-ray Images
Authors:
Huyen Tran,
Duc Thanh Nguyen,
John Yearwood
Abstract:
Medical image analysis using computer-based algorithms has attracted considerable attention from the research community and achieved tremendous progress in the last decade. With recent advances in computing resources and availability of large-scale medical image datasets, many deep learning models have been developed for disease diagnosis from medical images. However, existing techniques focus on…
▽ More
Medical image analysis using computer-based algorithms has attracted considerable attention from the research community and achieved tremendous progress in the last decade. With recent advances in computing resources and availability of large-scale medical image datasets, many deep learning models have been developed for disease diagnosis from medical images. However, existing techniques focus on sub-tasks, e.g., disease classification and identification, individually, while there is a lack of a unified framework enabling multi-task diagnosis. Inspired by the capability of Vision Transformers in both local and global representation learning, we propose in this paper a new method, namely Multi-task Vision Transformer (MVC) for simultaneously classifying chest X-ray images and identifying affected regions from the input data. Our method is built upon the Vision Transformer but extends its learning capability in a multi-task setting. We evaluated our proposed method and compared it with existing baselines on a benchmark dataset of COVID-19 chest X-ray images. Experimental results verified the superiority of the proposed method over the baselines on both the image classification and affected region identification tasks.
△ Less
Submitted 30 September, 2023;
originally announced October 2023.
-
Low-Resource Self-Supervised Learning with SSL-Enhanced TTS
Authors:
Po-chun Hsu,
Ali Elkahky,
Wei-Ning Hsu,
Yossi Adi,
Tu Anh Nguyen,
Jade Copet,
Emmanuel Dupoux,
Hung-yi Lee,
Abdelrahman Mohamed
Abstract:
Self-supervised learning (SSL) techniques have achieved remarkable results in various speech processing tasks. Nonetheless, a significant challenge remains in reducing the reliance on vast amounts of speech data for pre-training. This paper proposes to address this challenge by leveraging synthetic speech to augment a low-resource pre-training corpus. We construct a high-quality text-to-speech (TT…
▽ More
Self-supervised learning (SSL) techniques have achieved remarkable results in various speech processing tasks. Nonetheless, a significant challenge remains in reducing the reliance on vast amounts of speech data for pre-training. This paper proposes to address this challenge by leveraging synthetic speech to augment a low-resource pre-training corpus. We construct a high-quality text-to-speech (TTS) system with limited resources using SSL features and generate a large synthetic corpus for pre-training. Experimental results demonstrate that our proposed approach effectively reduces the demand for speech data by 90% with only slight performance degradation. To the best of our knowledge, this is the first work aiming to enhance low-resource self-supervised learning in speech processing.
△ Less
Submitted 4 June, 2024; v1 submitted 29 September, 2023;
originally announced September 2023.
-
Circular-Line Trajectory Tracking Controller for Mobile Robot using Multi-Pixy2 Sensors
Authors:
Xuan Quang Ngo,
Tri Duc Tran,
Huy Hung Nguyen,
Van Dong Nguyen,
Van Tu Duong,
Tan Tien Nguyen
Abstract:
This study suggests a novel tracking method that employs three Pixy2 sensors to identify the desired line trajectories instead of traditional perceiving means. Firstly, the kinematic model of the mobile robot is derived from the information gathered by three Pixy2 sensors. Secondly, the sliding mode controller is implemented to regulate the tracking error. Finally, simulation results are analyzed…
▽ More
This study suggests a novel tracking method that employs three Pixy2 sensors to identify the desired line trajectories instead of traditional perceiving means. Firstly, the kinematic model of the mobile robot is derived from the information gathered by three Pixy2 sensors. Secondly, the sliding mode controller is implemented to regulate the tracking error. Finally, simulation results are analyzed to show the effectiveness of the proposed method.
△ Less
Submitted 12 August, 2023;
originally announced September 2023.
-
Energy-Efficient Precoding Designs for Multi-User Visible Light Communication Systems with Confidential Messages
Authors:
Son T. Duong,
Thanh V. Pham,
Chuyen T. Nguyen,
Anh T. Pham
Abstract:
This paper studies energy-efficient precoding designs for multi-user visible light communication (VLC) systems from the perspective of physical layer security where users' messages must be kept mutually confidential. For such systems, we first derive a lower bound on the achievable secrecy rate of each user. Next, the total power consumption for illumination and data transmission is thoroughly ana…
▽ More
This paper studies energy-efficient precoding designs for multi-user visible light communication (VLC) systems from the perspective of physical layer security where users' messages must be kept mutually confidential. For such systems, we first derive a lower bound on the achievable secrecy rate of each user. Next, the total power consumption for illumination and data transmission is thoroughly analyzed. We then tackle the problem of maximizing energy efficiency, given that each user's secrecy rate satisfies a certain threshold. The design problem is shown to be non-convex fractional programming, which renders finding the optimal solution computationally prohibitive. Our aim in this paper is, therefore, to find sub-optimal yet low complexity solutions. For this purpose, the traditional Dinkelbach algorithm is first employed to reformulate the original problem to a non-fractional parameterized one. Two different approaches based on the convex-concave procedure (CCCP) and Semidefinite Relaxation (SDR) are utilized to solve the non-convex parameterized problem. In addition, to further reduce the complexity, we investigate a design using the zero-forcing (ZF) technique. Numerical results are conducted to show the feasibility, convergence, and performance of the proposed algorithms depending on different parameters of the system.
△ Less
Submitted 27 September, 2023;
originally announced September 2023.
-
SPGM: Prioritizing Local Features for enhanced speech separation performance
Authors:
Jia Qi Yip,
Shengkui Zhao,
Yukun Ma,
Chongjia Ni,
Chong Zhang,
Hao Wang,
Trung Hieu Nguyen,
Kun Zhou,
Dianwen Ng,
Eng Siong Chng,
Bin Ma
Abstract:
Dual-path is a popular architecture for speech separation models (e.g. Sepformer) which splits long sequences into overlap** chunks for its intra- and inter-blocks that separately model intra-chunk local features and inter-chunk global relationships. However, it has been found that inter-blocks, which comprise half a dual-path model's parameters, contribute minimally to performance. Thus, we pro…
▽ More
Dual-path is a popular architecture for speech separation models (e.g. Sepformer) which splits long sequences into overlap** chunks for its intra- and inter-blocks that separately model intra-chunk local features and inter-chunk global relationships. However, it has been found that inter-blocks, which comprise half a dual-path model's parameters, contribute minimally to performance. Thus, we propose the Single-Path Global Modulation (SPGM) block to replace inter-blocks. SPGM is named after its structure consisting of a parameter-free global pooling module followed by a modulation module comprising only 2% of the model's total parameters. The SPGM block allows all transformer layers in the model to be dedicated to local feature modelling, making the overall model single-path. SPGM achieves 22.1 dB SI-SDRi on WSJ0-2Mix and 20.4 dB SI-SDRi on Libri2Mix, exceeding the performance of Sepformer by 0.5 dB and 0.3 dB respectively and matches the performance of recent SOTA models with up to 8 times fewer parameters. Model and weights are available at huggingface.co/yipjiaqi/spgm
△ Less
Submitted 10 March, 2024; v1 submitted 21 September, 2023;
originally announced September 2023.
-
Are Soft Prompts Good Zero-shot Learners for Speech Recognition?
Authors:
Dianwen Ng,
Chong Zhang,
Ruixi Zhang,
Yukun Ma,
Fabian Ritter-Gutierrez,
Trung Hieu Nguyen,
Chongjia Ni,
Shengkui Zhao,
Eng Siong Chng,
Bin Ma
Abstract:
Large self-supervised pre-trained speech models require computationally expensive fine-tuning for downstream tasks. Soft prompt tuning offers a simple parameter-efficient alternative by utilizing minimal soft prompt guidance, enhancing portability while also maintaining competitive performance. However, not many people understand how and why this is so. In this study, we aim to deepen our understa…
▽ More
Large self-supervised pre-trained speech models require computationally expensive fine-tuning for downstream tasks. Soft prompt tuning offers a simple parameter-efficient alternative by utilizing minimal soft prompt guidance, enhancing portability while also maintaining competitive performance. However, not many people understand how and why this is so. In this study, we aim to deepen our understanding of this emerging method by investigating the role of soft prompts in automatic speech recognition (ASR). Our findings highlight their role as zero-shot learners in improving ASR performance but also make them vulnerable to malicious modifications. Soft prompts aid generalization but are not obligatory for inference. We also identify two primary roles of soft prompts: content refinement and noise information enhancement, which enhances robustness against background noise. Additionally, we propose an effective modification on noise prompts to show that they are capable of zero-shot learning on adapting to out-of-distribution noise environments.
△ Less
Submitted 17 September, 2023;
originally announced September 2023.