-
CMRxRecon2024: A Multi-Modality, Multi-View K-Space Dataset Boosting Universal Machine Learning for Accelerated Cardiac MRI
Authors:
Zi Wang,
Fanwen Wang,
Chen Qin,
Jun Lyu,
Ouyang Cheng,
Shuo Wang,
Yan Li,
Mengyao Yu,
Haoyu Zhang,
Kunyuan Guo,
Zhang Shi,
Qirong Li,
Ziqiang Xu,
Yajing Zhang,
Hao Li,
Sha Hua,
Binghua Chen,
Longyu Sun,
Mengting Sun,
Qin Li,
Ying-Hua Chu,
Wenjia Bai,
Jing Qin,
Xiahai Zhuang,
Claudia Prieto
, et al. (7 additional authors not shown)
Abstract:
Cardiac magnetic resonance imaging (MRI) has emerged as a clinically gold-standard technique for diagnosing cardiac diseases, thanks to its ability to provide diverse information with multiple modalities and anatomical views. Accelerated cardiac MRI is highly expected to achieve time-efficient and patient-friendly imaging, and then advanced image reconstruction approaches are required to recover high-quality, clinically interpretable images from undersampled measurements. However, the lack of publicly available cardiac MRI k-space dataset in terms of both quantity and diversity has severely hindered substantial technological progress, particularly for data-driven artificial intelligence. Here, we provide a standardized, diverse, and high-quality CMRxRecon2024 dataset to facilitate the technical development, fair evaluation, and clinical transfer of cardiac MRI reconstruction approaches, towards promoting the universal frameworks that enable fast and robust reconstructions across different cardiac MRI protocols in clinical practice. To the best of our knowledge, the CMRxRecon2024 dataset is the largest and most diverse publicly available cardiac k-space dataset. It is acquired from 330 healthy volunteers, covering commonly used modalities, anatomical views, and acquisition trajectories in clinical cardiac MRI workflows. Besides, an open platform with tutorials, benchmarks, and data processing tools is provided to facilitate data usage, advanced method development, and fair performance evaluation.
Submitted 27 June, 2024;
originally announced June 2024.
-
SMRU: Split-and-Merge Recurrent-based UNet for Acoustic Echo Cancellation and Noise Suppression
Authors:
Zhihang Sun,
Andong Li,
Rilin Chen,
Hao Zhang,
Meng Yu,
Yi Zhou,
Dong Yu
Abstract:
The proliferation of deep neural networks has spawned the rapid development of acoustic echo cancellation and noise suppression, and plenty of prior arts have been proposed, which yield promising performance. Nevertheless, they rarely consider the deployment generality in different processing scenarios, such as edge devices, and cloud processing. To this end, this paper proposes a general model, termed SMRU, to cover different application scenarios. The novelty lies in two-fold. First, a multi-scale band split layer and band merge layer are proposed to effectively fuse local frequency bands for lower complexity modeling. Besides, by simulating the multi-resolution feature modeling characteristic of the classical UNet structure, a novel recurrent-dominated UNet is devised. It consists of multiple variable frame rate blocks, each of which involves the causal time down-/up-sampling layer with varying compression ratios and the dual-path structure for inter- and intra-band modeling. The model is configured from 50 M/s to 6.8 G/s in terms of MACs, and the experimental results show that the proposed approach yields competitive or even better performance over existing baselines, and has the full potential to adapt to more general scenarios with varying complexity requirements.
Submitted 16 June, 2024;
originally announced June 2024.
-
Multi-Channel Multi-Speaker ASR Using Target Speaker's Solo Segment
Authors:
Yiwen Shao,
Shi-Xiong Zhang,
Yong Xu,
Meng Yu,
Dong Yu,
Daniel Povey,
Sanjeev Khudanpur
Abstract:
In the field of multi-channel, multi-speaker Automatic Speech Recognition (ASR), the task of discerning and accurately transcribing a target speaker's speech within background noise remains a formidable challenge. Traditional approaches often rely on microphone array configurations and the information of the target speaker's location or voiceprint. This study introduces the Solo Spatial Feature (Solo-SF), an innovative method that utilizes a target speaker's isolated speech segment to enhance ASR performance, thereby circumventing the need for conventional inputs like microphone array layouts. We explore effective strategies for selecting optimal solo segments, a crucial aspect for Solo-SF's success. Through evaluations conducted on the AliMeeting dataset and AISHELL-1 simulations, Solo-SF demonstrates superior performance over existing techniques, significantly lowering Character Error Rates (CER) in various test conditions. Our findings highlight Solo-SF's potential as an effective solution for addressing the complexities of multi-channel, multi-speaker ASR tasks.
Submitted 17 June, 2024; v1 submitted 13 June, 2024;
originally announced June 2024.
-
HawkVision: Low-Latency Modeless Edge AI Serving
Authors:
ChonLam Lao,
Jiaqi Gao,
Ganesh Ananthanarayanan,
Aditya Akella,
Minlan Yu
Abstract:
The trend of modeless ML inference is increasingly growing in popularity as it hides the complexity of model inference from users and caters to diverse user and application accuracy requirements. Previous work mostly focuses on modeless inference in data centers. To provide low-latency inference, in this paper, we promote modeless inference at the edge. The edge environment introduces additional challenges related to low power consumption, limited device memory, and volatile network environments.
To address these challenges, we propose HawkVision, which provides low-latency modeless serving of vision DNNs. HawkVision leverages a two-layer edge-DC architecture that employs confidence scaling to reduce the number of model options while meeting diverse accuracy requirements. It also supports lossy inference under volatile network environments. Our experimental results show that HawkVision outperforms current serving systems by up to 1.6X in P99 latency for providing modeless service. Our FPGA prototype demonstrates similar performance at certain accuracy levels with up to a 3.34X reduction in power consumption.
Submitted 29 May, 2024;
originally announced May 2024.
-
A Valuation Framework for Customers Impacted by Extreme Temperature-Related Outages
Authors:
Min Gyung Yu,
Monish Mukherjee,
Shiva Poudela,
Sadie R. Bender,
Sarmad Hanif,
Trevor D. Hardy,
Hayden M. Reeve
Abstract:
Extreme temperature outages can lead to not just economic losses but also various non-energy impacts (NEI) due to significant degradation of indoor operating conditions caused by service disruptions. However, existing resilience assessment approaches lack specificity for extreme temperature conditions. They often overlook temperature-related mortality and neglect the customer characteristics and grid response in the calculation, despite the significant influence of these factors on NEI-related economic losses. This paper aims to address these gaps by introducing a comprehensive framework to estimate the impact of resilience enhancement not only on the direct economic losses incurred by customers but also on potential NEI, including mortality and the value of statistical life during extreme temperature-related outages. The proposed resilience valuation integrates customer characteristics and grid response variables based on a scalable grid simulation environment. This study adopts a holistic approach to quantify customer-oriented economic impacts, utilizing probabilistic loss scenarios that incorporate health-related factors and damage/loss models as a function of exposure for valuation. The proposed methodology is demonstrated through comparative resilient outage planning, using grid response models emulating a Texas weather zone during the 2021 winter storm Uri. The case study results show that enhanced outage planning with hardened infrastructure can improve the system resilience and thereby reduce the relative risk of mortality by 16% and save the total costs related to non-energy impacts by 74%. These findings underscore the efficacy of the framework by assessing the financial implications of each case, providing valuable insights for decision-makers and stakeholders involved in extreme-weather related resilience planning for risk management and mitigation strategies.
Submitted 6 May, 2024;
originally announced May 2024.
-
Functional Imaging Constrained Diffusion for Brain PET Synthesis from Structural MRI
Authors:
Minhui Yu,
Mengqi Wu,
Ling Yue,
Andrea Bozoki,
Mingxia Liu
Abstract:
Magnetic resonance imaging (MRI) and positron emission tomography (PET) are increasingly used in multimodal analysis of neurodegenerative disorders. While MRI is broadly utilized in clinical settings, PET is less accessible. Many studies have attempted to use deep generative models to synthesize PET from MRI scans. However, they often suffer from unstable training and inadequately preserve brain functional information conveyed by PET. To this end, we propose a functional imaging constrained diffusion (FICD) framework for 3D brain PET image synthesis with paired structural MRI as input condition, through a new constrained diffusion model (CDM). The FICD introduces noise to PET and then progressively removes it with CDM, ensuring high output fidelity throughout a stable training phase. The CDM learns to predict denoised PET with a functional imaging constraint introduced to ensure voxel-wise alignment between each denoised PET and its ground truth. Quantitative and qualitative analyses conducted on 293 subjects with paired T1-weighted MRI and 18F-fluorodeoxyglucose (FDG)-PET scans suggest that FICD achieves superior performance in generating FDG-PET data compared to state-of-the-art methods. We further validate the effectiveness of the proposed FICD on data from a total of 1,262 subjects through three downstream tasks, with experimental results suggesting its utility and generalizability.
Submitted 8 May, 2024; v1 submitted 3 May, 2024;
originally announced May 2024.
-
Unleashing the Power of T1-cells in SFQ Arithmetic Circuits
Authors:
Rassul Bairamkulov,
Mingfei Yu,
Giovanni De Micheli
Abstract:
Rapid single-flux quantum (RSFQ), a leading cryogenic superconductive electronics (SCE) technology, offers extremely low power dissipation and high speed. However, implementing RSFQ systems at VLSI complexity faces challenges, such as substantial area overhead from gate-level pipelining and path balancing, exacerbated by RSFQ's limited layout density. T1 flip-flop (T1-FF) is an RSFQ logic cell operating as a pulse counter. Using T1-FF the full adder function can be realized with only 40% of the area required by the conventional realization. This cell however imposes complex constraints on input signal timing, complicating its use. Multiphase clocking has been recently proposed to alleviate gate-level pipelining overhead. The fanin signals can be efficiently controlled using multiphase clocking. We present the novel two-stage SFQ technology mapping methodology supporting the T1-FF. Compatible parts of the SFQ network are first replaced by the efficient T1-FFs. Multiphase retiming is next applied to assign clock phases to each logic gate and insert DFFs to satisfy the input timing. Using our flow, the area of the SFQ networks is reduced, on average, by 6% with up to 25% reduction in optimizing the 128-bit adder.
Submitted 9 March, 2024;
originally announced March 2024.
-
LMI-based robust model predictive control for a quarter car with series active variable geometry suspension
Authors:
Zilin Feng,
Anastasis Georgiou,
Simos A. Evangelou,
Min Yu,
Imad M Jaimoukha,
Daniele Dini
Abstract:
This paper proposes a robust model predictive control-based solution for the recently introduced series active variable geometry suspension (SAVGS) to improve the ride comfort and road holding of a quarter car. In order to close the gap between the nonlinear multi-body SAVGS model and its linear equivalent, a new uncertain system characterization is proposed that captures unmodeled dynamics, parameter variation, and external disturbances. Based on the newly proposed linear uncertain model for the quarter car SAVGS system, a constrained optimal control problem (OCP) is presented in the form of a linear matrix inequality (LMI) optimization. More specifically, utilizing semidefinite relaxation techniques a state-feedback robust model predictive control (RMPC) scheme is presented and integrated with the nonlinear multi-body SAVGS model, where state-feedback gain and control perturbation are computed online to optimise performance, while physical and design constraints are preserved. Numerical simulation results with different ISO-defined road events demonstrate the robustness and significant performance improvement in terms of ride comfort and road holding of the proposed approach, as compared to the conventional passive suspension, as well as, to actively controlled SAVGS by a previously developed conventional H-infinity control scheme.
Submitted 29 January, 2024; v1 submitted 12 January, 2024;
originally announced January 2024.
-
Windformer: Bi-Directional Long-Distance Spatio-Temporal Network For Wind Speed Prediction
Authors:
Xuewei Li,
Zewen Shang,
Zhiqiang Liu,
Jian Yu,
Wei Xiong,
Mei Yu
Abstract:
Wind speed prediction is critical to the management of wind power generation. Due to the large range of wind speed fluctuations and wake effect, there may also be strong correlations between long-distance wind turbines. This difficult-to-extract feature has become a bottleneck for improving accuracy. History and future time information includes the trend of airflow changes, whether this dynamic information can be utilized will also affect the prediction effect. In response to the above problems, this paper proposes Windformer. First, Windformer divides the wind turbine cluster into multiple non-overlapping windows and calculates correlations inside the windows, then shifts the windows partially to provide connectivity between windows, and finally fuses multi-channel features based on detailed and global information. To dynamically model the change process of wind speed, this paper extracts time series in both history and future directions simultaneously. Compared with other current-advanced methods, the Mean Square Error (MSE) of Windformer is reduced by 0.5% to 15% on two datasets from NERL.
Submitted 24 November, 2023;
originally announced November 2023.
-
Deep Audio Zooming: Beamwidth-Controllable Neural Beamformer
Authors:
Meng Yu,
Dong Yu
Abstract:
Audio zooming, a signal processing technique, enables selective focusing and enhancement of sound signals from a specified region, attenuating others. While traditional beamforming and neural beamforming techniques, centered on creating a directional array, necessitate the designation of a singular target direction, they often overlook the concept of a field of view (FOV), that defines an angular area. In this paper, we proposed a simple yet effective FOV feature, amalgamating all directional attributes within the user-defined field. In conjunction, we've introduced a counter FOV feature capturing directional aspects outside the desired field. Such advancements ensure refined sound capture, particularly emphasizing the FOV's boundaries, and guarantee the enhanced capture of all desired sound sources inside the user-defined field. The results from the experiment demonstrate the efficacy of the introduced angular FOV feature and its seamless incorporation into a low-power subband model suited for real-time applications.
Submitted 21 November, 2023;
originally announced November 2023.
-
Mobility-Induced Graph Learning for WiFi Positioning
Authors:
Kyuwon Han,
Seung Min Yu,
Seong-Lyun Kim,
Seung-Woo Ko
Abstract:
A smartphone-based user mobility tracking could be effective in finding his/her location, while the unpredictable error therein due to low specification of built-in inertial measurement units (IMUs) rejects its standalone usage but demands the integration to another positioning technique like WiFi positioning. This paper aims to propose a novel integration technique using a graph neural network called Mobility-INduced Graph LEarning (MINGLE), which is designed based on two types of graphs made by capturing different user mobility features. Specifically, considering sequential measurement points (MPs) as nodes, a user's regular mobility pattern allows us to connect neighbor MPs as edges, called time-driven mobility graph (TMG). Second, a user's relatively straight transition at a constant pace when moving from one position to another can be captured by connecting the nodes on each path, called a direction-driven mobility graph (DMG). Then, we can design graph convolution network (GCN)-based cross-graph learning, where two different GCN models for TMG and DMG are jointly trained by feeding different input features created by WiFi RTTs yet sharing their weights. Besides, the loss function includes a mobility regularization term such that the differences between adjacent location estimates should be less variant due to the user's stable moving pace. Noting that the regularization term does not require ground-truth location, MINGLE can be designed under semi- and self-supervised learning frameworks. The proposed MINGLE's effectiveness is extensively verified through field experiments, showing a better positioning accuracy than benchmarks, say root mean square errors (RMSEs) being 1.398 (m) and 1.073 (m) for self- and semi-supervised learning cases, respectively.
Submitted 14 November, 2023;
originally announced November 2023.
-
Using ResNet to Utilize 4-class T2-FLAIR Slice Classification Based on the Cholinergic Pathways Hyperintensities Scale for Pathological Aging
Authors:
Wei-Chun Kevin Tsai,
Yi-Chien Liu,
Ming-Chun Yu,
Chia-Ju Chou,
Sui-Hing Yan,
Yang-Teng Fan,
Yan-Hsiang Huang,
Yen-Ling Chiu,
Yi-Fang Chuang,
Ran-Zan Wang,
Yao-Chia Shih
Abstract:
The Cholinergic Pathways Hyperintensities Scale (CHIPS) is a visual rating scale used to assess the extent of cholinergic white matter hyperintensities in T2-FLAIR images, serving as an indicator of dementia severity. However, the manual selection of four specific slices for rating throughout the entire brain is a time-consuming process. Our goal was to develop a deep learning-based model capable of automatically identifying the four slices relevant to CHIPS. To achieve this, we trained a 4-class slice classification model (BSCA) using the ADNI T2-FLAIR dataset (N=150) with the assistance of ResNet. Subsequently, we tested the model's performance on a local dataset (N=30). The results demonstrated the efficacy of our model, with an accuracy of 99.82% and an F1-score of 99.83%. This achievement highlights the potential impact of BSCA as an automatic screening tool, streamlining the selection of four specific T2-FLAIR slices that encompass white matter landmarks along the cholinergic pathways. Clinicians can leverage this tool to assess the risk of clinical dementia development efficiently.
Submitted 9 November, 2023;
originally announced November 2023.
-
Enhancing Building Energy Efficiency through Advanced Sizing and Dispatch Methods for Energy Storage
Authors:
Min Gyung Yu,
Xu Ma,
Bowen Huang,
Karthik Devaprasad,
Fredericka Brown,
Di Wu
Abstract:
Energy storage and electrification of buildings hold great potential for future decarbonized energy systems. However, there are several technical and economic barriers that prevent large-scale adoption and integration of energy storage in buildings. These barriers include integration with building control systems, high capital costs, and the necessity to identify and quantify value streams for different stakeholders. To overcome these obstacles, it is crucial to develop advanced sizing and dispatch methods to assist planning and operational decision-making for integrating energy storage in buildings. This work develops a simple and flexible optimal sizing and dispatch framework for thermal energy storage (TES) and battery energy storage (BES) systems in large-scale office buildings. The optimal sizes of TES, BES, as well as other building assets are determined in a joint manner instead of sequentially to avoid sub-optimal solutions. The solution is determined considering both capital costs in optimal sizing and operational benefits in optimal dispatch. With the optimally sized systems, we implemented real-time operation using the model-based control (MPC), facilitating the effective and efficient management of energy resources. Comprehensive assessments are performed using simulation studies to quantify potential energy and economic benefits by different utility tariffs and climate locations, to improve our understanding of the techno-economic performance of different TES and BES systems, and to identify barriers to adopting energy storage for buildings. Finally, the proposed framework will provide guidance to a broad range of stakeholders to properly design energy storage in buildings and maximize potential benefits, thereby advancing affordable building energy storage deployment and helping accelerate the transition towards a cleaner and more equitable energy economy.
Submitted 19 October, 2023;
originally announced October 2023.
-
How Good Are Synthetic Medical Images? An Empirical Study with Lung Ultrasound
Authors:
Menghan Yu,
Sourabh Kulhare,
Courosh Mehanian,
Charles B Delahunt,
Daniel E Shea,
Zohreh Laverriere,
Ishan Shah,
Matthew P Horning
Abstract:
Acquiring large quantities of data and annotations is known to be effective for developing high-performing deep learning models, but is difficult and expensive to do in the healthcare context. Adding synthetic training data using generative models offers a low-cost method to deal effectively with the data scarcity challenge, and can also address data imbalance and patient privacy issues. In this study, we propose a comprehensive framework that fits seamlessly into model development workflows for medical image analysis. We demonstrate, with datasets of varying size, (i) the benefits of generative models as a data augmentation method; (ii) how adversarial methods can protect patient privacy via data substitution; (iii) novel performance metrics for these use cases by testing models on real holdout data. We show that training with both synthetic and real data outperforms training with real data alone, and that models trained solely with synthetic data approach their real-only counterparts. Code is available at https://github.com/Global-Health-Labs/US-DCGAN.
Submitted 5 October, 2023;
originally announced October 2023.
-
Neural Network Augmented Kalman Filter for Robust Acoustic Howling Suppression
Authors:
Yixuan Zhang,
Hao Zhang,
Meng Yu,
Dong Yu
Abstract:
Acoustic howling suppression (AHS) is a critical challenge in audio communication systems. In this paper, we propose a novel approach that leverages the power of neural networks (NN) to enhance the performance of traditional Kalman filter algorithms for AHS. Specifically, our method involves the integration of NN modules into the Kalman filter, enabling refining reference signal, a key factor in effective adaptive filtering, and estimating covariance metrics for the filter which are crucial for adaptability in dynamic conditions, thereby obtaining improved AHS performance. As a result, the proposed method achieves improved AHS performance compared to both standalone NN and Kalman filter methods. Experimental evaluations validate the effectiveness of our approach.
Submitted 27 September, 2023;
originally announced September 2023.
-
Advancing Acoustic Howling Suppression through Recursive Training of Neural Networks
Authors:
Hao Zhang,
Yixuan Zhang,
Meng Yu,
Dong Yu
Abstract:
In this paper, we introduce a novel training framework designed to comprehensively address the acoustic howling issue by examining its fundamental formation process. This framework integrates a neural network (NN) module into the closed-loop system during training with signals generated recursively on the fly to closely mimic the streaming process of acoustic howling suppression (AHS). The proposed recursive training strategy bridges the gap between training and real-world inference scenarios, marking a departure from previous NN-based methods that typically approach AHS as either noise suppression or acoustic echo cancellation. Within this framework, we explore two methodologies: one exclusively relying on NN and the other combining NN with the traditional Kalman filter. Additionally, we propose strategies, including howling detection and initialization using pre-trained offline models, to bolster trainability and expedite the training process. Experimental results validate that this framework offers a substantial improvement over previous methodologies for acoustic howling suppression.
Submitted 27 September, 2023;
originally announced September 2023.
-
Unifying Robustness and Fidelity: A Comprehensive Study of Pretrained Generative Methods for Speech Enhancement in Adverse Conditions
Authors:
Heming Wang,
Meng Yu,
Hao Zhang,
Chunlei Zhang,
Zhongweiyang Xu,
Muqiao Yang,
Yixuan Zhang,
Dong Yu
Abstract:
Enhancing speech signal quality in adverse acoustic environments is a persistent challenge in speech processing. Existing deep learning based enhancement methods often struggle to effectively remove background noise and reverberation in real-world scenarios, hampering listening experiences. To address these challenges, we propose a novel approach that uses pre-trained generative methods to resynthesize clean, anechoic speech from degraded inputs. This study leverages pre-trained vocoder or codec models to synthesize high-quality speech while enhancing robustness in challenging scenarios. Generative methods effectively handle information loss in speech signals, resulting in regenerated speech that has improved fidelity and reduced artifacts. By harnessing the capabilities of pre-trained models, we achieve faithful reproduction of the original speech in adverse conditions. Experimental evaluations on both simulated datasets and realistic samples demonstrate the effectiveness and robustness of our proposed methods. Especially by leveraging codec, we achieve superior subjective scores for both simulated and realistic recordings. The generated speech exhibits enhanced audio quality, reduced background noise, and reverberation. Our findings highlight the potential of pre-trained generative techniques in speech processing, particularly in scenarios where traditional methods falter. Demos are available at https://whmrtm.github.io/SoundResynthesis.
Submitted 16 September, 2023;
originally announced September 2023.
-
Using electrical impedance spectroscopy to identify equivalent circuit models of lubricated contacts with complex geometry: in-situ application to mini traction machine
Authors:
Min Yu,
Jie Zhang,
Arndt Joedicke,
Tom Reddyhoff
Abstract:
Electrical contact resistance or capacitance measured across a lubricated contact has been used in tribometers, but it only partially reflects the lubrication condition. In contrast, the electrical impedance provides rich information on magnitude and phase, which can be interpreted using equivalent circuit models, enabling more comprehensive measurements, including the variation of lubricant film thickness and the asperity (metal-to-metal) contact area. An accurate circuit model of the lubricated contact is critical for electrical impedance analysis. However, existing circuit models are hand-derived and suited to interfaces with simple geometry, such as parallel plates and concentric or eccentric cylinders. Circuit model identification of lubricated contacts with complex geometry is challenging. This work takes the ball-on-disc lubricated contact in a Mini Traction Machine (MTM) as an example, where screws on the ball, grooves on the disc, and contact close to the disc edge make the overall interface geometry complicated. Electrical impedance spectroscopy (EIS) is used to capture its frequency response, with load, speed, and temperature varied and tested separately. The results enable the identification of equivalent circuit models by fitting parallel resistor-capacitor models. The dependence on the oil film thickness is further calibrated using high-accuracy optical interferometry, operated under the same lubrication condition as in the MTM. Overall, the proposed method is applicable to general lubricated interfaces for the identification of equivalent circuit models, which in turn facilitates in-situ electrical impedance measurement of oil film thickness in tribo-contacts. It does not need transparent materials as optical techniques do, or structural modifications for piezoelectric sensor mounting as ultrasound techniques do.
Submitted 7 July, 2023;
originally announced July 2023.
-
Hybrid AHS: A Hybrid of Kalman Filter and Deep Learning for Acoustic Howling Suppression
Authors:
Hao Zhang,
Meng Yu,
Yuzhong Wu,
Tao Yu,
Dong Yu
Abstract:
Deep learning has been recently introduced for efficient acoustic howling suppression (AHS). However, the recurrent nature of howling creates a mismatch between offline training and streaming inference, limiting the quality of enhanced speech. To address this limitation, we propose a hybrid method that combines a Kalman filter with a self-attentive recurrent neural network (SARNN) to leverage their respective advantages for robust AHS. During offline training, a pre-processed signal obtained from the Kalman filter and an ideal microphone signal generated via teacher-forced training strategy are used to train the deep neural network (DNN). During streaming inference, the DNN's parameters are fixed while its output serves as a reference signal for updating the Kalman filter. Evaluation in both offline and streaming inference scenarios using simulated and real-recorded data shows that the proposed method efficiently suppresses howling and consistently outperforms baselines.
Submitted 4 May, 2023;
originally announced May 2023.
-
Deep Learning for Joint Acoustic Echo and Acoustic Howling Suppression in Hybrid Meetings
Authors:
Hao Zhang,
Meng Yu,
Dong Yu
Abstract:
Hybrid meetings have become increasingly necessary during the post-COVID period and have also brought new challenges for solving audio-related problems. In particular, the interplay between acoustic echo and acoustic howling in a hybrid meeting makes their joint suppression difficult. This paper proposes a deep learning approach to tackle this problem by formulating a recurrent feedback suppression process as an instantaneous speech separation task using the teacher-forced training strategy. Specifically, a self-attentive recurrent neural network is utilized to extract the target speech from microphone recordings with accessible and learned reference signals, thus suppressing acoustic echo and acoustic howling simultaneously. Different combinations of input signals and loss functions have been investigated for performance improvement. Experimental results demonstrate the effectiveness of the proposed method for suppressing echo and howling jointly in hybrid meetings.
Submitted 4 May, 2023; v1 submitted 2 May, 2023;
originally announced May 2023.
-
Two-Stream Joint-Training for Speaker Independent Acoustic-to-Articulatory Inversion
Authors:
Jianrong Wang,
**yu Liu,
Li Liu,
Xuewei Li,
Mei Yu,
Jie Gao,
Qiang Fang
Abstract:
Acoustic-to-articulatory inversion (AAI) aims to estimate the parameters of articulators from speech audio. There are two common challenges in AAI: limited data and unsatisfactory performance in the speaker-independent scenario. Most current works focus on extracting features directly from speech and ignore the importance of phoneme information, which may limit the performance of AAI. To this end, we propose a novel network called SPN that uses two different streams to carry out the AAI task. Firstly, to improve performance in the speaker-independent experiment, we propose a new phoneme stream network to estimate the articulatory parameters as the phoneme features. To the best of our knowledge, this is the first work that extracts speaker-independent features from phonemes to improve the performance of AAI. Secondly, in order to better represent the speech information, we train a speech stream network to combine the local features and the global features. Compared with the state of the art (SOTA), the proposed method reduces RMSE by 0.18 mm and increases the Pearson correlation coefficient by 6.0% in the speaker-independent experiment. The code has been released at https://github.com/liu**yu123/AAINetwork-SPN.
Submitted 26 February, 2023;
originally announced February 2023.
-
Deep AHS: A Deep Learning Approach to Acoustic Howling Suppression
Authors:
Hao Zhang,
Meng Yu,
Dong Yu
Abstract:
In this paper, we formulate acoustic howling suppression (AHS) as a supervised learning problem and propose a deep learning approach, called Deep AHS, to address it. Deep AHS is trained with teacher forcing, which converts the recurrent howling suppression process into an instantaneous speech separation process to simplify the problem and accelerate model training. The proposed method utilizes properly designed features and trains an attention-based recurrent neural network (RNN) to extract the target signal from the microphone recording, thus attenuating the playback signal that may lead to howling. Different training strategies are investigated, and a streaming inference method implemented in a recurrent mode is used to evaluate the performance of the proposed method for real-time howling suppression. Deep AHS avoids howling detection and intrinsically prohibits howling from happening, allowing for more flexibility in the design of audio systems. Experimental results show the effectiveness of the proposed method for howling suppression under different scenarios.
Submitted 17 August, 2023; v1 submitted 18 February, 2023;
originally announced February 2023.
-
NeuralKalman: A Learnable Kalman Filter for Acoustic Echo Cancellation
Authors:
Yixuan Zhang,
Meng Yu,
Hao Zhang,
Dong Yu,
DeLiang Wang
Abstract:
The robustness of the Kalman filter to double talk and its rapid convergence make it a popular approach for addressing acoustic echo cancellation (AEC) challenges. However, the inability to model nonlinearity and the need to tune control parameters cast limitations on such adaptive filtering algorithms. In this paper, we integrate the frequency domain Kalman filter (FDKF) and deep neural networks (DNNs) into a hybrid method, called NeuralKalman, to leverage the advantages of deep learning and adaptive filtering algorithms. Specifically, we employ a DNN to estimate nonlinearly distorted far-end signals, a transition factor, and the nonlinear transition function in the state equation of the FDKF algorithm. Experimental results show that the proposed NeuralKalman improves the performance of FDKF significantly and outperforms strong baseline methods.
Submitted 26 December, 2023; v1 submitted 29 January, 2023;
originally announced January 2023.
-
Hybrid Representation Learning for Cognitive Diagnosis in Late-Life Depression Over 5 Years with Structural MRI
Authors:
Lintao Zhang,
Lihong Wang,
Minhui Yu,
Rong Wu,
David C. Steffens,
Guy G. Potter,
Mingxia Liu
Abstract:
Late-life depression (LLD) is a highly prevalent mood disorder occurring in older adults and is frequently accompanied by cognitive impairment (CI). Studies have shown that LLD may increase the risk of Alzheimer's disease (AD). However, the heterogeneity of presentation of geriatric depression suggests that multiple biological mechanisms may underlie it. Current biological research on LLD progression incorporates machine learning that combines neuroimaging data with clinical observations. There are few studies on incident cognitive diagnostic outcomes in LLD based on structural MRI (sMRI). In this paper, we describe the development of a hybrid representation learning (HRL) framework for predicting cognitive diagnosis over 5 years based on T1-weighted sMRI data. Specifically, we first extract prediction-oriented MRI features via a deep neural network, and then integrate them with handcrafted MRI features via a Transformer encoder for cognitive diagnosis prediction. Two tasks are investigated in this work, including (1) identifying cognitively normal subjects with LLD and never-depressed older healthy subjects, and (2) identifying LLD subjects who developed CI (or even AD) and those who stayed cognitively normal over five years. To the best of our knowledge, this is among the first attempts to study the complex heterogeneous progression of LLD based on task-oriented and handcrafted MRI features. We validate the proposed HRL on 294 subjects with T1-weighted MRIs from two clinically harmonized studies. Experimental results suggest that the HRL outperforms several classical machine learning and state-of-the-art deep learning methods in LLD identification and prediction tasks.
Submitted 24 December, 2022;
originally announced December 2022.
-
Analyzing At-Scale Distribution Grid Response to Extreme Temperatures
Authors:
Sarmad Hanif,
Monish Mukherjee,
Shiva Poudel,
Rohit A **siwale,
Min Gyung Yu,
Trevor Hardy,
Hayden Reeve
Abstract:
Threats against power grids continue to increase, as extreme weather conditions and natural disasters (extreme events) become more frequent. Hence, there is a need for the simulation and modeling of power grids, especially distribution systems, to reflect realistic conditions during extreme events. This paper presents a modeling and simulation platform for electric distribution grids which can estimate overall power demand during extreme weather conditions. The presented platform's efficacy is shown by demonstrating estimation of electrical demand for 1) Electricity Reliability Council of Texas (ERCOT) during winter storm Uri in 2021, and 2) alternative hypothetical scenarios of integrating Distributed Energy Resources (DERs), weatherization, and load electrification. Compared with the actual demand served by ERCOT during winter storm Uri in 2021, the proposed platform estimates approximately 34 GW of peak capacity deficit. For the case of the future electrification of heating loads, a peak capacity of 78 GW (124% increase) is estimated, which would be reduced to 47 GW (38% increase) with the adoption of efficient heating appliances and improved thermal insulation. Integrating distributed solar PV and storage into the grid improves local energy utilization and hence reduces the potential unmet energy by 31% and 40%, respectively.
Submitted 7 December, 2022;
originally announced December 2022.
-
Deep Neural Mel-Subband Beamformer for In-car Speech Separation
Authors:
Vinay Kothapally,
Yong Xu,
Meng Yu,
Shi-Xiong Zhang,
Dong Yu
Abstract:
While current deep learning (DL)-based beamforming techniques have proved effective in speech separation, they are often designed to process narrow-band (NB) frequencies independently, which results in higher computational costs and inference times, making them unsuitable for real-world use. In this paper, we propose a DL-based mel-subband spatio-temporal beamformer to perform speech separation in a car environment with reduced computation cost and inference time. As opposed to conventional subband (SB) approaches, our framework uses a mel-scale based subband selection strategy which ensures fine-grained processing for lower frequencies, where most speech formant structure is present, and coarse-grained processing for higher frequencies. In a recursive way, robust frame-level beamforming weights are determined for each speaker location/zone in a car from the estimated subband speech and noise covariance matrices. Furthermore, the proposed framework also estimates and suppresses any echoes from the loudspeaker(s) by using the echo reference signals. We compare the performance of our proposed framework to several NB, SB, and full-band (FB) processing techniques in terms of speech quality and recognition metrics. Based on experimental evaluations on simulated and real-world recordings, we find that our proposed framework achieves better separation performance than all SB and FB approaches and achieves performance closer to NB processing techniques while requiring lower computing cost.
Submitted 11 March, 2023; v1 submitted 22 November, 2022;
originally announced November 2022.
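
To make the mel-scale subband selection idea in the abstract above concrete, here is a minimal sketch of one way to split STFT frequency bins into mel-spaced subbands. It is not the authors' implementation: the function names, the common 2595/700 mel formula, and the parameter values (257 bins, 16 kHz, 16 subbands) are assumptions chosen for illustration.

```python
import numpy as np


def hz_to_mel(freq_hz):
    """Convert Hz to mel using the common 2595 * log10(1 + f / 700) formula."""
    return 2595.0 * np.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_subband_edges(n_fft_bins, sample_rate, n_subbands):
    """Return bin-index edges that split [0, n_fft_bins) into mel-spaced subbands.

    Subband k covers bins edges[k] .. edges[k + 1] - 1. Edges are equally
    spaced on the mel axis, so low-frequency subbands are narrow and
    high-frequency subbands are wide. Assumes n_subbands is small enough
    relative to n_fft_bins that no subband ends up empty after rounding.
    """
    nyquist = sample_rate / 2.0
    mel_edges = np.linspace(0.0, hz_to_mel(nyquist), n_subbands + 1)
    hz_edges = mel_to_hz(mel_edges)
    edges = np.round(hz_edges / nyquist * (n_fft_bins - 1)).astype(int)
    edges[-1] = n_fft_bins  # make the last subband include the Nyquist bin
    return edges


# Hypothetical setup: 512-point FFT (257 bins) at 16 kHz, grouped into 16 subbands.
edges = mel_subband_edges(n_fft_bins=257, sample_rate=16000, n_subbands=16)
widths = np.diff(edges)  # number of bins per subband
```

Processing one subband at a time instead of one bin at a time reduces the number of independent beamforming problems from the bin count to the subband count, which is the source of the computational saving the abstract describes.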
-
LiSnowNet: Real-time Snow Removal for LiDAR Point Cloud
Authors:
Ming-Yuan Yu,
Ram Vasudevan,
Matthew Johnson-Roberson
Abstract:
LiDARs have been widely adopted in modern self-driving vehicles, providing 3D information of the scene and surrounding objects. However, adverse weather conditions still pose significant challenges to LiDARs since point clouds captured during snowfall can easily be corrupted. The resulting noisy point clouds degrade downstream tasks such as mapping. Existing works in de-noising point clouds corrupted by snow are based on nearest-neighbor search, and thus do not scale well with modern LiDARs which usually capture $100k$ or more points at 10Hz. In this paper, we introduce an unsupervised de-noising algorithm, LiSnowNet, running 52$\times$ faster than the state-of-the-art methods while achieving superior performance in de-noising. Unlike previous methods, the proposed algorithm is based on a deep convolutional neural network and can be easily deployed to hardware accelerators such as GPUs. In addition, we demonstrate how to use the proposed method for mapping even with corrupted point clouds.
Submitted 17 November, 2022;
originally announced November 2022.
-
Parametrically driven inertial sensing in chip-scale optomechanical cavities at the thermodynamical limits with extended dynamic range
Authors:
Jaime Gonzalo Flor Flores,
Talha Yerebakan,
Wenting Wang,
Mingbin Yu,
Dim-Lee Kwong,
Andrey Matsko,
Chee Wei Wong
Abstract:
Recent scientific and technological advances have enabled the detection of gravitational waves, autonomous driving, and the proposal of a communications network on the Moon (Lunar Internet or LunaNet). These efforts are based on the measurement of minute displacements and correspondingly the forces or fields transduction, which translate to acceleration, velocity, and position determination for navigation. State-of-the-art accelerometers use capacitive or piezo resistive techniques, and micro-electromechanical systems (MEMS) via integrated circuit (IC) technologies in order to drive the transducer and convert its output for electric readout. In recent years, laser optomechanical transduction and readout have enabled highly sensitive detection of motional displacement. Here we further examine the theoretical framework for the novel mechanical frequency readout technique of optomechanical transduction when the sensor is driven into oscillation mode [8]. We demonstrate theoretical and physical agreement and characterize the most relevant performance parameters with a device with 1.5 mg/Hz acceleration sensitivity, a 2.5 fm/Hz$^{1/2}$ displacement resolution corresponding to a 17.02 $\mu$g/Hz$^{1/2}$ force-equivalent acceleration, and a 5.91 Hz/nW power sensitivity, at the thermodynamical limits. In addition, we present a novel technique for dynamic range extension while maintaining the precision sensing sensitivity. Our inertial accelerometer is integrated on-chip, and enabled for packaging, with a laser-detuning-enabled approach.
Submitted 30 October, 2022;
originally announced October 2022.
-
MVNet: Memory Assistance and Vocal Reinforcement Network for Speech Enhancement
Authors:
Jianrong Wang,
Xiaomin Li,
Xuewei Li,
Mei Yu,
Qiang Fang,
Li Liu
Abstract:
Speech enhancement improves speech quality and promotes the performance of various downstream tasks. However, most current speech enhancement work has been devoted to improving the performance of downstream automatic speech recognition (ASR), and only a relatively small amount of work has focused on the automatic speaker verification (ASV) task. In this work, we propose MVNet, which consists of a memory assistance module that improves the performance of downstream ASR and a vocal reinforcement module that boosts the performance of ASV. In addition, we design a new loss function to improve speaker vocal similarity. Experimental results on the Libri2mix dataset show that our method outperforms baseline methods in several metrics, including speech quality, intelligibility, and speaker vocal similarity, among others.
Submitted 15 September, 2022;
originally announced September 2022.
-
Identification of cancer-keeping genes as therapeutic targets by finding network control hubs
Authors:
Xizhe Zhang,
Chunyu Pan,
Xinru Wei,
Meng Yu,
Shuangjie Liu,
Jun An,
Jie** Yang,
Baojun Wei,
Wenjun Hao,
Yang Yao,
Yuyan Zhu,
Weixiong Zhang
Abstract:
Finding cancer driver genes has been a focal theme of cancer research and clinical studies. One of the recent approaches is based on network structural controllability that focuses on finding a control scheme and driver genes that can steer the cell from an arbitrary state to a designated state. While theoretically sound, this approach is impractical for many reasons, e.g., the control scheme is often not unique and half of the nodes may be driver genes for the cell. We developed a novel approach that transcends structural controllability. Instead of considering driver genes for one control scheme, we considered control hub genes that reside in the middle of a control path of every control scheme. Control hubs are the most vulnerable spots for controlling the cell and exogenous stimuli on them may render the cell uncontrollable. We adopted control hubs as cancer-keep genes (CKGs) and applied them to a gene regulatory network of bladder cancer (BLCA). All the genes on the cell cycle and p53 signaling pathways in BLCA are CKGs, confirming the importance of these genes and the two pathways in cancer. A smaller set of 35 sensitive CKGs (sCKGs) for BLCA was identified by removing network links. Six sCKGs (RPS6KA3, FGFR3, N-cadherin (CDH2), EP300, caspase-1, and FN1) were subjected to small-interfering-RNA knockdown in four cell lines to validate their effects on the proliferation or migration of cancer cells. Knocking down RPS6KA3 in a mouse model of BLCA significantly inhibited the growth of tumor xenografts. Combined, our results demonstrated the value of CKGs as therapeutic targets for cancer therapy and the potential of CKGs as an effective means for studying and characterizing cancer etiology.
Submitted 13 June, 2022;
originally announced June 2022.
-
NeuralEcho: A Self-Attentive Recurrent Neural Network For Unified Acoustic Echo Suppression And Speech Enhancement
Authors:
Meng Yu,
Yong Xu,
Chunlei Zhang,
Shi-Xiong Zhang,
Dong Yu
Abstract:
Acoustic echo cancellation (AEC) plays an important role in the full-duplex speech communication as well as the front-end speech enhancement for recognition in the conditions when the loudspeaker plays back. In this paper, we present an all-deep-learning framework that implicitly estimates the second order statistics of echo/noise and target speech, and jointly solves echo and noise suppression through an attention based recurrent neural network. The proposed model outperforms the state-of-the-art joint echo cancellation and speech enhancement method F-T-LSTM in terms of objective speech quality metrics, speech recognition accuracy and model complexity. We show that this model can work with speaker embedding for better target speech enhancement and furthermore develop a branch for automatic gain control (AGC) task to form an all-in-one front-end speech enhancement system.
Submitted 20 May, 2022;
originally announced May 2022.
-
Reinforced Swin-Convs Transformer for Underwater Image Enhancement
Authors:
Tingdi Ren,
Haiyong Xu,
Gangyi Jiang,
Mei Yu,
Ting Luo
Abstract:
Underwater Image Enhancement (UIE) technology aims to tackle the challenge of restoring underwater images degraded by light absorption and scattering. To address these problems, a novel U-Net based Reinforced Swin-Convs Transformer for the Underwater Image Enhancement method (URSCT-UIE) is proposed. Specifically, to address the deficiency of U-Net based on pure convolutions, we embedded the Swin Transformer into U-Net to improve its ability to capture global dependencies. Then, given the inadequacy of the Swin Transformer in capturing local attention, we reintroduce convolutions to capture more local attention. Thus, we provide an ingenious way of fusing convolutions and the core attention mechanism to build a Reinforced Swin-Convs Transformer Block (RSCTB) for capturing more local attention, which is reinforced in the channel and the spatial attention of the Swin Transformer. Finally, the experimental results on available datasets demonstrate that the proposed URSCT-UIE achieves state-of-the-art performance compared with other methods in terms of both subjective and objective evaluations. The code will be released on GitHub after acceptance.
Submitted 1 May, 2022;
originally announced May 2022.
-
EEND-SS: Joint End-to-End Neural Speaker Diarization and Speech Separation for Flexible Number of Speakers
Authors:
Soumi Maiti,
Yushi Ueda,
Shinji Watanabe,
Chunlei Zhang,
Meng Yu,
Shi-Xiong Zhang,
Yong Xu
Abstract:
In this paper, we present a novel framework that jointly performs three tasks: speaker diarization, speech separation, and speaker counting. Our proposed framework integrates speaker diarization based on end-to-end neural diarization (EEND) models, speaker counting with encoder-decoder based attractors (EDA), and speech separation using Conv-TasNet. In addition, we propose a multiple 1x1 convolutional layer architecture for estimating the separation masks corresponding to a flexible number of speakers and a fusion technique for refining the separated speech signal with obtained speaker diarization information to improve the joint framework. Experiments using the LibriMix dataset show that our proposed method outperforms the single-task baselines in both diarization and separation metrics for fixed and flexible numbers of speakers and improves speaker counting performance for flexible numbers of speakers. All materials will be open-sourced and reproducible in ESPnet toolkit.
Submitted 15 December, 2022; v1 submitted 31 March, 2022;
originally announced March 2022.
-
Enhancing Zero-Shot Many to Many Voice Conversion with Self-Attention VAE
Authors:
Ziang Long,
Yunling Zheng,
Meng Yu,
Jack Xin
Abstract:
Variational auto-encoder (VAE) is an effective neural network architecture to disentangle a speech utterance into speaker identity and linguistic content latent embeddings, then generate an utterance for a target speaker from that of a source speaker. This is possible by concatenating the identity embedding of the target speaker and the content embedding of the source speaker uttering a desired sentence. In this work, we propose to improve VAE models with self-attention and structural regularization (RGSM). Specifically, we found a suitable location of VAE's decoder to add a self-attention layer for incorporating non-local information in generating a converted utterance and hiding the source speaker's identity. We applied relaxed group-wise splitting method (RGSM) to regularize network weights and remarkably enhance generalization performance.
In experiments on the zero-shot many-to-many voice conversion task on the VCTK data set, with the self-attention layer and the relaxed group-wise splitting method, our model improves speaker classification accuracy on unseen speakers by 28.3% while slightly improving converted voice quality in terms of MOSNet scores. Our encouraging findings point to future research on integrating a greater variety of attention structures into the VAE framework while controlling model size and overfitting, to advance zero-shot many-to-many voice conversion.
Submitted 22 August, 2022; v1 submitted 29 March, 2022;
originally announced March 2022.
-
Feedforward PID Control of Full-Car with Parallel Active Link Suspension for Improved Chassis Attitude Stabilization
Authors:
Zilin Feng,
Min Yu,
Simos A. Evangelou,
Imad M Jaimoukha,
Daniele Dini
Abstract:
PID control is commonly utilized in an active suspension system to achieve desirable chassis attitude, where, due to delays, feedback information has much difficulty regulating the roll and pitch behavior, and stabilizing the chassis attitude, which may result in roll over when the vehicle steers at a large longitudinal velocity. To address the problem of the feedback delays in chassis attitude stabilization, in this paper, a feedforward control strategy is proposed to combine with a previously developed PID control scheme in the recently introduced Parallel Active Link Suspension (PALS). Numerical simulations with a nonlinear multi-body vehicle model are performed, where a set of ISO driving maneuvers are tested. Results demonstrate the feedforward-based control scheme has improved suspension performance as compared to the conventional PID control, with faster speed of convergence in brake in a turn and step steer maneuvers, and surviving the fishhook maneuver (although displaying two-wheel lift-off) with 50 mph maneuver entrance speed at which conventional PID control rolls over.
Submitted 8 March, 2022;
originally announced March 2022.
-
Mu-synthesis PID Control of Full-Car with Parallel Active Link Suspension Under Variable Payload
Authors:
Zilin Feng,
Min Yu,
Simos A. Evangelou,
Imad M Jaimoukha,
Daniele Dini
Abstract:
This paper presents a combined mu-synthesis PID control scheme, employing a frequency separation paradigm, for a recently proposed novel active suspension, the Parallel Active Link Suspension (PALS). The developed mu-synthesis control scheme is superior to the conventional H-infinity control, previously designed for the PALS, in terms of ride comfort and road holding (higher frequency dynamics), with important realistic uncertainties, such as in vehicle payload, taken into account. The developed PID control method is applied to guarantee good chassis attitude control capabilities and minimization of pitch and roll motions (low frequency dynamics). A multi-objective control method, which merges the aforementioned PID and mu-synthesis-based controls is further introduced to achieve simultaneously the low frequency mitigation of attitude motions and the high frequency vibration suppression of the vehicle. A seven-degree-of-freedom Sport Utility Vehicle (SUV) full car model with PALS, is employed in this work to test the synthesized controller by nonlinear simulations with different ISO-defined road events and variable vehicle payload. The results demonstrate the control scheme's significant robustness and performance, as compared to the conventional passive suspension as well as the actively controlled PALS by conventional H-infinity control, achieved for a wide range of vehicle payload considered in the investigation.
Submitted 8 March, 2022;
originally announced March 2022.
-
Information Prebuilt Recurrent Reconstruction Network for Video Super-Resolution
Authors:
Shuyun Wang,
Ming Yu,
Cuihong Xue,
Yingchun Guo,
Gang Yan
Abstract:
The video super-resolution (VSR) method based on the recurrent convolutional network has strong temporal modeling capability for video sequences. However, the temporal receptive field of different recurrent units in the unidirectional recurrent network is unbalanced. Earlier reconstructed frames receive less spatio-temporal information, resulting in blurring or artifacts. Although the bidirectional recurrent network can alleviate this problem, it requires more memory and is unsuitable for many tasks with low latency requirements. To solve the above problems, we propose an end-to-end information prebuilt recurrent reconstruction network (IPRRN), consisting of an information prebuilt network (IPNet) and a recurrent reconstruction network (RRNet). By integrating sufficient information from the front of the video to build the hidden state needed by the initial recurrent unit to help restore the earlier frames, the information prebuilt network balances the input information difference at different time steps. In addition, we present an efficient recurrent reconstruction network, which outperforms the existing unidirectional recurrent schemes in all aspects. Extensive experiments verify the effectiveness of the proposed network, which achieves better quantitative and qualitative evaluation performance than existing state-of-the-art methods.
Submitted 2 February, 2023; v1 submitted 10 December, 2021;
originally announced December 2021.
-
Joint Modeling of Code-Switched and Monolingual ASR via Conditional Factorization
Authors:
Brian Yan,
Chunlei Zhang,
Meng Yu,
Shi-Xiong Zhang,
Siddharth Dalmia,
Dan Berrebbi,
Chao Weng,
Shinji Watanabe,
Dong Yu
Abstract:
Conversational bilingual speech encompasses three types of utterances: two purely monolingual types and one intra-sententially code-switched type. In this work, we propose a general framework to jointly model the likelihoods of the monolingual and code-switch sub-tasks that comprise bilingual speech recognition. By defining the monolingual sub-tasks with label-to-frame synchronization, our joint modeling framework can be conditionally factorized such that the final bilingual output, which may or may not be code-switched, is obtained given only monolingual information. We show that this conditionally factorized joint framework can be modeled by an end-to-end differentiable neural network. We demonstrate the efficacy of our proposed model on bilingual Mandarin-English speech recognition across both monolingual and code-switched corpora.
Submitted 29 November, 2021;
originally announced November 2021.
-
Joint Neural AEC and Beamforming with Double-Talk Detection
Authors:
Vinay Kothapally,
Yong Xu,
Meng Yu,
Shi-Xiong Zhang,
Dong Yu
Abstract:
Acoustic echo cancellation (AEC) in full-duplex communication systems eliminates acoustic feedback. However, nonlinear distortions induced by audio devices, background noise, reverberation, and double-talk reduce the efficiency of conventional AEC systems. Several hybrid AEC models have been proposed to address this, which use deep learning models to suppress residual echo from standard adaptive filtering. This paper proposes a deep learning-based joint AEC and beamforming model (JAECBF) building on our previous self-attentive recurrent neural network (RNN) beamformer. The proposed network consists of two modules: (i) a multi-channel neural-AEC, and (ii) a joint AEC-RNN beamformer with double-talk detection (DTD) that computes time-frequency (T-F) beamforming weights. We train the proposed model end to end to eliminate background noise and echoes from far-end audio devices, which include nonlinear distortions. In experimental evaluations, the proposed network outperforms other multi-channel AEC and denoising systems in terms of speech recognition rate and overall speech quality.
Submitted 27 June, 2022; v1 submitted 8 November, 2021;
originally announced November 2021.
-
Zero-CPU Collection with Direct Telemetry Access
Authors:
Jonatan Langlet,
Ran Ben Basat,
Sivaramakrishnan Ramanathan,
Gabriele Oliaro,
Michael Mitzenmacher,
Minlan Yu,
Gianni Antichi
Abstract:
Programmable switches are driving a massive increase in fine-grained measurements. This puts significant pressure on telemetry collectors that have to process reports from many switches. Past research acknowledged this problem by either improving collectors' stack performance or by limiting the amount of data sent from switches. In this paper, we take a different and radical approach: switches are responsible for directly inserting queryable telemetry data into the collectors' memory, bypassing their CPU, and thereby improving their collection scalability. We propose to use a method we call direct telemetry access, where switches jointly write telemetry reports directly into the same collector's memory region, without coordination. Our solution, DART, is probabilistic, trading memory redundancy and query success probability for CPU resources at collectors. We prototype DART using commodity hardware such as P4 switches and RDMA NICs and show that we get high query success rates with a reasonable memory overhead. For example, we can collect INT path tracing information on a fat tree topology without a collector's CPU involvement while achieving 99.9% query success probability and using just 300 bytes per flow.
Submitted 11 October, 2021;
originally announced October 2021.
-
FAST-RIR: Fast neural diffuse room impulse response generator
Authors:
Anton Ratnarajah,
Shi-Xiong Zhang,
Meng Yu,
Zhenyu Tang,
Dinesh Manocha,
Dong Yu
Abstract:
We present a neural-network-based fast diffuse room impulse response generator (FAST-RIR) for generating room impulse responses (RIRs) for a given acoustic environment. Our FAST-RIR takes rectangular room dimensions, listener and speaker positions, and reverberation time as inputs and generates specular and diffuse reflections for a given acoustic environment. Our FAST-RIR is capable of generating RIRs for a given input reverberation time with an average error of 0.02s. We evaluate our generated RIRs in automatic speech recognition (ASR) applications using Google Speech API, Microsoft Speech API, and Kaldi tools. We show that our proposed FAST-RIR with batch size 1 is 400 times faster than a state-of-the-art diffuse acoustic simulator (DAS) on a CPU and gives similar performance to DAS in ASR experiments. Our FAST-RIR is 12 times faster than an existing GPU-based RIR generator (gpuRIR). We show that our FAST-RIR outperforms gpuRIR by 2.5% in an AMI far-field ASR benchmark.
Submitted 5 February, 2022; v1 submitted 7 October, 2021;
originally announced October 2021.
-
Cross-Modal Knowledge Distillation Method for Automatic Cued Speech Recognition
Authors:
Jianrong Wang,
Ziyue Tang,
Xuewei Li,
Mei Yu,
Qiang Fang,
Li Liu
Abstract:
Cued Speech (CS) is a visual communication system for deaf or hearing-impaired people. It combines lip movements with hand cues to obtain a complete phonetic repertoire. Current deep learning based methods for automatic CS recognition suffer from a common problem: data scarcity. Until now, there have been only two public single-speaker datasets, for French (238 sentences) and British English (97 sentences). In this work, we propose a cross-modal knowledge distillation method with a teacher-student structure, which transfers audio speech information to CS to overcome the limited data problem. First, we pretrain a teacher model for CS recognition with a large amount of open source audio speech data, and simultaneously pretrain the feature extractors for lips and hands using CS data. Then, we distill the knowledge from the teacher model to the student model with frame-level and sequence-level distillation strategies. Importantly, at the frame level, we exploit multi-task learning to weigh losses automatically and obtain the balance coefficient. In addition, we establish a five-speaker British English CS dataset for the first time. The proposed method is evaluated on French and British English CS datasets, showing superior CS recognition performance to the state-of-the-art (SOTA) by a large margin.
Submitted 25 June, 2021;
originally announced June 2021.
-
MIMO Self-attentive RNN Beamformer for Multi-speaker Speech Separation
Authors:
Xiyun Li,
Yong Xu,
Meng Yu,
Shi-Xiong Zhang,
Jiaming Xu,
Bo Xu,
Dong Yu
Abstract:
Recently, our proposed recurrent neural network (RNN) based all deep learning minimum variance distortionless response (ADL-MVDR) beamformer yielded superior performance over the conventional MVDR by replacing the matrix inversion and eigenvalue decomposition with two recurrent neural networks. In this work, we present a self-attentive RNN beamformer that further improves our previous RNN-based beamformer by leveraging the powerful modeling capability of self-attention. A temporal-spatial self-attention module is proposed to better learn the beamforming weights from the speech and noise spatial covariance matrices. The temporal self-attention module can help the RNN learn global statistics of the covariance matrices. The spatial self-attention module is designed to attend to the cross-channel correlation in the covariance matrices. Furthermore, a multi-channel input with multi-speaker directional features and multi-speaker speech separation outputs (MIMO) model is developed to improve the inference efficiency. The evaluations demonstrate that our proposed MIMO self-attentive RNN beamformer improves both automatic speech recognition (ASR) accuracy and the perceptual evaluation of speech quality (PESQ) over prior art.
Submitted 26 April, 2021; v1 submitted 17 April, 2021;
originally announced April 2021.
-
MetricNet: Towards Improved Modeling For Non-Intrusive Speech Quality Assessment
Authors:
Meng Yu,
Chunlei Zhang,
Yong Xu,
Shixiong Zhang,
Dong Yu
Abstract:
The objective speech quality assessment is usually conducted by comparing received speech signal with its clean reference, while human beings are capable of evaluating the speech quality without any reference, such as in the mean opinion score (MOS) tests. Non-intrusive speech quality assessment has attracted much attention recently due to the lack of access to clean reference signals for objective evaluations in real scenarios. In this paper, we propose a novel non-intrusive speech quality measurement model, MetricNet, which leverages label distribution learning and joint speech reconstruction learning to achieve significantly improved performance compared to the existing non-intrusive speech quality measurement models. We demonstrate that the proposed approach yields promisingly high correlation to the intrusive objective evaluation of speech quality on clean, noisy and processed speech data.
Submitted 2 April, 2021;
originally announced April 2021.
-
TeCANet: Temporal-Contextual Attention Network for Environment-Aware Speech Dereverberation
Authors:
Helin Wang,
Bo Wu,
Lianwu Chen,
Meng Yu,
Jianwei Yu,
Yong Xu,
Shi-Xiong Zhang,
Chao Weng,
Dan Su,
Dong Yu
Abstract:
In this paper, we explore an effective way to leverage contextual information to improve speech dereverberation performance in real-world reverberant environments. We propose a temporal-contextual attention approach on the deep neural network (DNN) for environment-aware speech dereverberation, which can adaptively attend to the contextual information. More specifically, a FullBand based Temporal Attention approach (FTA) is proposed, which models the correlations between the fullband information of the context frames. In addition, considering the difference between the attenuation of high frequency bands and low frequency bands (high frequency bands attenuate faster than low frequency bands) in the room impulse response (RIR), we also propose a SubBand based Temporal Attention approach (STA). In order to guide the network to be more aware of the reverberant environments, we jointly optimize the dereverberation network and the reverberation time (RT60) estimator in a multi-task manner. Our experimental results indicate that the proposed method outperforms our previously proposed reverberation-time-aware DNN and that the learned attention weights are fully physically consistent. We also report a preliminary yet promising dereverberation and recognition experiment on real test data.
Submitted 26 August, 2021; v1 submitted 31 March, 2021;
originally announced March 2021.
-
Towards Robust Speaker Verification with Target Speaker Enhancement
Authors:
Chunlei Zhang,
Meng Yu,
Chao Weng,
Dong Yu
Abstract:
This paper proposes the target speaker enhancement based speaker verification network (TASE-SVNet), an all neural model that couples target speaker enhancement and speaker embedding extraction for robust speaker verification (SV). Specifically, an enrollment speaker conditioned speech enhancement module is employed as the front-end for extracting target speaker from its mixture with interfering speakers and environmental noises. Compared with the conventional target speaker enhancement models, nontarget speaker/interference suppression should draw additional attention for SV. Therefore, an effective nontarget speaker sampling strategy is explored. To improve speaker embedding extraction with a light-weighted model, a teacher-student (T/S) training is proposed to distill speaker discriminative information from large models to small models. Iterative inference is investigated to address the noisy speaker enrollment problem. We evaluate the proposed method on two SV tasks, i.e., one heavily overlapped speech and the other one with comprehensive noise types in vehicle environments. Experiments show significant and consistent improvements in Equal Error Rate (EER) over the state-of-the-art baselines.
Submitted 15 March, 2021;
originally announced March 2021.
-
Deep Learning based Multi-Source Localization with Source Splitting and its Effectiveness in Multi-Talker Speech Recognition
Authors:
Aswin Shanmugam Subramanian,
Chao Weng,
Shinji Watanabe,
Meng Yu,
Dong Yu
Abstract:
Multi-source localization is an important and challenging technique for multi-talker conversation analysis. This paper proposes a novel supervised learning method using deep neural networks to estimate the direction of arrival (DOA) of all the speakers simultaneously from the audio mixture. At the heart of the proposal is a source splitting mechanism that creates source-specific intermediate representations inside the network. This allows our model to give source-specific posteriors as the output unlike the traditional multi-label classification approach. Existing deep learning methods perform a frame level prediction, whereas our approach performs an utterance level prediction by incorporating temporal selection and averaging inside the network to avoid post-processing. We also experiment with various loss functions and show that a variant of earth mover distance (EMD) is very effective in classifying DOA at a very high resolution by modeling inter-class relationships. In addition to using the prediction error as a metric for evaluating our localization model, we also establish its potency as a frontend with automatic speech recognition (ASR) as the downstream task. We convert the estimated DOAs into a feature suitable for ASR and pass it as an additional input feature to a strong multi-channel and multi-talker speech recognition baseline. This added input feature drastically improves the ASR performance and gives a word error rate (WER) of 6.3% on the evaluation data of our simulated noisy two speaker mixtures, while the baseline which doesn't use explicit localization input has a WER of 11.5%. We also perform ASR evaluation on real recordings with the overlapped set of the MC-WSJ-AV corpus in addition to simulated mixtures.
Submitted 28 November, 2021; v1 submitted 15 February, 2021;
originally announced February 2021.
-
Generalized Spatio-Temporal RNN Beamformer for Target Speech Separation
Authors:
Yong Xu,
Zhuohuang Zhang,
Meng Yu,
Shi-Xiong Zhang,
Dong Yu
Abstract:
Although the conventional mask-based minimum variance distortionless response (MVDR) could reduce the non-linear distortion, the residual noise level of the MVDR separated speech is still high. In this paper, we propose a spatio-temporal recurrent neural network based beamformer (RNN-BF) for target speech separation. This new beamforming framework directly learns the beamforming weights from the estimated speech and noise spatial covariance matrices. Leveraging on the temporal modeling capability of RNNs, the RNN-BF could automatically accumulate the statistics of the speech and noise covariance matrices to learn the frame-level beamforming weights in a recursive way. An RNN-based generalized eigenvalue (RNN-GEV) beamformer and a more generalized RNN beamformer (GRNN-BF) are proposed. We further improve the RNN-GEV and the GRNN-BF by using layer normalization to replace the commonly used mask normalization on the covariance matrices. The proposed GRNN-BF obtains better performance against prior arts in terms of speech quality (PESQ), speech-to-noise ratio (SNR) and word error rate (WER).
Submitted 3 April, 2021; v1 submitted 4 January, 2021;
originally announced January 2021.
-
Multi-channel Multi-frame ADL-MVDR for Target Speech Separation
Authors:
Zhuohuang Zhang,
Yong Xu,
Meng Yu,
Shi-Xiong Zhang,
Lianwu Chen,
Donald S. Williamson,
Dong Yu
Abstract:
Many purely neural network based speech separation approaches have been proposed to improve objective assessment scores, but they often introduce nonlinear distortions that are harmful to modern automatic speech recognition (ASR) systems. Minimum variance distortionless response (MVDR) filters are often adopted to remove nonlinear distortions, however, conventional neural mask-based MVDR systems still result in relatively high levels of residual noise. Moreover, the matrix inverse involved in the MVDR solution is sometimes numerically unstable during joint training with neural networks. In this study, we propose a multi-channel multi-frame (MCMF) all deep learning (ADL)-MVDR approach for target speech separation, which extends our preliminary multi-channel ADL-MVDR approach. The proposed MCMF ADL-MVDR system addresses linear and nonlinear distortions. Spatio-temporal cross correlations are also fully utilized in the proposed approach. The proposed systems are evaluated using a Mandarin audio-visual corpus and are compared with several state-of-the-art approaches. Experimental results demonstrate the superiority of our proposed systems under different scenarios and across several objective evaluation metrics, including ASR performance.
Submitted 15 November, 2021; v1 submitted 24 December, 2020;
originally announced December 2020.
-
Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning
Authors:
Wei Xia,
Chunlei Zhang,
Chao Weng,
Meng Yu,
Dong Yu
Abstract:
In this study, we investigate self-supervised representation learning for speaker verification (SV). First, we examine a simple contrastive learning approach (SimCLR) with a momentum contrastive (MoCo) learning framework, where the MoCo speaker embedding system utilizes a queue to maintain a large set of negative examples. We show that better speaker embeddings can be learned by momentum contrastive learning. Next, alternative augmentation strategies are explored to normalize extrinsic speaker variabilities of two random segments from the same speech utterance. Specifically, augmentation in the waveform largely improves the speaker representations for SV tasks. The proposed MoCo speaker embedding is further improved when a prototypical memory bank is introduced, which encourages the speaker embeddings to be closer to their assigned prototypes with an intermediate clustering step. In addition, we generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled. Comprehensive experiments on the Voxceleb dataset demonstrate that our proposed self-supervised approach achieves competitive performance compared with existing techniques, and can approach fully supervised results with partially labeled data.
Submitted 14 February, 2021; v1 submitted 13 December, 2020;
originally announced December 2020.