-
Evaluation of Connected Vehicle Identification-Aware Mixed Traffic Freeway Cooperative Merging
Authors:
Haoji Liu,
Fatemeh Jahedinia,
Zeyu Mu,
B. Brian Park
Abstract:
Cooperative on-ramp merging control for connected automated vehicles (CAVs) has been extensively investigated. However, prior studies have neglected the connected vehicle identification process, which is a prerequisite for CAV cooperation. In this paper, we introduced a connected vehicle identification system (VIS) into the on-ramp merging control process for the first time and proposed an evaluation framework to assess the impacts of VIS on on-ramp merging performance. First, the mixed-traffic cooperative merging problem was formulated. Then, a real-world merging trajectory dataset was processed to generate dangerous merging scenarios. To resolve the potential collision risks in mixed traffic where CAVs and traditional human-driven vehicles (THVs) coexist, we proposed on-ramp merging strategies for CAVs in different mixed traffic situations, taking the connected vehicle identification process into account. Performance was evaluated via simulations. Results indicated that while safety was assured for all cases with CAVs, the cases with VIS had delayed initiation of cooperation, limiting the range of cooperative merging and leading to increased fuel consumption and acceleration variations.
Submitted 20 May, 2024;
originally announced May 2024.
-
Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction
Authors:
Zhaoxi Mu,
Xinyu Yang
Abstract:
The integration of visual cues has revitalized the performance of the target speech extraction task, elevating it to the forefront of the field. Nevertheless, this multi-modal learning paradigm often encounters the challenge of modality imbalance. In audio-visual target speech extraction tasks, the audio modality tends to dominate, potentially overshadowing the importance of visual guidance. To tackle this issue, we propose AVSepChain, drawing inspiration from the speech chain concept. Our approach partitions the audio-visual target speech extraction task into two stages: speech perception and speech production. In the speech perception stage, audio serves as the dominant modality, while visual information acts as the conditional modality. Conversely, in the speech production stage, the roles are reversed. This transformation of modality status aims to alleviate the problem of modality imbalance. Additionally, we introduce a contrastive semantic matching loss to ensure that the semantic information conveyed by the generated speech aligns with the semantic information conveyed by lip movements during the speech production stage. Through extensive experiments conducted on multiple benchmark datasets for audio-visual target speech extraction, we showcase the superior performance achieved by our proposed method.
Submitted 5 May, 2024; v1 submitted 19 April, 2024;
originally announced April 2024.
-
Self-Supervised Disentangled Representation Learning for Robust Target Speech Extraction
Authors:
Zhaoxi Mu,
Xinyu Yang,
Sining Sun,
Qing Yang
Abstract:
Speech signals are inherently complex as they encompass both global acoustic characteristics and local semantic information. However, in the task of target speech extraction, certain elements of global and local semantic information in the reference speech, which are irrelevant to speaker identity, can lead to speaker confusion within the speech extraction network. To overcome this challenge, we propose a self-supervised disentangled representation learning method. Our approach tackles this issue through a two-phase process, utilizing a reference speech encoding network and a global information disentanglement network to gradually disentangle the speaker identity information from other irrelevant factors. We exclusively employ the disentangled speaker identity information to guide the speech extraction network. Moreover, we introduce the adaptive modulation Transformer to ensure that the acoustic representation of the mixed signal remains undisturbed by the speaker embeddings. This component incorporates speaker embeddings as conditional information, facilitating natural and efficient guidance for the speech extraction network. Experimental results substantiate the effectiveness of our meticulously crafted approach, showcasing a substantial reduction in the likelihood of speaker confusion.
Submitted 19 January, 2024; v1 submitted 15 December, 2023;
originally announced December 2023.
-
Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning
Authors:
Zhaoxi Mu,
Xinyu Yang,
Wenjing Zhu
Abstract:
The Transformer has shown strong performance in speech separation, benefiting from its ability to capture global features. However, capturing local features and channel information of audio sequences in speech separation is equally important. In this paper, we present a novel approach named Intra-SE-Conformer and Inter-Transformer (ISCIT) for speech separation. Specifically, we design a new network, SE-Conformer, that can model audio sequences in multiple dimensions and scales, and apply it to the dual-path speech separation framework. Furthermore, we propose Multi-Block Feature Aggregation to improve the separation effect by selectively utilizing information from the intermediate blocks of the separation network. We also propose a speaker similarity discriminative loss that optimizes the speech separation model to address poor performance when speakers have similar voices. Experimental results on the benchmark datasets WSJ0-2mix and WHAM! show that ISCIT can achieve state-of-the-art results.
Submitted 7 March, 2023;
originally announced March 2023.
-
A Multi-Stage Triple-Path Method for Speech Separation in Noisy and Reverberant Environments
Authors:
Zhaoxi Mu,
Xinyu Yang,
Xiangyuan Yang,
Wenjing Zhu
Abstract:
In noisy and reverberant environments, the performance of deep learning-based speech separation methods drops dramatically because previous methods are not designed and optimized for such situations. To address this issue, we propose a multi-stage end-to-end learning method that decouples the difficult speech separation problem in noisy and reverberant environments into three sub-problems: speech denoising, separation, and de-reverberation. Reducing the solution space in this way improves both the probability and the speed of finding the optimal solution of the speech separation model. Moreover, since the channel information of the audio sequence in the time domain is crucial for speech separation, we propose a triple-path structure capable of modeling the channel dimension of audio sequences. Experimental results show that the proposed multi-stage triple-path method can improve the performance of speech separation models at the cost of only a small increase in model parameters.
Submitted 7 March, 2023;
originally announced March 2023.
-
Wind power ramp prediction algorithm based on wavelet deep belief network
Authors:
Zhenhao Tang,
Qingyu Meng,
Shengxian Cao,
Yang Li,
Zhongha Mu,
Xiaoya Pang
Abstract:
Wind power ramp events significantly threaten power grid safety. To improve the ramp prediction accuracy, a hybrid wavelet deep belief network algorithm with adaptive feature selection (WDBNAFS) is proposed. First, the wind power characteristics are analyzed. Then, wavelet decomposition is applied to the time series, and an adaptive feature selection algorithm is proposed to select the inputs of the prediction model. Finally, a deep belief network is employed to predict the wind power ramp events, and the proposed WDBNAFS is tested in experiments based on practical data. The simulation results demonstrate that the prediction accuracy of the proposed algorithm is more than 90%.
Submitted 10 February, 2022;
originally announced February 2022.
-
Review of end-to-end speech synthesis technology based on deep learning
Authors:
Zhaoxi Mu,
Xinyu Yang,
Yizhuo Dong
Abstract:
As an indispensable part of modern human-computer interaction systems, speech synthesis technology helps users receive the output of intelligent machines more easily and intuitively, and has thus attracted more and more attention. Due to the high complexity and low efficiency of traditional speech synthesis technology, current research focuses on deep learning-based end-to-end speech synthesis technology, which has more powerful modeling ability and a simpler pipeline. It mainly consists of three modules: text front-end, acoustic model, and vocoder. This paper reviews the research status of these three parts, and classifies and compares various methods according to their emphasis. Moreover, this paper also summarizes the open-source speech corpora of English, Chinese and other languages that can be used for speech synthesis tasks, and introduces some commonly used subjective and objective speech quality evaluation methods. Finally, some attractive future research directions are pointed out.
Submitted 20 April, 2021;
originally announced April 2021.
-
Design and Control of a Highly Redundant Rigid-Flexible Coupling Robot to Assist the COVID-19 Oropharyngeal-Swab Sampling
Authors:
Yingbai Hu,
Jian Li,
Yongquan Chen,
Qiwen Wang,
Chuliang Chi,
Heng Zhang,
Qing Gao,
Yuanmin Lan,
Zheng Li,
Zonggao Mu,
Zhenglong Sun,
Alois Knoll
Abstract:
The outbreak of novel coronavirus pneumonia (COVID-19) has caused mortality and morbidity worldwide. Oropharyngeal-swab (OP-swab) sampling is widely used worldwide for the diagnosis of COVID-19. To prevent clinical staff from being affected by the virus, we developed a 9-degree-of-freedom (DOF) rigid-flexible coupling (RFC) robot to assist the COVID-19 OP-swab sampling. This robot is composed of a visual system, a UR5 robot arm, a micro-pneumatic actuator and a force-sensing system. The robot is expected to reduce risk and free clinical staff from long-term repetitive sampling work. Compared with a rigid sampling robot, the developed force-sensing RFC robot can facilitate OP-swab sampling procedures in a safer and softer way. In addition, a varying-parameter zeroing neural network-based optimization method is also proposed for motion planning of the 9-DOF redundant manipulator. The developed robot system is validated by OP-swab sampling on both oral cavity phantoms and volunteers.
Submitted 25 February, 2021;
originally announced February 2021.