Search | arXiv e-print repository

CTC-based Non-autoregressive Textless Speech-to-Speech Translation

Authors: Qingkai Fang, Zhengrui Ma, Yan Zhou, Min Zhang, Yang Feng

Abstract: Direct speech-to-speech translation (S2ST) has achieved impressive translation quality, but it often faces the challenge of slow decoding due to the considerable length of speech sequences. Recently, some research has turned to non-autoregressive (NAR) models to expedite decoding, yet the translation quality typically lags behind autoregressive (AR) models significantly. In this paper, we investig… ▽ More Direct speech-to-speech translation (S2ST) has achieved impressive translation quality, but it often faces the challenge of slow decoding due to the considerable length of speech sequences. Recently, some research has turned to non-autoregressive (NAR) models to expedite decoding, yet the translation quality typically lags behind autoregressive (AR) models significantly. In this paper, we investigate the performance of CTC-based NAR models in S2ST, as these models have shown impressive results in machine translation. Experimental results demonstrate that by combining pretraining, knowledge distillation, and advanced NAR training techniques such as glancing training and non-monotonic latent alignments, CTC-based NAR models achieve translation quality comparable to the AR model, while preserving up to 26.81$\times$ decoding speedup. △ Less

Submitted 11 June, 2024; originally announced June 2024.

Comments: ACL 2024 Findings

ACM Class: I.2.7

arXiv:2406.07289 [pdf, other]

Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?

Authors: Qingkai Fang, Shaolei Zhang, Zhengrui Ma, Min Zhang, Yang Feng

Abstract: Recently proposed two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS) within an end-to-end model, yielding promising results. However, the training of these models still relies on parallel speech data, which is extremely challenging to collect. In contrast, S2TT and TTS have accumulated a large amount of data… ▽ More Recently proposed two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS) within an end-to-end model, yielding promising results. However, the training of these models still relies on parallel speech data, which is extremely challenging to collect. In contrast, S2TT and TTS have accumulated a large amount of data and pretrained models, which have not been fully utilized in the development of S2ST models. Inspired by this, in this paper, we first introduce a composite S2ST model named ComSpeech, which can seamlessly integrate any pretrained S2TT and TTS models into a direct S2ST model. Furthermore, to eliminate the reliance on parallel speech data, we propose a novel training method ComSpeech-ZS that solely utilizes S2TT and TTS data. It aligns representations in the latent space through contrastive learning, enabling the speech synthesis capability learned from the TTS data to generalize to S2ST in a zero-shot manner. Experimental results on the CVSS dataset show that when the parallel speech data is available, ComSpeech surpasses previous two-pass models like UnitY and Translatotron 2 in both translation quality and decoding speed. When there is no parallel speech data, ComSpeech-ZS lags behind \name by only 0.7 ASR-BLEU and outperforms the cascaded models. △ Less

Submitted 11 June, 2024; originally announced June 2024.

Comments: ACL 2024 main conference. Project Page: https://ictnlp.github.io/ComSpeech-Site/

ACM Class: I.2.7

arXiv:2406.06937 [pdf, other]

A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Any Translation

Authors: Zhengrui Ma, Qingkai Fang, Shaolei Zhang, Shoutao Guo, Yang Feng, Min Zhang

Abstract: Simultaneous translation models play a crucial role in facilitating communication. However, existing research primarily focuses on text-to-text or speech-to-text models, necessitating additional cascade components to achieve speech-to-speech translation. These pipeline methods suffer from error propagation and accumulate delays in each cascade component, resulting in reduced synchronization betwee… ▽ More Simultaneous translation models play a crucial role in facilitating communication. However, existing research primarily focuses on text-to-text or speech-to-text models, necessitating additional cascade components to achieve speech-to-speech translation. These pipeline methods suffer from error propagation and accumulate delays in each cascade component, resulting in reduced synchronization between the speaker and listener. To overcome these challenges, we propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X), which integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework. We develop a non-autoregressive decoder capable of concurrently generating multiple text or acoustic unit tokens upon receiving fixed-length speech chunks. The decoder can generate blank or repeated tokens and employ CTC decoding to dynamically adjust its latency. Experimental results show that NAST-S2X outperforms state-of-the-art models in both speech-to-text and speech-to-speech tasks. It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation. △ Less

Submitted 11 June, 2024; originally announced June 2024.

Comments: ACL 2024; Codes and demos are at https://github.com/ictnlp/NAST-S2x

arXiv:2406.03049 [pdf, other]

StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning

Authors: Shaolei Zhang, Qingkai Fang, Shoutao Guo, Zhengrui Ma, Min Zhang, Yang Feng

Abstract: Simultaneous speech-to-speech translation (Simul-S2ST, a.k.a streaming speech translation) outputs target speech while receiving streaming speech inputs, which is critical for real-time communication. Beyond accomplishing translation between speech, Simul-S2ST requires a policy to control the model to generate corresponding target speech at the opportune moment within speech inputs, thereby posing… ▽ More Simultaneous speech-to-speech translation (Simul-S2ST, a.k.a streaming speech translation) outputs target speech while receiving streaming speech inputs, which is critical for real-time communication. Beyond accomplishing translation between speech, Simul-S2ST requires a policy to control the model to generate corresponding target speech at the opportune moment within speech inputs, thereby posing a double challenge of translation and policy. In this paper, we propose StreamSpeech, a direct Simul-S2ST model that jointly learns translation and simultaneous policy in a unified framework of multi-task learning. Adhering to a multi-task learning approach, StreamSpeech can perform offline and simultaneous speech recognition, speech translation and speech synthesis via an "All-in-One" seamless model. Experiments on CVSS benchmark demonstrate that StreamSpeech achieves state-of-the-art performance in both offline S2ST and Simul-S2ST tasks. Besides, StreamSpeech is able to present high-quality intermediate results (i.e., ASR or translation results) during simultaneous translation process, offering a more comprehensive real-time communication experience. △ Less

Submitted 5 June, 2024; originally announced June 2024.

Comments: Accepted to ACL 2024 main conference, Project Page: https://ictnlp.github.io/StreamSpeech-site/

arXiv:2310.07403 [pdf, other]

DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation

Authors: Qingkai Fang, Yan Zhou, Yang Feng

Abstract: Direct speech-to-speech translation (S2ST) translates speech from one language into another using a single model. However, due to the presence of linguistic and acoustic diversity, the target speech follows a complex multimodal distribution, posing challenges to achieving both high-quality translations and fast decoding speeds for S2ST models. In this paper, we propose DASpeech, a non-autoregressi… ▽ More Direct speech-to-speech translation (S2ST) translates speech from one language into another using a single model. However, due to the presence of linguistic and acoustic diversity, the target speech follows a complex multimodal distribution, posing challenges to achieving both high-quality translations and fast decoding speeds for S2ST models. In this paper, we propose DASpeech, a non-autoregressive direct S2ST model which realizes both fast and high-quality S2ST. To better capture the complex distribution of the target speech, DASpeech adopts the two-pass architecture to decompose the generation process into two steps, where a linguistic decoder first generates the target text, and an acoustic decoder then generates the target speech based on the hidden states of the linguistic decoder. Specifically, we use the decoder of DA-Transformer as the linguistic decoder, and use FastSpeech 2 as the acoustic decoder. DA-Transformer models translations with a directed acyclic graph (DAG). To consider all potential paths in the DAG during training, we calculate the expected hidden states for each target token via dynamic programming, and feed them into the acoustic decoder to predict the target mel-spectrogram. During inference, we select the most probable path and take hidden states on that path as input to the acoustic decoder. Experiments on the CVSS Fr-En benchmark demonstrate that DASpeech can achieve comparable or even better performance than the state-of-the-art S2ST model Translatotron 2, while preserving up to 18.53x speedup compared to the autoregressive baseline. Compared with the previous non-autoregressive S2ST model, DASpeech does not rely on knowledge distillation and iterative decoding, achieving significant improvements in both translation quality and decoding speed. Furthermore, DASpeech shows the ability to preserve the speaker's voice of the source speech during translation. △ Less

Submitted 11 October, 2023; originally announced October 2023.

Comments: NeurIPS 2023. Audio samples are available at https://ictnlp.github.io/daspeech-demo/

ACM Class: I.2.7

arXiv:2306.05281

A Graph Reconstruction by Dynamic Signal Coefficient for Fault Classification

Authors: Wenbin He, Jianxu Mao, Yaonan Wang, Zhe Li, Qiu Fang, Haotian Wu

Abstract: To improve the performance in identifying the faults under strong noise for rotating machinery, this paper presents a dynamic feature reconstruction signal graph method, which plays the key role of the proposed end-to-end fault diagnosis model. Specifically, the original mechanical signal is first decomposed by wavelet packet decomposition (WPD) to obtain multiple subbands including coefficient ma… ▽ More To improve the performance in identifying the faults under strong noise for rotating machinery, this paper presents a dynamic feature reconstruction signal graph method, which plays the key role of the proposed end-to-end fault diagnosis model. Specifically, the original mechanical signal is first decomposed by wavelet packet decomposition (WPD) to obtain multiple subbands including coefficient matrix. Then, with originally defined two feature extraction factors MDD and DDD, a dynamic feature selection method based on L2 energy norm (DFSL) is proposed, which can dynamically select the feature coefficient matrix of WPD based on the difference in the distribution of norm energy, enabling each sub-signal to take adaptive signal reconstruction. Next the coefficient matrices of the optimal feature sub-bands are reconstructed and reorganized to obtain the feature signal graphs. Finally, deep features are extracted from the feature signal graphs by 2D-Convolutional neural network (2D-CNN). Experimental results on a public data platform of a bearing and our laboratory platform of robot grinding show that this method is better than the existing methods under different noise intensities. △ Less

Submitted 29 September, 2023; v1 submitted 30 May, 2023; originally announced June 2023.

Comments: The feature extraction algorithm DFSL has errors in derivation and experimental deficiencies

arXiv:2305.14635 [pdf, other]

CMOT: Cross-modal Mixup via Optimal Transport for Speech Translation

Authors: Yan Zhou, Qingkai Fang, Yang Feng

Abstract: End-to-end speech translation (ST) is the task of translating speech signals in the source language into text in the target language. As a cross-modal task, end-to-end ST is difficult to train with limited data. Existing methods often try to transfer knowledge from machine translation (MT), but their performances are restricted by the modality gap between speech and text. In this paper, we propose… ▽ More End-to-end speech translation (ST) is the task of translating speech signals in the source language into text in the target language. As a cross-modal task, end-to-end ST is difficult to train with limited data. Existing methods often try to transfer knowledge from machine translation (MT), but their performances are restricted by the modality gap between speech and text. In this paper, we propose Cross-modal Mixup via Optimal Transport CMOT to overcome the modality gap. We find the alignment between speech and text sequences via optimal transport and then mix up the sequences from different modalities at a token level using the alignment. Experiments on the MuST-C ST benchmark demonstrate that CMOT achieves an average BLEU of 30.0 in 8 translation directions, outperforming previous methods. Further analysis shows CMOT can adaptively find the alignment between modalities, which helps alleviate the modality gap between speech and text. Code is publicly available at https://github.com/ictnlp/CMOT. △ Less

Submitted 25 May, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

Comments: ACL 2023 main conference

arXiv:2305.08709 [pdf, other]

Back Translation for Speech-to-text Translation Without Transcripts

Authors: Qingkai Fang, Yang Feng

Abstract: The success of end-to-end speech-to-text translation (ST) is often achieved by utilizing source transcripts, e.g., by pre-training with automatic speech recognition (ASR) and machine translation (MT) tasks, or by introducing additional ASR and MT data. Unfortunately, transcripts are only sometimes available since numerous unwritten languages exist worldwide. In this paper, we aim to utilize large… ▽ More The success of end-to-end speech-to-text translation (ST) is often achieved by utilizing source transcripts, e.g., by pre-training with automatic speech recognition (ASR) and machine translation (MT) tasks, or by introducing additional ASR and MT data. Unfortunately, transcripts are only sometimes available since numerous unwritten languages exist worldwide. In this paper, we aim to utilize large amounts of target-side monolingual data to enhance ST without transcripts. Motivated by the remarkable success of back translation in MT, we develop a back translation algorithm for ST (BT4ST) to synthesize pseudo ST data from monolingual target data. To ease the challenges posed by short-to-long generation and one-to-many map**, we introduce self-supervised discrete units and achieve back translation by cascading a target-to-unit model and a unit-to-speech model. With our synthetic ST data, we achieve an average boost of 2.3 BLEU on MuST-C En-De, En-Fr, and En-Es datasets. More experiments show that our method is especially effective in low-resource scenarios. △ Less

Submitted 15 May, 2023; originally announced May 2023.

Comments: ACL 2023 main conference

ACM Class: I.2.7

arXiv:2305.08706 [pdf, other]

Understanding and Bridging the Modality Gap for Speech Translation

Authors: Qingkai Fang, Yang Feng

Abstract: How to achieve better end-to-end speech translation (ST) by leveraging (text) machine translation (MT) data? Among various existing techniques, multi-task learning is one of the effective ways to share knowledge between ST and MT in which additional MT data can help to learn source-to-target map**. However, due to the differences between speech and text, there is always a gap between ST and MT.… ▽ More How to achieve better end-to-end speech translation (ST) by leveraging (text) machine translation (MT) data? Among various existing techniques, multi-task learning is one of the effective ways to share knowledge between ST and MT in which additional MT data can help to learn source-to-target map**. However, due to the differences between speech and text, there is always a gap between ST and MT. In this paper, we first aim to understand this modality gap from the target-side representation differences, and link the modality gap to another well-known problem in neural machine translation: exposure bias. We find that the modality gap is relatively small during training except for some difficult cases, but keeps increasing during inference due to the cascading effect. To address these problems, we propose the Cross-modal Regularization with Scheduled Sampling (Cress) method. Specifically, we regularize the output predictions of ST and MT, whose target-side contexts are derived by sampling between ground truth words and self-generated words with a varying probability. Furthermore, we introduce token-level adaptive training which assigns different training weights to target tokens to handle difficult cases with large modality gaps. Experiments and analysis show that our approach effectively bridges the modality gap, and achieves promising results in all eight directions of the MuST-C dataset. △ Less

Submitted 15 May, 2023; originally announced May 2023.

Comments: ACL 2023 main conference

ACM Class: I.2.7

arXiv:2302.13273 [pdf, other]

Two-Stream Joint-Training for Speaker Independent Acoustic-to-Articulatory Inversion

Authors: Jianrong Wang, **yu Liu, Li Liu, Xuewei Li, Mei Yu, Jie Gao, Qiang Fang

Abstract: Acoustic-to-articulatory inversion (AAI) aims to estimate the parameters of articulators from speech audio. There are two common challenges in AAI, which are the limited data and the unsatisfactory performance in speaker independent scenario. Most current works focus on extracting features directly from speech and ignoring the importance of phoneme information which may limit the performance of AA… ▽ More Acoustic-to-articulatory inversion (AAI) aims to estimate the parameters of articulators from speech audio. There are two common challenges in AAI, which are the limited data and the unsatisfactory performance in speaker independent scenario. Most current works focus on extracting features directly from speech and ignoring the importance of phoneme information which may limit the performance of AAI. To this end, we propose a novel network called SPN that uses two different streams to carry out the AAI task. Firstly, to improve the performance of speaker-independent experiment, we propose a new phoneme stream network to estimate the articulatory parameters as the phoneme features. To the best of our knowledge, this is the first work that extracts the speaker-independent features from phonemes to improve the performance of AAI. Secondly, in order to better represent the speech information, we train a speech stream network to combine the local features and the global features. Compared with state-of-the-art (SOTA), the proposed method reduces 0.18mm on RMSE and increases 6.0% on Pearson correlation coefficient in the speaker-independent experiment. The code has been released at https://github.com/liu**yu123/AAINetwork-SPN. △ Less

Submitted 26 February, 2023; originally announced February 2023.

arXiv:2302.12571 [pdf]

3D PETCT Tumor Lesion Segmentation via GCN Refinement

Authors: Hengzhi Xue, Qingqing Fang, Yudong Yao, Yueyang Teng

Abstract: Whole-body PET/CT scan is an important tool for diagnosing various malignancies (e.g., malignant melanoma, lymphoma, or lung cancer), and accurate segmentation of tumors is a key part for subsequent treatment. In recent years, CNN-based segmentation methods have been extensively investigated. However, these methods often give inaccurate segmentation results, such as over-segmentation and under-seg… ▽ More Whole-body PET/CT scan is an important tool for diagnosing various malignancies (e.g., malignant melanoma, lymphoma, or lung cancer), and accurate segmentation of tumors is a key part for subsequent treatment. In recent years, CNN-based segmentation methods have been extensively investigated. However, these methods often give inaccurate segmentation results, such as over-segmentation and under-segmentation. Therefore, to address such issues, we propose a post-processing method based on a graph convolutional neural network (GCN) to refine inaccurate segmentation parts and improve the overall segmentation accuracy. Firstly, nnUNet is used as an initial segmentation framework, and the uncertainty in the segmentation results is analyzed. Certainty and uncertainty nodes establish the nodes of a graph neural network. Each node and its 6 neighbors form an edge, and 32 nodes are randomly selected for uncertain nodes to form edges. The highly uncertain nodes are taken as the subsequent refinement targets. Secondly, the nnUNet result of the certainty nodes is used as label to form a semi-supervised graph network problem, and the uncertainty part is optimized through training the GCN network to improve the segmentation performance. This describes our proposed nnUNet-GCN segmentation framework. We perform tumor segmentation experiments on the PET/CT dataset in the MICCIA2022 autoPET challenge. Among them, 30 cases are randomly selected for testing, and the experimental results show that the false positive rate is effectively reduced with nnUNet-GCN refinement. In quantitative analysis, there is an improvement of 2.12 % on the average Dice score, 6.34 on 95 % Hausdorff Distance (HD95), and 1.72 on average symmetric surface distance (ASSD). The quantitative and qualitative evaluation results show that GCN post-processing methods can effectively improve tumor segmentation performance. △ Less

Submitted 24 February, 2023; originally announced February 2023.

Comments: 10 pages,5 figures,38 reference

arXiv:2209.07302 [pdf, other]

MVNet: Memory Assistance and Vocal Reinforcement Network for Speech Enhancement

Authors: Jianrong Wang, Xiaomin Li, Xuewei Li, Mei Yu, Qiang Fang, Li Liu

Abstract: Speech enhancement improves speech quality and promotes the performance of various downstream tasks. However, most current speech enhancement work was mainly devoted to improving the performance of downstream automatic speech recognition (ASR), only a relatively small amount of work focused on the automatic speaker verification (ASV) task. In this work, we propose a MVNet consisted of a memory ass… ▽ More Speech enhancement improves speech quality and promotes the performance of various downstream tasks. However, most current speech enhancement work was mainly devoted to improving the performance of downstream automatic speech recognition (ASR), only a relatively small amount of work focused on the automatic speaker verification (ASV) task. In this work, we propose a MVNet consisted of a memory assistance module which improves the performance of downstream ASR and a vocal reinforcement module which boosts the performance of ASV. In addition, we design a new loss function to improve speaker vocal similarity. Experimental results on the Libri2mix dataset show that our method outperforms baseline methods in several metrics, including speech quality, intelligibility, and speaker vocal similarity et al. △ Less

Submitted 15 September, 2022; originally announced September 2022.

Comments: ICONIP 2022

arXiv:2204.01672 [pdf, other]

Residual-guided Personalized Speech Synthesis based on Face Image

Authors: Jianrong Wang, Zixuan Wang, Xiaosheng Hu, Xuewei Li, Qiang Fang, Li Liu

Abstract: Previous works derive personalized speech features by training the model on a large dataset composed of his/her audio sounds. It was reported that face information has a strong link with the speech sound. Thus in this work, we innovatively extract personalized speech features from human faces to synthesize personalized speech using neural vocoder. A Face-based Residual Personalized Speech Synthesi… ▽ More Previous works derive personalized speech features by training the model on a large dataset composed of his/her audio sounds. It was reported that face information has a strong link with the speech sound. Thus in this work, we innovatively extract personalized speech features from human faces to synthesize personalized speech using neural vocoder. A Face-based Residual Personalized Speech Synthesis Model (FR-PSS) containing a speech encoder, a speech synthesizer and a face encoder is designed for PSS. In this model, by designing two speech priors, a residual-guided strategy is introduced to guide the face feature to approach the true speech feature in the training. Moreover, considering the error of feature's absolute values and their directional bias, we formulate a novel tri-item loss function for face encoder. Experimental results show that the speech synthesized by our model is comparable to the personalized speech synthesized by training a large amount of audio data in previous works. △ Less

Submitted 1 April, 2022; originally announced April 2022.

Comments: ICASSP 2022

arXiv:2202.13764 [pdf, ps, other]

A Note on "Optimum Sets of Interference-Free Sequences With Zero Autocorrelation Zone"

Authors: Qi** Fang, Zilong Wang

Abstract: In this paper, a simple construction of interference-free zero correlation zone (IF-ZCZ) sequence sets is proposed by well designed finite Zak transform lattice tessellation. Each set is characterized by the period of sequences $KM$, the set size $K$ and the length of zero correlation zone $M-1$, which is optimal with respect to the Tang-Fan-Matsufuji bound. Secondly, the transformations that keep… ▽ More In this paper, a simple construction of interference-free zero correlation zone (IF-ZCZ) sequence sets is proposed by well designed finite Zak transform lattice tessellation. Each set is characterized by the period of sequences $KM$, the set size $K$ and the length of zero correlation zone $M-1$, which is optimal with respect to the Tang-Fan-Matsufuji bound. Secondly, the transformations that keep the properties of the optimal IF-ZCZ sequence set unchanged are given, and the equivalent relation of the optimal IF-ZCZ sequence set is defined based on these transformations. Then, it is proved that the general construction of the optimal IF-ZCZ sequence set proposed by Popovic is equivalent to the simple construction of the optimal IF-ZCZ sequence set, which indicates that the generation of the optimal IF-ZCZ sequence set can be simplified. Moreover, it is pointed out that the alphabet size for the special case of the simple construction of the optimal IF-ZCZ sequence set can be a factor of the period. Finally, both the simple construction of the optimal IF-ZCZ sequence set and its special case have sparse and highly structured Zak spectra, which can greatly reduce the computational complexity of implementing matched filter banks. △ Less

Submitted 28 February, 2022; originally announced February 2022.

Comments: arXiv admin note: substantial text overlap with arXiv:1912.09781

arXiv:2112.02991 [pdf, other]

Cross-Modality Attentive Feature Fusion for Object Detection in Multispectral Remote Sensing Imagery

Authors: Qingyun Fang, Zhaokui Wang

Abstract: Cross-modality fusing complementary information of multispectral remote sensing image pairs can improve the perception ability of detection algorithms, making them more robust and reliable for a wider range of applications, such as nighttime detection. Compared with prior methods, we think different features should be processed specifically, the modality-specific features should be retained and en… ▽ More Cross-modality fusing complementary information of multispectral remote sensing image pairs can improve the perception ability of detection algorithms, making them more robust and reliable for a wider range of applications, such as nighttime detection. Compared with prior methods, we think different features should be processed specifically, the modality-specific features should be retained and enhanced, while the modality-shared features should be cherry-picked from the RGB and thermal IR modalities. Following this idea, a novel and lightweight multispectral feature fusion approach with joint common-modality and differential-modality attentions are proposed, named Cross-Modality Attentive Feature Fusion (CMAFF). Given the intermediate feature maps of RGB and IR images, our module parallel infers attention maps from two separate modalities, common- and differential-modality, then the attention maps are multiplied to the input feature map respectively for adaptive feature enhancement or selection. Extensive experiments demonstrate that our proposed approach can achieve the state-of-the-art performance at a low computation cost. △ Less

Submitted 6 December, 2021; originally announced December 2021.

Comments: 23 pages,11 figures, under consideration at Pattern Recognition

arXiv:2106.13686 [pdf, other]

Cross-Modal Knowledge Distillation Method for Automatic Cued Speech Recognition

Authors: Jianrong Wang, Ziyue Tang, Xuewei Li, Mei Yu, Qiang Fang, Li Liu

Abstract: Cued Speech (CS) is a visual communication system for the deaf or hearing impaired people. It combines lip movements with hand cues to obtain a complete phonetic repertoire. Current deep learning based methods on automatic CS recognition suffer from a common problem, which is the data scarcity. Until now, there are only two public single speaker datasets for French (238 sentences) and British Engl… ▽ More Cued Speech (CS) is a visual communication system for the deaf or hearing impaired people. It combines lip movements with hand cues to obtain a complete phonetic repertoire. Current deep learning based methods on automatic CS recognition suffer from a common problem, which is the data scarcity. Until now, there are only two public single speaker datasets for French (238 sentences) and British English (97 sentences). In this work, we propose a cross-modal knowledge distillation method with teacher-student structure, which transfers audio speech information to CS to overcome the limited data problem. Firstly, we pretrain a teacher model for CS recognition with a large amount of open source audio speech data, and simultaneously pretrain the feature extractors for lips and hands using CS data. Then, we distill the knowledge from teacher model to the student model with frame-level and sequence-level distillation strategies. Importantly, for frame-level, we exploit multi-task learning to weigh losses automatically, to obtain the balance coefficient. Besides, we establish a five-speaker British English CS dataset for the first time. The proposed method is evaluated on French and British English CS datasets, showing superior CS recognition performance to the state-of-the-art (SOTA) by a large margin. △ Less

Submitted 25 June, 2021; originally announced June 2021.

Showing 1–16 of 16 results for author: Fang, Q