-
Revisiting Interpolation Augmentation for Speech-to-Text Generation
Authors:
Chen Xu,
Jie Wang,
Xiaoqian Liu,
Qianqian Dong,
Chunliang Zhang,
Tong Xiao,
Jingbo Zhu,
Dapeng Man,
Wu Yang
Abstract:
Speech-to-text (S2T) generation systems frequently face challenges in low-resource scenarios, primarily due to the lack of extensive labeled datasets. One emerging solution is constructing virtual training samples by interpolating inputs and labels, which has notably enhanced system generalization in other domains. Despite its potential, this technique's application in S2T tasks has remained under-explored. In this paper, we delve into the utility of interpolation augmentation, guided by several pivotal questions. Our findings reveal that employing an appropriate strategy in interpolation augmentation significantly enhances performance across diverse tasks, architectures, and data scales, offering a promising avenue for more robust S2T systems in resource-constrained settings.
Submitted 22 June, 2024;
originally announced June 2024.
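The interpolation idea in the abstract above (building a virtual sample by mixing two inputs and their labels) can be illustrated with a minimal mixup-style sketch. This is not the authors' implementation: the function names `mixup` and `sample_lambda`, the Beta(alpha, alpha) sampling, and the assumption that both inputs are fixed-length feature vectors of equal size are illustrative choices. Real speech inputs vary in length and would need padding or alignment first.

```python
import random


def mixup(x_a, y_a, x_b, y_b, lam):
    """Interpolate two (input, label) pairs with weight lam in [0, 1].

    Hypothetical helper: inputs and labels are plain lists of floats,
    and each pair must have matching lengths.
    """
    if len(x_a) != len(x_b) or len(y_a) != len(y_b):
        raise ValueError("paired vectors must have equal length")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


def sample_lambda(alpha=0.2, rng=random):
    """Draw a mixing weight from Beta(alpha, alpha), a common mixup choice."""
    return rng.betavariate(alpha, alpha)


# Two toy feature vectors with one-hot labels, mixed half and half.
x_mix, y_mix = mixup([1.0, 3.0], [1.0, 0.0], [3.0, 5.0], [0.0, 1.0], lam=0.5)
```

With `lam=0.5` the virtual sample sits midway between the two originals, and its label is a soft target rather than a one-hot vector.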
-
SAM-dPCR: Real-Time and High-throughput Absolute Quantification of Biological Samples Using Zero-Shot Segment Anything Model
Authors:
Yuanyuan Wei,
Shanhang Luo,
Changran Xu,
Yingqi Fu,
Qingyue Dong,
Yi Zhang,
Fuyang Qu,
Guangyao Cheng,
Yi-Ping Ho,
Ho-Pui Ho,
Wu Yuan
Abstract:
Digital PCR (dPCR) has revolutionized nucleic acid diagnostics by enabling absolute quantification of rare mutations and target sequences. However, current detection methodologies face challenges, as flow cytometers are costly and complex, while fluorescence imaging methods, relying on software or manual counting, are time-consuming and prone to errors. To address these limitations, we present SAM-dPCR, a novel self-supervised learning-based pipeline that enables real-time and high-throughput absolute quantification of biological samples. Leveraging the zero-shot SAM model, SAM-dPCR efficiently analyzes diverse microreactors with over 97.7% accuracy within a rapid processing time of 3.16 seconds. By utilizing commonly available lab fluorescence microscopes, SAM-dPCR facilitates the quantification of sample concentrations. The accuracy of SAM-dPCR is validated by the strong linear relationship observed between known and inferred sample concentrations. Additionally, SAM-dPCR demonstrates versatility through comprehensive verification using various samples and reactor morphologies. This accessible, cost-effective tool transcends the limitations of traditional detection methods or fully supervised AI models, marking the first application of SAM in nucleic acid detection or molecular diagnostics. By eliminating the need for annotated training data, SAM-dPCR holds great application potential for nucleic acid quantification in resource-limited settings.
Submitted 22 January, 2024;
originally announced March 2024.
-
Speech Translation with Large Language Models: An Industrial Practice
Authors:
Zhichao Huang,
Rong Ye,
Tom Ko,
Qianqian Dong,
Shanbo Cheng,
Mingxuan Wang,
Hang Li
Abstract:
Given the great success of large language models (LLMs) across various tasks, in this paper, we introduce LLM-ST, a novel and effective speech translation model constructed upon a pre-trained LLM. By integrating the large language model (LLM) with a speech encoder and employing multi-task instruction tuning, LLM-ST can produce accurate timestamped transcriptions and translations, even from long audio inputs. Furthermore, our findings indicate that the implementation of Chain-of-Thought (CoT) prompting can yield advantages in the context of LLM-ST. Through rigorous experimentation on English and Chinese datasets, we showcase the exceptional performance of LLM-ST, establishing a new benchmark in the field of speech translation. Demo: https://speechtranslation.github.io/llm-st/.
Submitted 21 December, 2023;
originally announced December 2023.
-
Bridging the Gaps of Both Modality and Language: Synchronous Bilingual CTC for Speech Translation and Speech Recognition
Authors:
Chen Xu,
Xiaoqian Liu,
Erfeng He,
Yuhao Zhang,
Qianqian Dong,
Tong Xiao,
Jingbo Zhu,
Dapeng Man,
Wu Yang
Abstract:
In this study, we present synchronous bilingual Connectionist Temporal Classification (CTC), an innovative framework that leverages dual CTC to bridge the gaps of both modality and language in the speech translation (ST) task. Utilizing transcript and translation as concurrent objectives for CTC, our model bridges the gap between audio and text as well as between source and target languages. Building upon the recent advances in CTC application, we develop an enhanced variant, BiL-CTC+, that establishes new state-of-the-art performances on the MuST-C ST benchmarks under resource-constrained scenarios. Intriguingly, our method also yields significant improvements in speech recognition performance, revealing the effect of cross-lingual learning on transcription and demonstrating its broad applicability. The source code is available at https://github.com/xuchennlp/S2T.
Submitted 21 September, 2023;
originally announced September 2023.
-
Preserving Tumor Volumes for Unsupervised Medical Image Registration
Authors:
Qihua Dong,
Hao Du,
Ying Song,
Yan Xu,
Jing Liao
Abstract:
Medical image registration is a critical task that estimates the spatial correspondence between pairs of images. However, current traditional and deep-learning-based methods rely on similarity measures to generate a deforming field, which often results in disproportionate volume changes in dissimilar regions, especially in tumor regions. These changes can significantly alter the tumor size and underlying anatomy, which limits the practical use of image registration in clinical diagnosis. To address this issue, we have formulated image registration with tumors as a constraint problem that preserves tumor volumes while maximizing image similarity in other normal regions. Our proposed strategy involves a two-stage process. In the first stage, we use similarity-based registration to identify potential tumor regions by their volume change, generating a soft tumor mask accordingly. In the second stage, we propose a volume-preserving registration with a novel adaptive volume-preserving loss that penalizes the change in size adaptively based on the masks calculated from the previous stage. Our approach balances image similarity and volume preservation in different regions, i.e., normal and tumor regions, by using soft tumor masks to adjust the imposition of volume-preserving loss on each one. This ensures that the tumor volume is preserved during the registration process. We have evaluated our strategy on various datasets and network architectures, demonstrating that our method successfully preserves the tumor volume while achieving comparable registration results with state-of-the-art methods. Our code is available at https://dddraxxx.github.io/Volume-Preserving-Registration/.
Submitted 9 May, 2024; v1 submitted 18 September, 2023;
originally announced September 2023.
-
Recent Advances in Direct Speech-to-text Translation
Authors:
Chen Xu,
Rong Ye,
Qianqian Dong,
Chengqi Zhao,
Tom Ko,
Mingxuan Wang,
Tong Xiao,
Jingbo Zhu
Abstract:
Recently, speech-to-text translation has attracted more and more attention and many studies have emerged rapidly. In this paper, we present a comprehensive survey on direct speech translation aiming to summarize the current state-of-the-art techniques. First, we categorize the existing research work into three directions based on the main challenges -- modeling burden, data scarcity, and application issues. To tackle the problem of modeling burden, two main structures have been proposed, encoder-decoder framework (Transformer and the variants) and multitask frameworks. For the challenge of data scarcity, recent work resorts to many sophisticated techniques, such as data augmentation, pre-training, knowledge distillation, and multilingual modeling. We analyze and summarize the application issues, which include real-time, segmentation, named entity, gender bias, and code-switching. Finally, we discuss some promising directions for future work.
Submitted 20 June, 2023;
originally announced June 2023.
-
MOSPC: MOS Prediction Based on Pairwise Comparison
Authors:
Kexin Wang,
Yunlong Zhao,
Qianqian Dong,
Tom Ko,
Mingxuan Wang
Abstract:
As a subjective metric to evaluate the quality of synthesized speech, mean opinion score (MOS) usually requires multiple annotators to score the same speech. Such an annotation approach requires a lot of manpower and is also time-consuming. A MOS prediction model for automatic evaluation can significantly reduce labor costs. In previous works, it is difficult to accurately rank the quality of speech when the MOS scores are close. However, in practical applications, it is more important to correctly rank the quality of synthesis systems or sentences than simply to predict MOS scores. Meanwhile, as each annotator scores multiple audios during annotation, the score is probably a relative value based on the first or the first few speech scores given by the annotator. Motivated by the above two points, we propose a general framework for MOS prediction based on pair comparison (MOSPC), and we utilize the C-Mixup algorithm to enhance the generalization performance of MOSPC. The experiments on BVCC and VCC2018 show that our framework outperforms the baselines on most of the correlation coefficient metrics, especially on the metric KTAU related to quality ranking. Our framework also surpasses the strong baseline in ranking accuracy on each fine-grained segment. These results indicate that our framework contributes to improving the ranking accuracy of speech quality.
Submitted 18 June, 2023;
originally announced June 2023.
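The abstract above highlights KTAU, a rank-correlation metric, as the measure most tied to quality ranking. As a rough illustration of what such a metric rewards, here is a minimal sketch of Kendall's tau in its simplest form (tau-a, which counts tied pairs as neither concordant nor discordant). The function name `kendall_tau` and the toy scores are assumptions for illustration. Published evaluations typically use a library implementation with explicit tie correction, so values can differ when ties are present.

```python
def kendall_tau(pred, truth):
    """Kendall's tau-a between predicted and reference scores.

    Counts pairs ordered the same way in both lists (concordant) minus
    pairs ordered oppositely (discordant), divided by the number of pairs.
    """
    n = len(pred)
    if n != len(truth) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = 0
    discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (pred[i] - pred[j]) * (truth[i] - truth[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Predictions that swap the two best utterances still rank the worst one correctly.
tau = kendall_tau([2.1, 3.9, 3.5], [2.0, 3.4, 3.8])
```

Only the relative order of scores matters here, which is why a model can have imperfect absolute MOS predictions and still score well on ranking.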
-
PolyVoice: Language Models for Speech to Speech Translation
Authors:
Qianqian Dong,
Zhiying Huang,
Qiao Tian,
Chen Xu,
Tom Ko,
Yunlong Zhao,
Siyuan Feng,
Tang Li,
Kexin Wang,
Xuxin Cheng,
Fengpeng Yue,
Ye Bai,
Xi Chen,
Lu Lu,
Zejun Ma,
Yuping Wang,
Mingxuan Wang,
Yuxuan Wang
Abstract:
We propose PolyVoice, a language model-based framework for speech-to-speech translation (S2ST) systems. Our framework consists of two language models: a translation language model and a speech synthesis language model. We use discretized speech units, which are generated in a fully unsupervised way, and thus our framework can be used for unwritten languages. For the speech synthesis part, we adopt the existing VALL-E X approach and build a unit-based audio language model. This grants our framework the ability to preserve the voice characteristics and the speaking style of the original speech. We examine our system on Chinese → English and English → Spanish pairs. Experimental results show that our system can generate speech with high translation quality and audio quality. Speech samples are available at https://speechtranslation.github.io/polyvoice.
Submitted 13 June, 2023; v1 submitted 5 June, 2023;
originally announced June 2023.
-
Weakly-Supervised 3D Medical Image Segmentation using Geometric Prior and Contrastive Similarity
Authors:
Hao Du,
Qihua Dong,
Yan Xu,
Jing Liao
Abstract:
Medical image segmentation is almost the most important pre-processing procedure in computer-aided diagnosis but is also a very challenging task due to the complex shapes of segments and various artifacts caused by medical imaging (i.e., low-contrast tissues and non-homogeneous textures). In this paper, we propose a simple yet effective segmentation framework that incorporates the geometric prior and contrastive similarity into the weakly-supervised segmentation framework in a loss-based fashion. The proposed geometric prior built on point cloud provides meticulous geometry to the weakly-supervised segmentation proposal, which serves as better supervision than the inherent property of the bounding-box annotation (i.e., height and width). Furthermore, we propose contrastive similarity to encourage organ pixels to gather around in the contrastive embedding space, which helps better distinguish low-contrast tissues. The proposed contrastive embedding space can make up for the poor representation of the conventionally-used gray space. Extensive experiments are conducted to verify the effectiveness and the robustness of the proposed weakly-supervised segmentation framework. The proposed framework is superior to state-of-the-art weakly-supervised methods on the following publicly accessible datasets: LiTS 2017 Challenge, KiTS 2021 Challenge, and LPBA40. We also dissect our method and evaluate the performance of each component.
Submitted 4 February, 2023;
originally announced February 2023.
-
M3ST: Mix at Three Levels for Speech Translation
Authors:
Xuxin Cheng,
Qianqian Dong,
Fengpeng Yue,
Tom Ko,
Mingxuan Wang,
Yuexian Zou
Abstract:
How to solve the data scarcity problem for end-to-end speech-to-text translation (ST)? It's well known that data augmentation is an efficient method to improve performance for many tasks by enlarging the dataset. In this paper, we propose the Mix at Three Levels for Speech Translation (M^3ST) method to increase the diversity of the augmented training corpus. Specifically, we conduct two stages of fine-tuning based on a pre-trained model using external machine translation (MT) data. In the first stage of fine-tuning, we mix the training corpus at three levels, including word level, sentence level and frame level, and fine-tune the entire model with mixed data. In the second stage of fine-tuning, we take both original speech sequences and original text sequences in parallel into the model to fine-tune the network, and use Jensen-Shannon divergence to regularize their outputs. Experiments on the MuST-C speech translation benchmark and analysis show that M^3ST outperforms current strong baselines and achieves state-of-the-art results on eight directions with an average BLEU of 29.9.
Submitted 7 December, 2022;
originally announced December 2022.
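The abstract above uses Jensen-Shannon divergence to regularize the outputs produced from speech and text inputs. As a rough illustration of that quantity, the sketch below computes the divergence between two discrete probability distributions. The function names `kl_divergence` and `js_divergence` and the toy distributions are assumptions for illustration; how the paper applies the term inside its training loss is not specified here.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions given as lists.

    Terms with p_i == 0 contribute nothing; q_i must be positive wherever
    p_i is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint m."""
    if len(p) != len(q):
        raise ValueError("distributions must have equal length")
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical output distributions over a three-token vocabulary.
speech_probs = [0.7, 0.2, 0.1]
text_probs = [0.6, 0.3, 0.1]
penalty = js_divergence(speech_probs, text_probs)
```

Unlike plain KL divergence, this measure is symmetric and bounded above by ln 2, so neither branch is treated as the fixed reference.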
-
Leveraging Pseudo-labeled Data to Improve Direct Speech-to-Speech Translation
Authors:
Qianqian Dong,
Fengpeng Yue,
Tom Ko,
Mingxuan Wang,
Qibing Bai,
Yu Zhang
Abstract:
Direct speech-to-speech translation (S2ST) has drawn more and more attention recently. The task is very challenging due to data scarcity and complex speech-to-speech mapping. In this paper, we report our recent achievements in S2ST. Firstly, we build an S2ST Transformer baseline which outperforms the original Translatotron. Secondly, we utilize the external data by pseudo-labeling and obtain a new state-of-the-art result on the Fisher English-to-Spanish test set. Indeed, we exploit the pseudo data with a combination of popular techniques which are not trivial when applied to S2ST. Moreover, we evaluate our approach on both syntactically similar (Spanish-English) and distant (English-Chinese) language pairs. Our implementation is available at https://github.com/fengpeng-yue/speech-to-speech-translation.
Submitted 18 May, 2022;
originally announced May 2022.
-
Cross-Modal ASR Post-Processing System for Error Correction and Utterance Rejection
Authors:
Jing Du,
Shiliang Pu,
Qinbo Dong,
Chao Jin,
Xin Qi,
Dian Gu,
Ru Wu,
Hongwei Zhou
Abstract:
Although modern automatic speech recognition (ASR) systems can achieve high performance, they may produce errors that weaken readers' experience and do harm to downstream tasks. To improve the accuracy and reliability of ASR hypotheses, we propose a cross-modal post-processing system for speech recognizers, which 1) fuses acoustic features and textual features from different modalities, 2) joins a confidence estimator and an error corrector in a multi-task learning fashion and 3) unifies error correction and utterance rejection modules. Compared with single-modal or single-task models, our proposed system proves to be more effective and efficient. Experimental results show that our post-processing system leads to more than 10% relative reduction of character error rate (CER) for both single-speaker and multi-speaker speech on our industrial ASR system, with about 1.7ms latency for each token, which ensures that extra latency introduced by post-processing is acceptable in streaming speech recognition.
Submitted 10 January, 2022;
originally announced January 2022.
-
Learning When to Translate for Streaming Speech
Authors:
Qianqian Dong,
Yaoming Zhu,
Mingxuan Wang,
Lei Li
Abstract:
How to find proper moments to generate partial sentence translation given a streaming speech input? Existing approaches that wait and translate for a fixed duration often break the acoustic units in speech, since the boundaries between acoustic units in speech are not even. In this paper, we propose MoSST, a simple yet effective method for translating streaming speech content. Given a usually long speech sequence, we develop an efficient monotonic segmentation module inside an encoder-decoder model to accumulate acoustic information incrementally and detect proper speech unit boundaries for the input in the speech translation task. Experiments on multiple translation directions of the MuST-C dataset show that MoSST outperforms existing methods and achieves the best trade-off between translation quality (BLEU) and latency. Our code is available at https://github.com/dqqcasia/mosst.
Submitted 22 March, 2022; v1 submitted 15 September, 2021;
originally announced September 2021.
-
Adaptive dynamic programming-based adaptive-gain sliding mode tracking control for fixed-wing UAV with disturbances
Authors:
Chaofan Zhang,
Guoshan Zhang,
Qi Dong
Abstract:
This paper proposes an adaptive dynamic programming-based adaptive-gain sliding mode control (ADP-ASMC) scheme for a fixed-wing unmanned aerial vehicle (UAV) with matched and unmatched disturbances. Starting from the dynamics of the fixed-wing UAV, a control-oriented model composed of an attitude subsystem and an airspeed subsystem is established. According to the different issues in the two subsystems, two novel adaptive-gain generalized super-twisting (AGST) algorithms are developed to eliminate the effects of disturbances in the two subsystems and make the system trajectories tend to the designed integral sliding manifolds (ISMs) in finite time. Then, based on the expected equivalent sliding-mode dynamics, the modified adaptive dynamic programming (ADP) approach with actor-critic (AC) structure is utilized to generate the nearly optimal control laws and achieve the nearly optimal performance of the sliding-mode dynamics. Furthermore, through the Lyapunov stability theorem, the tracking errors and the weight estimation errors of the two neural networks (NNs) are all shown to be uniformly ultimately bounded (UUB). Finally, comparative simulations demonstrate the superior performance of the proposed control scheme for the fixed-wing UAV.
Submitted 13 July, 2021;
originally announced July 2021.
-
The Volctrans Neural Speech Translation System for IWSLT 2021
Authors:
Chengqi Zhao,
Zhicheng Liu,
Jian Tong,
Tao Wang,
Mingxuan Wang,
Rong Ye,
Qianqian Dong,
Jun Cao,
Lei Li
Abstract:
This paper describes the systems submitted to IWSLT 2021 by the Volctrans team. We participate in the offline speech translation and text-to-text simultaneous translation tracks. For offline speech translation, our best end-to-end model achieves 8.1 BLEU improvements over the benchmark on the MuST-C test set and is even approaching the results of a strong cascade solution. For text-to-text simultaneous translation, we explore the best practice to optimize the wait-k model. As a result, our final submitted systems exceed the benchmark at around 7 BLEU on the same latency regime. We release our code and model at https://github.com/bytedance/neurst/tree/master/examples/iwslt21 to facilitate both future research works and industrial applications.
Submitted 30 June, 2021; v1 submitted 15 May, 2021;
originally announced May 2021.
-
Predicting Future Cognitive Decline with Hyperbolic Stochastic Coding
Authors:
J. Zhang,
Q. Dong,
J. Shi,
Q. Li,
C. M. Stonnington,
B. A. Gutman,
K. Chen,
E. M. Reiman,
R. J. Caselli,
P. M. Thompson,
J. Ye,
Y. Wang
Abstract:
Hyperbolic geometry has been successfully applied in modeling brain cortical and subcortical surfaces with general topological structures. However, such approaches, similar to other surface-based brain morphology analysis methods, usually generate high-dimensional features. This limits their statistical power in cognitive decline prediction research, especially in datasets with limited subject numbers. To address the above limitation, we propose a novel framework termed hyperbolic stochastic coding (HSC). Our preliminary experimental results show that our algorithm achieves superior results on various classification tasks. Our work may enrich surface-based brain imaging research tools and potentially result in a diagnostic and prognostic indicator to be useful in individualized treatment strategies.
Submitted 20 February, 2021;
originally announced February 2021.
-
NeurST: Neural Speech Translation Toolkit
Authors:
Chengqi Zhao,
Mingxuan Wang,
Qianqian Dong,
Rong Ye,
Lei Li
Abstract:
NeurST is an open-source toolkit for neural speech translation. The toolkit mainly focuses on end-to-end speech translation, which is easy to use, modify, and extend to advanced speech translation research and products. NeurST aims at facilitating the speech translation research for NLP researchers and building reliable benchmarks for this field. It provides step-by-step recipes for feature extraction, data preprocessing, distributed training, and evaluation. In this paper, we will introduce the framework design of NeurST and show experimental results for different benchmark datasets, which can be regarded as reliable baselines for future research. The toolkit is publicly available at https://github.com/bytedance/neurst/ and we will continuously update the performance of NeurST with other counterparts and studies at https://st-benchmark.github.io/.
Submitted 15 June, 2021; v1 submitted 17 December, 2020;
originally announced December 2020.
-
Consecutive Decoding for Speech-to-text Translation
Authors:
Qianqian Dong,
Mingxuan Wang,
Hao Zhou,
Shuang Xu,
Bo Xu,
Lei Li
Abstract:
Speech-to-text translation (ST), which directly translates the source language speech to the target language text, has attracted intensive attention recently. However, the combination of speech recognition and machine translation in a single model poses a heavy burden on the direct cross-modal cross-lingual mapping. To reduce the learning difficulty, we propose COnSecutive Transcription and Translation (COSTT), an integral approach for speech-to-text translation. The key idea is to generate source transcript and target translation text with a single decoder. It benefits the model training so that additional large parallel text corpus can be fully exploited to enhance the speech translation training. Our method is verified on three mainstream datasets, including Augmented LibriSpeech English-French dataset, IWSLT2018 English-German dataset, and TED English-Chinese dataset. Experiments show that our proposed COSTT outperforms or is on par with the previous state-of-the-art methods on the three datasets. We have released our code at https://github.com/dqqcasia/st.
Submitted 14 April, 2022; v1 submitted 21 September, 2020;
originally announced September 2020.
-
"Listen, Understand and Translate": Triple Supervision Decouples End-to-end Speech-to-text Translation
Authors:
Qianqian Dong,
Rong Ye,
Mingxuan Wang,
Hao Zhou,
Shuang Xu,
Bo Xu,
Lei Li
Abstract:
An end-to-end speech-to-text translation (ST) system takes audio in a source language and outputs text in a target language. Existing methods are limited by the amount of parallel data available. Can we build a system that fully utilizes the signals in a parallel ST corpus? We are inspired by the human understanding system, which is composed of auditory perception and cognitive processing. In this paper, we propose Listen-Understand-Translate (LUT), a unified framework with triple supervision signals to decouple the end-to-end speech-to-text translation task. LUT is able to guide the acoustic encoder to extract as much information as possible from the auditory input. In addition, LUT utilizes a pre-trained BERT model to force the upper encoder to produce as much semantic information as possible, without extra data. We perform experiments on a diverse set of speech translation benchmarks, including Librispeech English-French, IWSLT English-German and TED English-Chinese. Our results demonstrate that LUT achieves state-of-the-art performance, outperforming previous methods. The code is available at https://github.com/dqqcasia/st.
Submitted 5 April, 2021; v1 submitted 21 September, 2020;
originally announced September 2020.
-
Automatic Ischemic Stroke Lesion Segmentation from Computed Tomography Perfusion Images by Image Synthesis and Attention-Based Deep Neural Networks
Authors:
Guotai Wang,
Tao Song,
Qiang Dong,
Mei Cui,
Ning Huang,
Shaoting Zhang
Abstract:
Ischemic stroke lesion segmentation from Computed Tomography Perfusion (CTP) images is important for accurate diagnosis of stroke in acute care units. However, it is challenged by the low image contrast and resolution of the perfusion parameter maps, in addition to the complex appearance of the lesion. To deal with this problem, we propose a novel framework based on synthesized pseudo Diffusion-Weighted Imaging (DWI) from perfusion parameter maps to obtain better image quality for more accurate segmentation. Our framework consists of three components based on Convolutional Neural Networks (CNNs) and is trained end-to-end. First, a feature extractor is used to obtain both low-level and high-level compact representations of the raw spatiotemporal Computed Tomography Angiography (CTA) images. Second, a pseudo DWI generator takes as input the concatenation of CTP perfusion parameter maps and our extracted features to obtain the synthesized pseudo DWI. To achieve better synthesis quality, we propose a hybrid loss function that pays more attention to lesion regions and encourages high-level contextual consistency. Finally, we segment the lesion region from the synthesized pseudo DWI, where the segmentation network is based on switchable normalization and channel calibration for better performance. Experimental results showed that our framework achieved the top performance in the ISLES 2018 challenge and that: 1) our method using synthesized pseudo DWI outperformed methods segmenting the lesion from perfusion parameter maps directly; 2) the feature extractor exploiting additional spatiotemporal CTA images led to better synthesized pseudo DWI quality and higher segmentation accuracy; and 3) the proposed loss functions and network structure improved the pseudo DWI synthesis and lesion segmentation performance.
Submitted 7 July, 2020;
originally announced July 2020.
-
Time Varying Channel Tracking for Multi-UAV Wideband Communications with Beam Squint
Authors:
Jianwei Zhao,
Qi Dong,
Yanjie Zhao,
Bolei Wang,
Feifei Gao
Abstract:
Unmanned aerial vehicles (UAVs) have become an appealing solution for a wide range of commercial and civilian applications because of their high mobility and flexible deployment. Due to continuous UAV navigation, the channel between the UAV and the base station (BS) is subject to the Doppler effect. Meanwhile, when the BS is equipped with a massive number of antennas, the non-negligible propagation delay across the array aperture causes the beam squint effect. In this paper, we first investigate the channel of UAV communications under both the Doppler shift effect and the beam squint effect. Then, we design a gridless compressed sensing (GCS) based channel tracking method, where the high-dimensional uplink channel can be derived by estimating a few physical parameters such as the direction of arrival (DOA), the Doppler shift, and the complex gain. In addition, with Doppler shift reciprocity and angular reciprocity, the downlink channel can be derived from only one pilot symbol, which greatly decreases the downlink channel training overhead. Various simulation results are provided to verify the effectiveness of the proposed methods.
Submitted 21 November, 2019;
originally announced November 2019.
-
Face representation by deep learning: a linear encoding in a parameter space?
Authors:
Qiulei Dong,
Jiayin Sun,
Zhanyi Hu
Abstract:
Recently, Convolutional Neural Networks (CNNs) have achieved tremendous performance on face recognition, and one popular perspective regarding CNNs' success is that CNNs can learn discriminative face representations from face images with complex image feature encoding. However, it is still unclear what the intrinsic mechanism of face representation in CNNs is. In this work, we investigate this problem by formulating face images as points in a shape-appearance parameter space, and our results demonstrate that: (i) The encoding and decoding of the neuron responses (representations) to face images in CNNs can be achieved under a linear model in the parameter space, in agreement with the recent discovery in primate IT face neurons, but different from the aforementioned perspective on CNNs' face representation with complex image feature encoding; (ii) The linear model for face encoding and decoding in the parameter space can achieve close or even better performance on face recognition and verification than state-of-the-art CNNs, which might shed new light on design strategies for face recognition systems; (iii) The neuron responses to face images in CNNs cannot be adequately modelled by the axis model, a model recently proposed for face modelling in primate IT cortex. All these results might shed some light on the often-criticized black-box nature behind CNNs' tremendous performance on face recognition.
Submitted 22 October, 2019;
originally announced October 2019.
-
Fuzzy Logic Control of a Hybrid Energy Storage Module for Naval Pulsed Power Applications
Authors:
Isaac J. Cohen,
David A. Wetz,
Stepfanie Veiga,
Qing Dong,
John Heinzel
Abstract:
There is a need for an energy storage device capable of transferring high power in transient situations aboard naval vessels. Currently, batteries are used to accomplish this task, but previous research has shown that when utilized at high power rates, these devices deteriorate over time, causing a loss in lifespan. It has been shown that a hybrid energy storage configuration is capable of meeting such a demand while reducing the strain placed on individual components. While designing a custom converter capable of controlling the power to and from a battery would be ideal for this application, it can be costly to develop when compared to purchasing commercially available products. Commercially available products offer limited controllability in exchange for their proven performance and lower cost, often allowing only a system-level control input without any way to interface with the low-level controls that are frequently used in controller design. This paper proposes the use of fuzzy logic control to provide system-level control of the converters responsible for limiting power to and from the battery. A system will be described mathematically and modeled in MATLAB/Simulink, and a fuzzy logic controller will be compared with a typical controller.
Submitted 5 February, 2016;
originally announced February 2016.